This queue is for tickets about the libwww-perl CPAN distribution.

Report information
The Basics
Id: 54361
Status: resolved
Priority: 0/
Queue: libwww-perl

People
Owner: Nobody in particular
Requestors: scop [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 5.834
Fixed in: (no value)



Subject: Head parsing does not work with all (supported) encodings
LWP's head parsing does not work with all (supported) character encodings. For example, for http://koti.welho.com/vskytta/utf16le.html (served as text/html without a charset parameter, UTF-16LE with byte-order mark, file also attached), the "Title" header does not get populated, and a warning "Parsing of undecoded UTF-16 at [...]" is emitted.

I think the same would happen if there were no BOM and the encoding were specified in the Content-Type header's charset parameter, but I have no test case available for that at the moment.

I suppose some kind of preprocessing would be needed before documents are fed to HTML::HeadParser in parse_head(). New versions of LWP seem to have the necessary functionality for this preprocessing/decoding; it should just be used here. (As an aside, I suppose HTML::(Head)Parser could just handle the BOM cases automatically, as it already recognizes the BOMs...)
Subject: utf16le.html
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Hello, Title!</title>
</head>
<body>
<p>Hello</p>
</body>
</html>
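The preprocessing suggested in the report can be sketched outside LWP. This is only an illustration, assuming the raw bytes are already in memory: it builds a stand-in for the attached file (UTF-16LE with a BOM), decodes it with Encode, and only then hands it to HTML::HeadParser. The variable names are made up for the example, and this is not how parse_head() is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::HeadParser;

# Stand-in for the attached file: UTF-16LE bytes preceded by a BOM.
my $html  = '<html><head><title>Hello, Title!</title></head>'
          . '<body><p>Hello</p></body></html>';
my $bytes = "\xFF\xFE" . encode('UTF-16LE', $html);

# 'UTF-16' (no LE/BE suffix) makes Encode read the BOM to pick the
# byte order and strip it from the result.
my $text = decode('UTF-16', $bytes);

# Feed the decoded character string to the head parser.
my $parser = HTML::HeadParser->new;
$parser->parse($text);    # returns false once the head has been parsed

my $title = $parser->header('Title');
print "Title: ", (defined $title ? $title : '(not found)'), "\n";
```

Passing `$bytes` to the parser directly, without the decode step, is the situation described in the report.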
The problem is that LWP does not provide any interface to stream out decoded data. What I would like is a handler similar to 'response_data' that provides the same data that 'decoded_content' would return. I would name this handler 'response_decoded_data'. With that, I would just make the head parser consume that stream.

The difficulty is that we need sniffing and charset guessing based on the actual data to get a practical solution. The handler driver would first have to accumulate enough data to be reasonably confident about which charset to use for decoding.

BTW, the current head parsing also gets into trouble if any Content-Encoding is applied to the response.
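The "accumulate, then decide" step could look roughly like the following. This is a hypothetical helper, not an existing LWP interface: make_sniffer and its byte limit are invented for illustration, and it only checks for UTF-8 and UTF-16 byte-order marks (no UTF-32, no <meta> charset sniffing).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of LWP: buffers raw chunks and reports a
# charset once a BOM is recognized, or 'unknown' once $limit bytes have
# been collected without one.
sub make_sniffer {
    my ($limit) = @_;
    my $buf = '';
    return sub {
        my ($chunk) = @_;
        $buf .= $chunk;
        return 'UTF-8'    if $buf =~ /\A\xEF\xBB\xBF/;
        return 'UTF-16LE' if $buf =~ /\A\xFF\xFE/;
        return 'UTF-16BE' if $buf =~ /\A\xFE\xFF/;
        return 'unknown'  if length($buf) >= $limit;
        return;           # not enough data yet
    };
}

# Usage: the BOM arrives split across two chunks.
my $sniff  = make_sniffer(1024);
my $first  = $sniff->("\xFF");             # undef, keep buffering
my $second = $sniff->("\xFE<\x00h\x00");   # BOM is now complete
print "charset: $second\n";
```

A real driver would also have to keep the buffered bytes around so they can be decoded and replayed to the head parser once the charset is known.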
I have just discovered that WWW::Mechanize appears not to notice the content of the <base ...> tag inside <head> in compressed content. That is, my Web server is serving gzipped content with <base ...> in the head, and $mech->base() is wrong, but when I reconfigure the Web server to disable gzipping, $mech->base() is correct.

I think this is perhaps related to this bug, which is why I'm commenting here rather than filing a new bug. Am I right?
I just ran into the issue with head parsing and content-encodings.

The attached simplistic patch (against LWP 5.837) fixes the issue for me, but I haven't thought much at all about possible side effects, nor whether some additional error checking should be added (e.g. against a possible undef from decoded_content?).

Anyway, here's a simple test case:

    use LWP::UserAgent;
    use HTTP::Request;
    my $ua = LWP::UserAgent->new();
    my $req = HTTP::Request->new(
        "GET", "http://qa-dev.w3.org/link-testsuite/base-3.php");
    $req->accept_decodable();
    my $res = $ua->request($req);
    print "Expected base: http://qa-dev.w3.org/link-testsuite/trap/\n";
    print "  Actual base: ", $res->base, "\n";

When it fails, $res->base ends up as http://www.w3.org/QA/Tools/ (the Content-Location header, not the <base href> from the document, as it should be).
Subject: parse_head-decoded_content.patch
--- UserAgent.pm~	2011-04-30 01:26:19.076765041 +0300
+++ UserAgent.pm	2011-05-23 22:10:42.411220038 +0300
@@ -594,10 +594,10 @@
     $parser->xml_mode(1) if $response->content_is_xhtml;
     $parser->utf8_mode(1) if $] >= 5.008 && $HTML::Parser::VERSION >= 3.40;
 
-    push(@{$response->{handlers}{response_data}}, {
+    push(@{$response->{handlers}{response_done}}, {
         callback => sub {
             return unless $parser;
-            unless ($parser->parse($_[3])) {
+            unless ($parser->parse($_[0]->decoded_content)) {
                 my $h = $parser->header;
                 my $r = $_[0];
                 for my $f ($h->header_field_names) {
On Mon May 23 15:21:54 2011, SCOP wrote:
> I just ran into the issue with head parsing and content-encodings.
>
> The attached simplistic patch (against LWP 5.837) fixes the issue for
> me, but I haven't thought much at all about possible side effects.
The most obvious problem is that this will leave no data for HTML::HeadParser to examine when content is sent to a file or a callback (that does not append the data to the response itself).