Bug #54361 for libwww-perl: Head parsing does not work with all (supported) encodings

Sat Feb 06 06:24:02 2010 scop [...] cpan.org - Ticket created

LWP's head parsing does not work with all (supported) character encodings. For example, for http://koti.welho.com/vskytta/utf16le.html (served as text/html without a charset parameter, UTF-16LE with byte-order mark, file also attached), the "Title" header does not get populated, but a warning "Parsing of undecoded UTF-16 at [...]" is emitted. I think the same would happen if there were no BOM but the encoding was specified in the Content-Type header's charset parameter, but I have no test case available for that at the moment.

I suppose some kind of preprocessing would be needed before docs are fed to HTML::HeadParser in parse_head(). New versions of LWP seem to have the necessary functionality for this preprocessing/decoding; it should just be used here.

(As an aside, I suppose HTML::(Head)Parser could just handle the BOM cases automatically, as it already recognizes the BOMs...)
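To illustrate the kind of preprocessing meant here, the following is a minimal sketch of BOM-based decoding using only the core Encode module. The helper name decode_by_bom and the sample input are hypothetical and not part of LWP or HTML::Parser; the sketch only covers the BOM case, not charset parameters or content sniffing.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of LWP): map a leading byte-order mark to
# an Encode charset name and return the charset plus the decoded text,
# with the BOM itself removed.
sub decode_by_bom {
    my ($bytes) = @_;

    # The UTF-32LE BOM begins with the UTF-16LE BOM, so it is checked first.
    my @boms = (
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
        [ "\xFE\xFF",         'UTF-16BE' ],
    );

    for my $entry (@boms) {
        my ($bom, $charset) = @$entry;
        next unless substr($bytes, 0, length $bom) eq $bom;
        my $body = substr($bytes, length $bom);
        return ($charset, decode($charset, $body));
    }
    return;    # no BOM: the caller has to fall back to headers or guessing
}

# Sample input: a UTF-16LE document fragment with BOM, like the attachment.
my $raw = "\xFF\xFE" . encode('UTF-16LE', '<title>Hello, Title!</title>');
my ($charset, $text) = decode_by_bom($raw);
print "$charset: $text\n";
```

The decoded text, rather than the raw bytes, is what would then be handed to the head parser.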

Subject:

utf16le.html

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<title>Hello, Title!</title>
</head>
<body>
<p>Hello</p>
</body>
</html>

Sat Feb 06 09:58:59 2010 GAAS [...] cpan.org - Correspondence added

The problem is that LWP does not provide any interface to stream out decoded data. What I would like is a handler similar to 'response_data' that provides the same data that 'decoded_content' would return. I would name this handler 'response_decoded_data'. With that I would just make the head parser consume that stream.

The problem is that we need sniffing and charset guessing based on the actual data to get a practical solution. The handler driver would have to first accumulate enough data for it to be pretty confident about what charset to use during decoding.

BTW, the current head parsing also gets into trouble if there is any Content-Encoding applied to the response.
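The accumulate-then-decode idea could be sketched roughly as follows. This is not an existing LWP interface: make_decoding_feeder, its sniff_bytes, choose_charset and on_text parameters, and the fixed 'UTF-8' choice are all assumptions made for illustration. The sketch relies on Encode's documented FB_QUIET behaviour, where the unprocessed remainder (for example a partial multi-byte sequence) is left in the input buffer.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical driver (no such handler exists in LWP): buffer raw chunks
# until at least sniff_bytes are available, pick a charset once, then hand
# decoded text to on_text chunk by chunk.
sub make_decoding_feeder {
    my (%arg) = @_;
    my $sniff_bytes = $arg{sniff_bytes} // 512;    # assumed threshold
    my $pending = '';
    my $charset;

    my $flush = sub {
        $charset //= $arg{choose_charset}->($pending);
        # FB_QUIET decodes what it can and leaves an incomplete trailing
        # sequence in $pending for the next call.
        my $text = Encode::decode($charset, $pending, Encode::FB_QUIET);
        $arg{on_text}->($text) if length $text;
    };

    return {
        feed => sub {
            $pending .= $_[0];
            $flush->() if defined($charset) || length($pending) >= $sniff_bytes;
        },
        finish => sub { $flush->() if length $pending },
    };
}

my @seen;
my $feeder = make_decoding_feeder(
    sniff_bytes    => 4,
    choose_charset => sub { 'UTF-8' },    # stand-in for real sniffing
    on_text        => sub { push @seen, $_[0] },
);

# A two-byte UTF-8 character is deliberately split across the two chunks.
$feeder->{feed}->("<p>Gr\xC3");
$feeder->{feed}->("\xBC\xC3\x9Fe</p>");
$feeder->{finish}->();

my $decoded = join '', @seen;
print length($decoded), " characters decoded\n";
```

In a real driver the on_text callback would be where the head parser consumes the stream, and choose_charset would have to combine BOM detection, the Content-Type charset parameter, and guessing from the buffered data.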

Sat Feb 06 09:59:00 2010 The RT System itself - Status changed from 'new' to 'open'

Mon Oct 18 11:53:49 2010 jik [...] kamens.brookline.ma.us - Correspondence added

I have just discovered that WWW::Mechanize appears not to correctly notice the content of the <base ...> tag inside <head> in deflated content. That is, my Web server is serving gzipped content with <base ...> in the head, and $mech->base() is wrong, but when I reconfigure the Web server to disable gzipping, $mech->base() is correct. I think this is perhaps related to this bug, which is why I'm commenting here rather than filing a new bug. Am I right?

Mon May 23 15:21:54 2011 scop [...] cpan.org - Correspondence added

I just ran into the issue with head parsing and content-encodings.

The attached simplistic patch (against LWP 5.837) fixes the issue for me, but I haven't thought much at all about possible side effects, nor whether some additional error checking should be added (e.g. against a possible undef from decoded_content?).

Anyway, here's a simple test case:

    use LWP::UserAgent;
    use HTTP::Request;
    my $ua = LWP::UserAgent->new();
    my $req = HTTP::Request->new(
        "GET", "http://qa-dev.w3.org/link-testsuite/base-3.php");
    $req->accept_decodable();
    my $res = $ua->request($req);
    print "Expected base: http://qa-dev.w3.org/link-testsuite/trap/\n";
    print " Actual base: ", $res->base, "\n";

When it fails, $res->base ends up as http://www.w3.org/QA/Tools/ (the Content-Location header, not the <base href> from the doc as it should).

Subject:

parse_head-decoded_content.patch

--- UserAgent.pm~ 2011-04-30 01:26:19.076765041 +0300
+++ UserAgent.pm 2011-05-23 22:10:42.411220038 +0300
@@ -594,10 +594,10 @@
         $parser->xml_mode(1) if $response->content_is_xhtml;
         $parser->utf8_mode(1) if $] >= 5.008 && $HTML::Parser::VERSION >= 3.40;
 
-        push(@{$response->{handlers}{response_data}}, {
+        push(@{$response->{handlers}{response_done}}, {
             callback => sub {
                 return unless $parser;
-                unless ($parser->parse($_[3])) {
+                unless ($parser->parse($_[0]->decoded_content)) {
                     my $h = $parser->header;
                     my $r = $_[0];
                     for my $f ($h->header_field_names) {

Mon May 23 16:36:14 2011 GAAS [...] cpan.org - Correspondence added

On Mon May 23 15:21:54 2011, SCOP wrote:

> I just ran into the issue with head parsing and content-encodings.
>
> The attached simplistic patch (against LWP 5.837) fixes the issue for
> me, but I haven't thought much at all about possible side effects.

The most obvious problem is that this will leave no data for HTML::HeadParser to examine when content is sent to a file or a callback (that does not append the data to the response itself).

Fri Mar 31 15:00:34 2017 olaf [...] wundersolutions.com - Correspondence added

Ticket migrated to github as https://github.com/libwww-perl/libwww-perl/issues/216

Fri Mar 31 15:00:40 2017 olaf [...] wundersolutions.com - Status changed from 'open' to 'resolved'