Bug #57557 for HTML-Tree: dump(*FH) method corrupts (just Hebrew?) utf8 wide characters

Mon May 17 12:15:49 2010 meir [...] guttman.co.il - Ticket created

Subject:	dump(*FH) method corrupts (just Hebrew?) utf8 wide characters
Date:	Mon, 17 May 2010 19:15:23 +0300
To:	bug-HTML-Tree [...] rt.cpan.org
From:	Meir Guttman <meir [...] guttman.co.il>

Hi folks!

I am trying to use HTML::Element->dump to dump the tree of a captured HTML download whose content contains Hebrew Unicode characters, UTF-8 encoded. When I view the Wireshark-captured stream in a Unicode-supporting text editor, I see all Hebrew characters correctly:

But when I use the HTML::Element->dump(*FH) method as follows:

    my $response = $browser->get($url_request);
    my $out_file = "spider.html";
    open (OUTFILE, ">:encoding(utf8)", $out_file)
        or die "Cannot open $out_file, $!\n";
    if ($response->is_success) {
        my $tree = HTML::TreeBuilder -> new_from_content($response->content())
            or die "*** Could not process URL";
        $tree->dump(*OUTFILE);
        $tree -> delete;
    }

then, when I view the "spider.html" file in the very same Unicode-supporting text editor, I see this:

with all Hebrew characters garbled. Please note that the "300" string of characters is shown correctly.

Does the dump method of the HTML::Element module support Unicode in general and Hebrew in particular? Am I doing something wrong?

Regards,
Meir Guttman
Ashdod, Israel

Download image003.gif
image/gif 5.5k

Download image004.gif
image/gif 6.9k

Mon May 17 12:28:33 2010 meir [...] guttman.co.il - Correspondence added

Subject:	RE: [rt.cpan.org #57557] AutoReply: dump(*FH) method corrupts (just Hebrew?) utf8 wide characters
Date:	Mon, 17 May 2010 19:28:26 +0300
To:	bug-HTML-Tree [...] rt.cpan.org
From:	Meir Guttman <meir [...] guttman.co.il>

Dear folks,

Sorry for not providing my environment details:

HTML::Tree
==========
Ver. 3.23

Perl:
====
perl, v5.10.1 built for MSWin32-x86-multi-thread (with 2 registered patches, see perl -V for more detail)
Copyright 1987-2009, Larry Wall
Binary build 1006 [291086] provided by ActiveState http://www.ActiveState.com
Built Aug 24 2009 13:48:26

Machine:
=======
OS Name  Microsoft Windows XP Professional
Version  5.1.2600 Service Pack 3 Build 2600


Tue May 18 23:21:55 2010 Jeff.Fearn [...] gmail.com - Taken

Tue May 18 23:28:26 2010 Jeff.Fearn [...] gmail.com - Correspondence added

Hi Meir,

UTF8 isn't enabled in perl by default, so the encoding is probably getting lost somewhere in the process. Have you switched utf8 on somewhere else in the script?

Could you supply a script containing just the bits you use to fetch the HTML, preferably from a public URL, and to call dump? That way I can poke around and see where it's going wrong.

Cheers,
Jeff.
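To illustrate how an encoding can get "lost" in this way, here is a minimal sketch using only core modules. It assumes the input is the UTF-8 byte sequence for a four-letter Hebrew word; the helper name `written_bytes` and the in-memory file handle are made up for the illustration and are not part of the reporter's script. If the bytes are never decoded, Perl treats each one as a separate Latin-1 character, and an `:encoding(UTF-8)` output layer then encodes each of them again.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for a four-letter Hebrew word (2 bytes per letter).
my $bytes = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

# Decoding turns the 8 bytes into 4 characters.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: write a string through an encoding layer into
# an in-memory buffer and return the bytes that were produced.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:encoding(UTF-8)', \$buffer)
        or die "Cannot open in-memory file: $!";
    print {$fh} $string;
    close($fh) or die "Cannot close in-memory file: $!";
    return $buffer;
}

# Undecoded input: every byte is encoded again as its own character.
my $double_encoded = written_bytes($bytes);

# Decoded input: each character is encoded exactly once.
my $correct = written_bytes($chars);

printf "undecoded input length: %d\n", length $bytes;           # 8
printf "decoded input length:   %d\n", length $chars;           # 4
printf "written from undecoded: %d\n", length $double_encoded;  # 16
printf "written from decoded:   %d\n", length $correct;         # 8
```

The 16-byte result is the kind of garbling a Unicode-aware editor would show for non-ASCII text, while ASCII strings such as "300" pass through unchanged because their UTF-8 and Latin-1 forms are identical.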

Tue May 18 23:28:27 2010 The RT System itself - Status changed from 'new' to 'open'

Wed May 19 09:37:51 2010 meir [...] guttman.co.il - Correspondence added

Subject:	RE: [rt.cpan.org #57557] dump(*FH) method corrupts (just Hebrew?) utf8 wide characters
Date:	Wed, 19 May 2010 16:37:40 +0300
To:	bug-HTML-Tree [...] rt.cpan.org
From:	Meir Guttman <meir [...] guttman.co.il>

Dear Jeff,

Thanks for your reply! I am attaching a zip file with three items:

* A pared-down Perl script with the dump method invoked. (It is originally a callable sub with params.)
* The file generated by this dump.
* The very same HTTP exchange, as captured and saved by Wireshark.

Please note that I declared utf8 as the script encoding and that I enabled general utf8 support for all I/O. Nevertheless, I then also enabled utf8 explicitly for the dump file. (I hope I did all this right. I have just a little over six months of Perl experience...)

The URL here is usually a lot more complex and is dynamically compiled, but even this static one is enough to show the problem.

Regards,
Meir

Download HTTP-Tree_dump.zip
application/x-zip-compressed 6.1k


Sun May 23 20:49:59 2010 Jeff.Fearn [...] gmail.com - Correspondence added

Hi Meir,

I've looked into this and noticed a few things:

A: The string returned from $response->content() is not UTF8 encoded. I checked this using Encode::is_utf8. This could be due to the source not containing the content-type meta tag.

B: The dump method uses substr, which seems to further affect the string's encoding.

C: Dump is a function to display the HTML in a truncated form for debugging purposes; you may want to use as_HTML. Note that as_HTML will try to decode all the utf8 characters unless you tell it not to.

I got it kind of working by applying the following patch, and then adding the meta tag specifying that it's UTF8 to the output.

    diff -ur HTTP-Tree_dump/Spider_tree_dump.pl HTTP-Tree_dump-test/Spider_tree_dump.pl
    --- HTTP-Tree_dump/Spider_tree_dump.pl 2010-05-19 16:14:42.000000000 +1000
    +++ HTTP-Tree_dump-test/Spider_tree_dump.pl 2010-05-24 10:12:51.116192527 +1000
    @@ -13,10 +13,12 @@
     my $out_file = "Magna_spider_testing.html";
     open (OUTFILE, ">:encoding(utf8)", $out_file)
     or die "Cannot open $out_file, $!\n"; # but here I added utf8 support explicitly
    +binmode OUTFILE;
     if ($response->is_success) {
     my $tree = HTML::TreeBuilder -> new_from_content($response->content())
     or die "*** Could not process URL";
    - $tree->dump(*OUTFILE);
    +# $tree->dump(*OUTFILE);
    + print OUTFILE $tree->as_HTML('<>&"');
     $tree -> delete;
     }
     else {

Then add the following to the output file:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

I'll take a further look when I get another chance.

Cheers,
Jeff.
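A note on the check in point A, as a hedged sketch with core modules only: Encode::is_utf8 reports whether Perl's internal UTF8 flag is set on a string, which is the case after decoding, not whether a run of bytes happens to be valid UTF-8. One way to test validity is to attempt a strict decode. The helper name `is_valid_utf8` and the sample strings below are assumptions made for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Valid UTF-8 bytes for a four-letter Hebrew word, still undecoded.
my $utf8_bytes = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

# Latin-1 bytes for a French word; not valid UTF-8.
my $latin1_bytes = "\xE9t\xE9";

# Hypothetical helper: true if the octets decode cleanly as strict UTF-8.
# The argument is copied first because decode may modify its input
# when a CHECK value is given.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# The flag is off for raw bytes, even though they are valid UTF-8.
my $flag_on_bytes = is_utf8($utf8_bytes) ? 1 : 0;

# After decoding, the string holds characters and the flag is on.
my $chars         = decode('UTF-8', $utf8_bytes);
my $flag_on_chars = is_utf8($chars) ? 1 : 0;

printf "flag on raw bytes:     %d\n", $flag_on_bytes;                 # 0
printf "raw bytes valid UTF-8: %d\n", is_valid_utf8($utf8_bytes);     # 1
printf "flag after decode:     %d\n", $flag_on_chars;                 # 1
printf "Latin-1 valid UTF-8:   %d\n", is_valid_utf8($latin1_bytes);   # 0
```

A false result from is_utf8 on the value returned by content() is therefore consistent with the server having sent well-formed UTF-8 bytes that simply have not been decoded yet.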

Fri Jun 01 00:12:40 2012 cjm [...] cpan.org - Correspondence added

This is not a bug in HTML-Tree, but rather a misunderstanding of how LWP works.

The "content" method of HTTP::Message returns BYTES, exactly as the server sent them. It might even be gzip-compressed content. In your case, it wasn't compressed, but it was UTF-8. However, HTML-Tree interpreted it as ISO-8859-1, because that's what Perl does when you give it bytes and don't tell it what encoding they are.

The solution is simple. Use "$response->decoded_content" instead of "$response->content". You should almost never use "content" when you're expecting a textual response. The "decoded_content" method uses the charset specified by the server (either in a header or in a <meta> tag) to decode the bytes into characters. You'll need HTTP::Message 5.802 or newer to get "decoded_content".

With that change, your script works fine for me.
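The difference between the two methods can be sketched without network access by building a response by hand. This is only an illustration under stated assumptions: the response body, the Content-Type header, and the variable names are invented for the example, and it presumes HTTP::Response (with decoded_content available) and HTML::TreeBuilder are installed.

```perl
use strict;
use warnings;
use HTTP::Response;
use HTML::TreeBuilder;

# A hand-built response standing in for what $browser->get would return.
# The body is UTF-8 bytes: a paragraph holding a four-letter Hebrew word.
my $body_bytes =
    "<html><body><p>\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D</p></body></html>";
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $body_bytes,
);

my $raw  = $response->content;          # bytes, exactly as "sent"
my $text = $response->decoded_content;  # characters, decoded via the charset

# Parse the decoded characters, not the raw bytes.
my $tree      = HTML::TreeBuilder->new_from_content($text);
my $paragraph = $tree->look_down(_tag => 'p')->as_text;
$tree->delete;

# Show the code points so the result is readable on any terminal.
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $paragraph;

printf "raw length:     %d\n", length $raw;
printf "decoded length: %d\n", length $text;   # 4 fewer than the raw length
print  "paragraph:      $code_points\n";       # U+05E9 U+05DC U+05D5 U+05DD
```

With the tree built from characters, writing it to a handle opened with an encoding layer, as in the reporter's script, encodes each character once instead of re-encoding individual bytes.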

Fri Jun 01 00:12:41 2012 cjm [...] cpan.org - Status changed from 'open' to 'rejected'