Bug #14212 for HTML-Parser: HTML::TreeBuilder generates text nodes in a strange encoding

Wed Aug 17 10:31:39 2005 Guest - Ticket created

Subject:

HTML::TreeBuilder generates text nodes in a strange encoding

I am using perl-HTML-Tree-3.18 and have run into the following problem: when I use HTML::TreeBuilder to parse a tree that contains text like "Gebühr vor Ort von € 30,- pro Woche" (without quotes), I get the string back in a strange encoding: ü is encoded as one char, while € is encoded as two chars. I think that is incorrect.

Wed Aug 17 10:42:14 2005 Guest - Correspondence added

From:

dma_k [...] mail.ru

In the above post the string should be read as: "Geb&uuml;hr vor Ort von &euro; 30,- pro Woche". The tree builder seems to decode the entities in the string via HTML::Entities. Is it possible to extend the tree builder with an option that skips decoding the HTML entities into chars? The only way out seems to be to call encode again, but that is not pretty.

Thu Sep 08 15:32:13 2005 Guest - Correspondence added

From:

dma_k [...] mail.ru

The problem seems to be solved after upgrading from Perl v5.8.3 to v5.8.6.

Thu Oct 06 14:37:33 2005 Guest - Correspondence added

From:

john

The Debian stable folk have 5.8.4, and this bug is affecting their programs. How could they work around this problem? I'm looking at the source code and I'm tempted to comment out the HTML::Entities::encode line... but would that then create other problems?

Sat Nov 11 18:13:43 2006 PETEK [...] cpan.org - Correspondence added

Can't reproduce with 3.18 and up. Please resubmit with a test case if you are still having this issue. As an aside, I have added this case as a test in HTML-Tree 3.22, which will be released as part of the Chicago Hackathon this weekend.

Sat Nov 11 18:13:43 2006 The RT System itself - Status changed from 'new' to 'open'

Sat Nov 11 18:13:44 2006 PETEK [...] cpan.org - Status changed from 'open' to 'resolved'

Mon Nov 13 04:50:27 2006 dma_k [...] mail.ru - Correspondence added

From:

dma_k [...] mail.ru

Hello! Thanks for paying attention to the (possible) problem. As I said above, some Perl installations work and some do not, and I've come to the conclusion that it's a core Perl bug with Unicode chars. What version of Perl do you use for testing?

Can you please define more precisely the return value of "HTML::Entity->as_text()"? Should it return UTF-8 text? Localized text?

While investigating the problem, I read http://jerakeen.org/files/2005/perl-utf8.slides.pdf, which has a very nice chart. Worth reading!

Actually, I found this problem while implementing an HTML parser that stores the data in a MySQL DB, and this data is supposed to be displayed as HTML again. So, in my case I used the following flow:

    my $html_root = HTML::TreeBuilder->new_from_content($contents);
    foreach ($html_root->guts()) {
        ...
        $dbh->prepare("insert into my_table (id, contents) values ($id, ?)")->execute(HTML::Entities::encode_entities($_->as_trimmed_text()));
    }

so I used the "reverse" conversion for chars. Unfortunately, I still don't have a working example that stores Unicode strings into MySQL 4.0.x from Perl so that they can later be read correctly from Java :( but that's outside the scope of the problem being discussed.

Mon Nov 13 04:50:28 2006 The RT System itself - Status changed from 'resolved' to 'open'

Mon Nov 13 11:29:30 2006 PETEK [...] cpan.org - Correspondence added

From:

PETEK [...] cpan.org

On Mon Nov 13 04:50:27 2006, dma_k@mail.ru wrote:

> Finally, as I said above, some of perl installations work, some -- not,
> and I've come to the conclusion, it's a core Perl bug with unicode
> chars. What version of Perl do you use for testing?

I use Apple's Perl (5.8.6 on OSX), Debian sarge's Perl (5.8.4), and a custom Perl (5.8.2) for release testing. I do have a 5.6 install sitting around, and t/body.t fails on unicode escape tests. (I should skip those on that platform.)

> Can you please, define more precisely the return value for
> "HTML::Entity->as_text()"? Should it return the UTF-8 text? Localized
> text?

It returns the text exactly as it's contained in each HTML::Element (not HTML::Entity) and children. If that's UTF-8, Unicode, ISO-8859-1, or whatever, that's been decided by HTML::Parser. HTML::Element is just the middleman, doing simple concatenation. If you could give a test case that shows the broken behavior on your platform, I would appreciate it.
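Since as_text() only passes through whatever HTML::Parser produced, one way to get predictable results is to decode the input to Perl characters before parsing and to encode again only when printing. The following is a minimal sketch under that assumption, not a statement of how the reporter's code worked; the sample markup and the choice of UTF-8 as the input encoding are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::TreeBuilder;

# Assumption: the input arrives as UTF-8 encoded bytes (e.g. read from a file).
my $bytes = "<p>Geb\xC3\xBChr vor Ort von &euro; 30,- pro Woche</p>";

# Decode to Perl characters first, so the parser sees one consistent string type.
my $chars = decode('UTF-8', $bytes);

my $tree = HTML::TreeBuilder->new;
$tree->parse($chars);
$tree->eof;

# as_text() concatenates the text nodes; here they are character strings.
my $text = $tree->as_text;

# Encode explicitly at the output boundary.
binmode(STDOUT, ':raw');
print encode('UTF-8', $text), "\n";

$tree->delete;
```

With this flow both the literal ü and the &euro; entity end up as single characters in $text, and the byte representation is chosen in exactly one place.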

Mon Nov 13 11:29:47 2006 PETEK [...] cpan.org - Status changed from 'open' to 'stalled'

Mon Nov 23 18:19:09 2009 STOCKS [...] cpan.org - Correspondence added

I'm not sure if this module is still being actively maintained but I am experiencing the same issues on perl 5.10. I don't know if the issue is with HTML::Element or an underlying module. Here is a test case which fails on Fedora 9 platform:

==============================
#!/usr/bin/perl

use HTML::Element;
use Test::More tests => 2;

my $test_string = 'This is a test 漢語';
like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );

my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');

like( $h->as_HTML, qr/漢語/xms, 'Found chinese chars in html output' );
========================

Running this on Fedora 9 produces the following output:

1..2
ok 1 - Found chinese chars input string
not ok 2 - Found chinese chars in html output
# Failed test 'Found chinese chars in html output'
# at ./test2.pl line 13.
# '<p>This is a test æ¼¢èª
# '
# doesn't match '(?msx-i:漢語)'
# Looks like you failed 1 test of 2.

Mon Nov 23 18:19:10 2009 The RT System itself - Status changed from 'stalled' to 'open'

Mon Nov 23 18:28:17 2009 STOCKS [...] cpan.org - Correspondence added

Sorry, not sure if I was experiencing the same issue as described above, but it seemed the same. Just realized that passing empty string to as_HTML solves this issue. Updated test case, which passes:

===================================
#!/usr/bin/perl

use HTML::Element;
use Test::More tests => 2;

my $test_string = 'This is a test 漢語';
like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );

my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');

like( $h->as_HTML( '' ), qr/漢語/xms, 'Found chinese chars in html output' );
===================================

1..2
ok 1 - Found chinese chars input string
ok 2 - Found chinese chars in html output

Tue Nov 24 05:38:26 2009 dma_k [...] mail.ru - Correspondence added

From:

dma_k [...] mail.ru

Using as_HTML('') is funny, because in this case you tell HTML::Element not to encode entities at all (the default should be '<>&'). Why do you expect as_HTML() to return a non-HTML-encoded string? I would use as_text() for this case. Or do you mean that as_HTML() basically does incorrect HTML-encoding for Chinese characters? Try plain with first argument, seems to be a bug but of a different nature.
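To make the effect of the first argument concrete, here is a small sketch comparing the three calls discussed in this thread. It assumes the script is saved as UTF-8 and uses "use utf8" so the literal is character data; the sample text is made up for illustration.

```perl
use strict;
use warnings;
use utf8;                        # the string literal below is character data
use HTML::Element;

my $h = HTML::Element->new('p');
$h->push_content('a < b & 漢語');

my $all   = $h->as_HTML;         # no argument: HTML::Entities' default "unsafe" set
my $basic = $h->as_HTML('<>&');  # only <, > and & are turned into entities
my $none  = $h->as_HTML('');     # nothing is encoded, not even < and &

binmode(STDOUT, ':encoding(UTF-8)');
print $all, "\n", $basic, "\n", $none, "\n";
```

With no argument the Chinese characters come back as numeric entities, with '<>&' they are left alone while the markup-significant characters are still escaped, and with '' the text is emitted raw, which is why the second test case above matches but no longer produces safely escaped HTML.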

Sat Apr 24 00:17:21 2010 Jeff.Fearn [...] gmail.com - Correspondence added

This is a bug in HTML::Entities, line 479 is encoding the Chinese characters. Adding the following debug code to HTML/Entities.pm reveals this:

    print(STDERR "1: ref = $$ref\n");
    $$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
    print(STDERR "2: ref = $$ref\n");

1: ref = This is a test 漢語
2: ref = This is a test æ¼¢èª

Cheers, Jeff.

Sat Apr 24 00:18:28 2010 Jeff.Fearn [...] gmail.com - Queue changed from HTML-Tree to HTML-Parser

Fri Jul 09 09:16:30 2010 GAAS [...] cpan.org - Correspondence added

RT-Send-CC:

PETEK [...] cpan.org

From your example I can't tell if the string you passed to HTML::Entities::encode() was a Unicode string or the undecoded UTF-8 bytes. Please try the attached test program. It prints:

# encode-test.pl:4: "This is a test \x{6F22}\x{8A9E}"
# encode-test.pl:5: "This is a test 漢語"

for me, so it seems correct. If I comment out the 'use utf8;' line then the output becomes:

# encode-test.pl:4: "This is a test \xE6\xBC\xA2\xE8\xAA\x9E"
# encode-test.pl:5: "This is a test æ¼¢èª"

If you get different results, please tell me what version of perl and HTML::Parser you are using. If you get the result above then I don't consider this a bug.

Fri Jul 09 09:17:49 2010 GAAS [...] cpan.org - Correspondence added

On Fri Jul 09 09:16:30 2010, GAAS wrote:

> Please try the attached test program. It prints:

Of course, I forgot to attach the file :-(

Subject:

encode-test.pl

#use utf8;
use Data::Dump;
use HTML::Entities;
ddx $text = "This is a test 漢語";
ddx $enc = HTML::Entities::encode($text);
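The same distinction can be shown without relying on "use utf8" or on how the source file is saved. This is a rough sketch that spells the UTF-8 bytes out explicitly and decodes them with the core Encode module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# The UTF-8 bytes for U+6F22 U+8A9E, as a source file without "use utf8" supplies them.
my $bytes = "This is a test \xE6\xBC\xA2\xE8\xAA\x9E";

# The same text as a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Called in non-void context, encode_entities() returns an encoded copy.
my $from_bytes = encode_entities($bytes);   # each byte is treated as a Latin-1 character
my $from_chars = encode_entities($chars);   # each character becomes one numeric entity

print "$from_bytes\n";
print "$from_chars\n";   # This is a test &#x6F22;&#x8A9E;
```

If the first line of output is what ends up in the HTML, the string was still undecoded bytes when it reached HTML::Entities, and decoding it earlier in the program is the likely fix.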

Mon Aug 24 14:36:31 2020 olaf [...] wundersolutions.com - Correspondence added

Ticket migrated to github as https://github.com/libwww-perl/HTML-Parser/issues/10

Mon Aug 24 14:36:32 2020 olaf [...] wundersolutions.com - Status changed from 'open' to 'resolved'