
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 14212
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: dma_k [...] mail.ru
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML::TreeBuilder generates text nodes in a strange encoding
I am using perl-HTML-Tree-3.18 and have run into the following problem. When I use HTML::TreeBuilder to parse a tree that contains text like "Gebühr vor Ort von € 30,- pro Woche" (without quotes), I get the string back in a strange encoding: ü is encoded as one char, € is encoded as two chars. I think that is incorrect.
From: dma_k [...] mail.ru
In the above post the string should be read as: "Gebühr vor Ort von € 30,- pro Woche". The tree builder seems to decode the entities in the string via HTML::Entities. Is it possible to extend the tree builder with an option that skips converting the HTML entities into chars? The only way out seems to be to call encode again, but that is not pretty.
From: dma_k [...] mail.ru
The problem seems to be solved after upgrading from Perl v5.8.3 to v5.8.6.
From: john
The Debian stable folk have 5.8.4, and this bug is affecting their programs. How could they work around this problem? I'm looking at the source code and I'm tempted to comment out the HTML::Entities::encode line... but would that then create other problems?
Can't reproduce with 3.18 and up. Please resubmit with a test case if you are still having this issue. As an aside, I have added this case as a test in HTML-Tree 3.22, which will be released as part of the Chicago Hackathon this weekend.
From: dma_k [...] mail.ru
Hello! Thanks for paying attention to the (possible) problem. As I said above, some Perl installations work and some do not, and I've come to the conclusion that it's a core Perl bug with Unicode chars. What version of Perl do you use for testing? Could you please define more precisely the return value of "HTML::Entity->as_text()"? Should it return UTF-8 text? Localized text? While investigating the problem, I read http://jerakeen.org/files/2005/perl-utf8.slides.pdf -- it has a very nice chart and is worth reading. I actually found this problem while implementing an HTML parser that stores its data in a MySQL DB, and this data is supposed to be displayed as HTML again. So, in my case I used the following flow:

    my $html_root = HTML::TreeBuilder->new_from_content($contents);
    foreach ($html_root->guts()) {
        ...
        $dbh->prepare("insert into my_table (id, contents) values ($id, ?)")->execute(HTML::Entities::encode_entities($_->as_trimmed_text()));
    }

so I used the "reverse" conversion for chars. Unfortunately, I still don't have a working example that stores Unicode strings into MySQL 4.0.x from Perl so that they can be read back correctly from Java :( but that's outside the scope of the problem being discussed.
From: PETEK [...] cpan.org
On Mon Nov 13 04:50:27 2006, dma_k@mail.ru wrote:
> Finally, as I said above, some of perl installations work, some -- not,
> and I've come to the conclusion, it's a core Perl bug with unicode
> chars. What version of Perl do you use for testing?
I use Apple's Perl (5.8.6 on OSX), Debian sarge's Perl (5.8.4), and a custom Perl (5.8.2) for release testing. I do have a 5.6 install sitting around, and t/body.t fails on unicode escape tests. (I should skip those on that platform.)
> Can you please, define more precisely the return value for
> "HTML::Entity->as_text()"? Should it return the UTF-8 text? Localized
> text?
It returns the text exactly as it's contained in each HTML::Element (not HTML::Entity) and children. If that's UTF-8, Unicode, ISO-8859-1, or whatever, that's been decided by HTML::Parser. HTML::Element is just the middleman, doing simple concatenation. If you could give a test case that shows the broken behavior on your platform, I would appreciate it.
I'm not sure if this module is still being actively maintained but I am experiencing the same issues on perl 5.10. I don't know if the issue is with HTML::Element or an underlying module. Here is a test case which fails on Fedora 9 platform:

==============================
#!/usr/bin/perl
use HTML::Element;
use Test::More tests => 2;

my $test_string = 'This is a test 漢語';
like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );

my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');
like( $h->as_HTML, qr/漢語/xms, 'Found chinese chars in html output' );
========================

Running this on Fedora 9 produces the following output:

1..2
ok 1 - Found chinese chars input string
not ok 2 - Found chinese chars in html output
# Failed test 'Found chinese chars in html output'
# at ./test2.pl line 13.
# '<p>This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;
# '
# doesn't match '(?msx-i:漢語)'
# Looks like you failed 1 test of 2.
Sorry, not sure if I was experiencing the same issue as described above, but it seemed the same. Just realized that passing empty string to as_HTML solves this issue. Updated test case, which passes:

===================================
#!/usr/bin/perl
use HTML::Element;
use Test::More tests => 2;

my $test_string = 'This is a test 漢語';
like( $test_string, qr/漢語/xms, 'Found chinese chars input string' );

my $h = HTML::Element->new( 'p' );
$h->push_content('This is a test 漢語');
like( $h->as_HTML( '' ), qr/漢語/xms, 'Found chinese chars in html output' );
===================================

1..2
ok 1 - Found chinese chars input string
ok 2 - Found chinese chars in html output
From: dma_k [...] mail.ru
Using as_HTML('') is funny, because in this case you tell HTML::Element not to encode entities at all (the default should be '<>&'). Why do you expect as_HTML() to return a non-HTML-encoded string? I would use as_text() for this case. Or do you mean that as_HTML() basically does incorrect HTML-encoding for Chinese characters? Try plain with first argument, seems to be a bug, but of a different nature.
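To illustrate the difference between encoding everything unsafe and encoding only '<>&', here is a minimal sketch that calls HTML::Entities::encode_entities directly with and without its optional second argument (the set of characters to encode). It assumes that the first argument of as_HTML() ends up being used in the same way, which this thread suggests but does not confirm; the sample string is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Made-up sample: markup characters plus the two Chinese characters from
# the test case, written as escapes so this is a character string.
my $text = "AT&T <b> \x{6F22}\x{8A9E}";

# No second argument: everything outside the "safe" ASCII set is encoded.
my $all = encode_entities($text);

# Second argument '<>&': only these three characters are encoded, so the
# Chinese characters are left untouched.
my $markup_only = encode_entities($text, '<>&');

binmode(STDOUT, ':encoding(UTF-8)');
print "$all\n";          # AT&amp;T &lt;b&gt; &#x6F22;&#x8A9E;
print "$markup_only\n";  # AT&amp;T &lt;b&gt; followed by the two characters
```

With an empty set nothing would be escaped at all, including '<' and '&', which is why as_HTML('') makes the test pass but no longer yields safely encoded HTML.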
This is a bug in HTML::Entities, line 479 is encoding the Chinese characters. Adding the following debug code to HTML/Entities.pm reveals this:

    print(STDERR "1: ref = $$ref\n");
    $$ref =~ s/([^\n\r\t !\#\$%\(-;=?-~])/$char2entity{$1} || num_entity($1)/ge;
    print(STDERR "2: ref = $$ref\n");

1: ref = This is a test 漢語
2: ref = This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;

Cheers, Jeff.
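The six entities in that debug output match the six UTF-8 bytes of the two characters, which suggests the substitution ran over bytes rather than characters. A minimal sketch of that mechanism, using only core modules, follows. The numeric_entities helper is a hypothetical stand-in written for this illustration and is not the HTML::Entities implementation; it simply turns every character outside printable ASCII into a hexadecimal numeric entity.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for an entity encoder (NOT HTML::Entities itself):
# every character outside printable ASCII becomes a numeric entity.
sub numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7E])/sprintf('&#x%X;', ord $1)/ge;
    return $text;
}

# The six UTF-8 bytes of the two Chinese characters, written as escapes so
# the result does not depend on the encoding of this source file.
my $bytes = "This is a test \xE6\xBC\xA2\xE8\xAA\x9E";

# Decoding turns the six bytes into two characters, U+6F22 and U+8A9E.
my $chars = decode('UTF-8', $bytes);

my $from_bytes = numeric_entities($bytes);
my $from_chars = numeric_entities($chars);

print "$from_bytes\n";  # This is a test &#xE6;&#xBC;&#xA2;&#xE8;&#xAA;&#x9E;
print "$from_chars\n";  # This is a test &#x6F22;&#x8A9E;
```

The regex sees whatever units the string holds: one match per byte for an undecoded string, one match per character for a decoded one.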
RT-Send-CC: PETEK [...] cpan.org
From your example I can't tell whether the string you passed to HTML::Entities::encode() was a Unicode string or the undecoded UTF-8 bytes. Please try the attached test program. It prints:

# encode-test.pl:4: "This is a test \x{6F22}\x{8A9E}"
# encode-test.pl:5: "This is a test &#x6F22;&#x8A9E;"

for me, so it seems correct. If I comment out the 'use utf8;' line then the output becomes:

# encode-test.pl:4: "This is a test \xE6\xBC\xA2\xE8\xAA\x9E"
# encode-test.pl:5: "This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;"

If you get different results, please tell me what version of perl and HTML::Parser you are using. If you get the result above then I don't consider this a bug.
On Fri Jul 09 09:16:30 2010, GAAS wrote:
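For input that arrives as raw bytes (from a file or socket, where 'use utf8;' does not apply), the same two results can be reproduced by decoding explicitly before encoding entities. This is a sketch under the assumption that the input is valid UTF-8; the variable names are illustrative, and the expected strings in the comments are the ones reported in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes, as they would arrive from a file read without a decoding layer.
my $bytes = "This is a test \xE6\xBC\xA2\xE8\xAA\x9E";

# Character string: each Chinese character is now a single code point.
my $chars = decode('UTF-8', $bytes);

# Encoding the undecoded bytes treats each byte as a Latin-1 character.
my $enc_bytes = encode_entities($bytes);

# Encoding the decoded string yields one numeric entity per character.
my $enc_chars = encode_entities($chars);

print "$enc_bytes\n";  # This is a test &aelig;&frac14;&cent;&egrave;&ordf;&#158;
print "$enc_chars\n";  # This is a test &#x6F22;&#x8A9E;
```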
> Please try the attached test program. It prints:
Of course, I forgot to attach the file :-(
Subject: encode-test.pl
#use utf8;
use Data::Dump;
use HTML::Entities;
ddx $text = "This is a test 漢語";
ddx $enc = HTML::Entities::encode($text);