This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 99755
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: KNI [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: A problem with utf8.
Hello. Why does the following code print this:

    » › <bb>

when it should print this:

    » › »

To see the difference, use a file editor rather than a UTF-8 console.

    use HTML::Parser;

    my $html = "<p>&raquo; &rsaquo;</p><p>&raquo;</p>";
    my $prs = HTML::Parser->new(api_version => 3, utf8_mode => 1);
    $prs->handler(text => sub {
        my ($prs, $text) = @_;
        print $text, "\n";
    }, "self,dtext");
    $prs->parse($html);
From: Mark.Martinec [...] ijs.si
This is definitely a bug. It makes utf8_mode useless (or, put mildly, unreliable) for parsing HTML text encoded as UTF-8 octets. The problem stems from the fact that utf8_mode should be able to force HTML::Entities to always return entities as UTF-8 octets, yet this is currently not possible. The last comment on sister bug 73751 came to the same conclusion:

    https://rt.cpan.org/Public/Bug/Display.html?id=73751

So what happens here? Given a UTF-8 encoded text (utf8 flag off) with some HTML entities, entities with a codepoint above 255 are correctly represented as a UTF-8 sequence of octets and blend naturally with the remaining UTF-8 text (utf8 flag still off). However, an entity with a Unicode codepoint in the 128..255 range ends up as a single octet (Latin-1, utf8 flag off). Once this octet is concatenated with the rest of the text, it is no longer distinguishable from octets in the 128..255 range that are part of other valid UTF-8 characters.

The result is a string of octets in mixed UTF-8 and Latin-1 encodings. Characters that came from the original UTF-8 text and those that came from entities with high Unicode codepoints are in UTF-8, but octets that came from entities with low codepoints are encoded in Latin-1. No matter which encoding one uses to view the result, some characters are always mangled.

Here is another example (sorry for the sample text, taken from the advertisement that prompted me to search for this bug report):

    use Encode;
    use HTML::Parser;

    my $html_utf8 =
      "<p>GOOD: R\xC3\xA9ductions jusqu'&agrave; -70%. ".
      "<p>BROKEN: R&eacute;ductions jusqu'&agrave; -70%. ".
      "<p>GOOD: R&eacute;ductions jusqu'&agrave; -70%. Offerts de 9 &euro; ".
      "<p>GOOD: Offerts de 9 &euro; ";

    sub html_text {
      my($self, $text) = @_;
      printf("%d %s\n", Encode::is_utf8($text), $text);
    }

    my $p = HTML::Parser->new(
      api_version => 3,
      utf8_mode => 1,
      handlers => [ text => [\&html_text, "self,dtext"] ],
    );
    $p->parse($html_utf8);

results in:

    0 GOOD: Réductions jusqu'à -70%.
    0 BROKEN: R�ductions jusqu'� -70%.
    0 GOOD: Réductions jusqu'à -70%. Offerts de 9 €
    0 GOOD: Offerts de 9 €

Note that leaving out a <P> tag (case #3) avoids the problem: having the adjacent &euro; in the same paragraph forces the &agrave; to be correctly encoded as a pair of UTF-8 octets.
I agree that this is a bug.  I have not had time to look at it yet. Patches welcome.
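For readers following along, the mixed-encoding effect described in the analysis above can be reproduced at the octet level with the core Encode module alone. This is only a sketch of the symptom, not of HTML::Parser's internals: the octet strings are built by hand to mimic what the report describes, and is_valid_utf8 is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if the octet string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# &rsaquo; is U+203A (above 255), &raquo; is U+00BB (in the 128..255 range).
my $rsaquo_utf8  = encode('UTF-8', "\x{203A}");  # three octets: E2 80 BA
my $raquo_utf8   = encode('UTF-8', "\x{BB}");    # two octets:   C2 BB
my $raquo_latin1 = "\xBB";                       # one octet:    BB

# Assumed shape of the buggy result: UTF-8 and Latin-1 octets side by side.
my $mixed = $rsaquo_utf8 . " " . $raquo_latin1;
# What a consistent UTF-8 result would look like.
my $clean = $rsaquo_utf8 . " " . $raquo_utf8;

printf "mixed: %s valid=%d\n", unpack("H*", $mixed), is_valid_utf8($mixed);
printf "clean: %s valid=%d\n", unpack("H*", $clean), is_valid_utf8($clean);
# prints:
#   mixed: e280ba20bb valid=0
#   clean: e280ba20c2bb valid=1
```

The lone BB octet in the mixed string is a UTF-8 continuation byte with no lead byte, which is why a strict decoder rejects it and why a UTF-8 viewer shows a replacement character in its place.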
Now fixed in commit ac31d36a16