Bug #99755 for HTML-Parser: A problem with utf8.

Thu Oct 23 08:47:06 2014 KNI [...] cpan.org - Ticket created

Subject:

A problem with utf8.

Hello. Why does the following code print this:

    » › <bb>

when it should print this:

    » › »

To see the difference, use a file editor, not a UTF-8 console.

    use HTML::Parser;

    my $html = "» &rsaquo;»";
    my $prs = HTML::Parser->new(api_version => 3, utf8_mode => 1);
    $prs->handler(text => sub {
        my ($prs, $text) = @_;
        print $text, "\n";
    }, "self,dtext");
    $prs->parse($html);
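One way to compare the two outputs without depending on how an editor or console renders them is to dump the octets as hex. The following is a minimal sketch using only core Perl. The helper name `hex_octets` is hypothetical, and the two byte strings are typed in by hand from the outputs quoted in this report (assuming `<bb>` stands for a lone 0xBB octet); they are not produced by HTML::Parser here.

```perl
use strict;
use warnings;

# Render a byte string as space-separated hex octets, so the encoding is
# visible regardless of how a terminal or editor interprets it.
# (hex_octets is a hypothetical helper name, not part of any module.)
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Octets matching the reported output: UTF-8 for U+00BB and U+203A,
# followed by a lone Latin-1 octet 0xBB.
my $reported = "\xC2\xBB \xE2\x80\xBA \xBB";

# Octets matching the expected output: all three characters as UTF-8.
my $expected = "\xC2\xBB \xE2\x80\xBA \xC2\xBB";

print hex_octets($reported), "\n";   # c2 bb 20 e2 80 ba 20 bb
print hex_octets($expected), "\n";   # c2 bb 20 e2 80 ba 20 c2 bb
```

The same helper could be called on `$text` inside the text handler to see exactly which octets the parser hands back.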

Mon Feb 09 14:39:12 2015 Mark.Martinec [...] ijs.si - Correspondence added

From:

Mark.Martinec [...] ijs.si

This is definitely a bug: it makes utf8_mode useless (or, put mildly, unreliable) for parsing HTML text encoded as UTF-8 octets.

The problem stems from the fact that utf8_mode should be able to force HTML::Entities to always return entities as UTF-8 octets, yet this is currently not possible. The last comment on sister bug 73751 came to the same conclusion:

    https://rt.cpan.org/Public/Bug/Display.html?id=73751

So what happens here? Given UTF-8 encoded text (utf8 flag off) containing some HTML entities, entities with a code point above 255 are correctly represented as UTF-8 octet sequences and blend naturally with the remaining UTF-8 text (utf8 flag still off). However, an entity with a Unicode code point in the 128..255 range ends up as a single octet (Latin-1, utf8 flag off). Once this octet is concatenated with the rest of the text, it is no longer distinguishable from octets in the 128..255 range that are part of other valid UTF-8 characters.

The result is a string of octets in mixed UTF-8 and Latin-1 encodings. Characters that came from the original UTF-8 text and those that came from entities with high code points are in UTF-8, but octets that came from entities with low code points are in Latin-1. No matter which encoding one uses to look at the result, some characters are always mangled.

Here is another example (sorry for the sample text, taken from the advertisement that prompted me to search for this bug report):

    use Encode;
    use HTML::Parser;

    my $html_utf8 =
      "GOOD: R\xC3\xA9ductions jusqu'à -70%. ".
      "BROKEN: Réductions jusqu'à -70%. ".
      "GOOD: Réductions jusqu'à -70%. Offerts de 9 € ".
      "GOOD: Offerts de 9 € ";

    sub html_text {
      my($self, $text) = @_;
      printf("%d %s\n", Encode::is_utf8($text), $text);
    }

    my $p = HTML::Parser->new(
      api_version => 3,
      utf8_mode => 1,
      handlers => [ text => [\&html_text, "self,dtext"] ],
    );
    $p->parse($html_utf8);

results in:

    0 GOOD: Réductions jusqu'à -70%.
    0 BROKEN: R�ductions jusqu'� -70%.
    0 GOOD: Réductions jusqu'à -70%. Offerts de 9 €
    0 GOOD: Offerts de 9 €

Note that leaving out a tag (case #3) avoids the problem: having the adjacent € in the same paragraph forces the à to be correctly encoded as a pair of UTF-8 octets.
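The mixing described above can be reproduced without HTML::Parser at all. The following is a minimal sketch using only the core Encode module. It builds the two strings by hand, on the assumption that a low code point entity arrives as a single Latin-1 octet, as this report describes. The helper name `is_valid_utf8` and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return 1 if the byte string decodes strictly as UTF-8, 0 otherwise.
# (is_valid_utf8 is a hypothetical helper name.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# U+00E9 as the report says it arrives from an entity in the 128..255
# range: a single Latin-1 octet, utf8 flag off.
my $entity_latin1 = chr(0xE9);

# The same code point as utf8_mode would need it: two UTF-8 octets.
my $entity_utf8 = encode('UTF-8', chr(0xE9));

# The surrounding text is already UTF-8 octets ("\xC3\xA0" is U+00E0).
my $mixed = "R" . $entity_latin1 . "ductions jusqu'\xC3\xA0";
my $clean = "R" . $entity_utf8   . "ductions jusqu'\xC3\xA0";

printf "mixed valid: %d\n", is_valid_utf8($mixed);   # mixed valid: 0
printf "clean valid: %d\n", is_valid_utf8($clean);   # clean valid: 1
```

The mixed string is neither valid UTF-8 nor plain Latin-1, which is why the BROKEN line cannot be displayed correctly in either encoding.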

Mon Feb 09 14:39:12 2015 The RT System itself - Status changed from 'new' to 'open'

Tue Feb 10 14:36:10 2015 GAAS [...] cpan.org - Correspondence added

I agree that this is a bug. I have not had time to look at it yet. Patches welcome.

Tue Jan 19 11:49:16 2016 GAAS [...] cpan.org - Correspondence added

Now fixed in commit ac31d36a16

Tue Jan 19 11:49:18 2016 GAAS [...] cpan.org - Status changed from 'open' to 'resolved'