This is definitely a bug: it makes utf8_mode useless (or, put
mildly, unreliable) for parsing HTML text encoded as UTF-8 octets.
The problem stems from the fact that utf8_mode should be able to
force HTML::Entities to always return entities as UTF-8 octets,
yet this is currently not possible. The last comment on a sister
bug 73751 came to the same conclusion:
https://rt.cpan.org/Public/Bug/Display.html?id=73751
So what happens here? Given a UTF-8 encoded text (utf8 flag off)
with some HTML entities, those entities with codepoints above 255
are correctly represented as UTF-8 sequences of octets and blend
naturally with the remaining UTF-8 text (utf8 flag still off).
However, an entity with a Unicode codepoint in the 128..255 range
ends up as a single octet (Latin-1, utf8 flag off). So when this
octet is concatenated with the rest of the text, it is no longer
distinguishable from octets in the 128..255 range that are part
of other valid UTF-8 characters.
So the resulting string is a string of octets in mixed UTF-8 and
Latin-1 encodings. Characters that came from original UTF-8 text
and those that came from entities with high Unicode codepoints
are in UTF-8, but those octets that came from entities with low
codepoints (128..255) are encoded in Latin-1. No matter which
encoding one uses to read the result, some characters are always
mangled.
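
To make the mixing concrete, here is a minimal sketch that uses only
Encode (HTML::Parser is not involved). The octet strings are hand-built
stand-ins for what the parser hands back according to the description
above, so treat them as assumptions rather than captured output:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "R" + e-acute as UTF-8 octets, as it arrives from the document text.
my $from_text   = "R\xC3\xA9ductions";
# a-grave (U+00E0) as the single Latin-1 octet that a decoded entity
# in the 128..255 range ends up as (assumed, per the description above).
my $from_entity = "\xE0";
# Euro sign (U+20AC) as UTF-8 octets, as an entity above 255 ends up.
my $from_high   = "\xE2\x82\xAC";

my $mixed = "$from_text jusqu'$from_entity -70%, 9 $from_high";

# Strict UTF-8 decoding croaks on the lone \xE0 octet. FB_CROAK may
# modify its argument, so decode a copy.
my $copy = $mixed;
my $ok   = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
print "valid UTF-8: ", ($ok ? "yes" : "no"), "\n";

# Decoding as Latin-1 never fails, but every octet becomes its own
# character, so the UTF-8 sequences turn into several wrong characters.
my $as_latin1 = decode('ISO-8859-1', $mixed);
print "octets: ", length($mixed),
      ", characters when read as Latin-1: ", length($as_latin1), "\n";
```

Neither interpretation recovers the intended text: the UTF-8 reading
rejects the string, and the Latin-1 reading splits the real UTF-8
characters apart.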
Here is another example (sorry for the sample text, taken from the
advertisement that prompted me to search for this bug report):
use Encode;
use HTML::Parser;
my $html_utf8 =
    "<p>GOOD: R\xC3\xA9ductions jusqu'\xC3\xA0 -70%. ".
    "<p>BROKEN: R&eacute;ductions jusqu'&agrave; -70%. ".
    "<p>GOOD: R&eacute;ductions jusqu'&agrave; -70%. Offerts de 9 &euro; ".
    "<p>GOOD: Offerts de 9 &euro; ";
sub html_text {
    my($self, $text) = @_;
    printf("%d %s\n", Encode::is_utf8($text), $text);
}
my $p = HTML::Parser->new(
    api_version => 3, utf8_mode => 1,
    handlers => [ text => [\&html_text, "self,dtext"] ],
);
$p->parse($html_utf8);
results in:
0 GOOD: Réductions jusqu'à -70%.
0 BROKEN: R�ductions jusqu'� -70%.
0 GOOD: Réductions jusqu'à -70%. Offerts de 9 €
0 GOOD: Offerts de 9 €
Note that leaving out a <p> tag (case #3) avoids the problem:
having the adjacent &euro; in the same paragraph forces
the &agrave; to be correctly encoded as a pair of UTF-8 octets.
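
Until utf8_mode handles this, one possible workaround is to stay out
of utf8_mode entirely: decode the input to characters first, let the
parser work on characters, and encode each text segment back to UTF-8
in the handler. This is only a sketch of that idea, not a fix for the
bug itself. The sample input and the @chunks array are made up for
illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Parser;

# Sample input as UTF-8 octets: one literal e-acute plus three entities.
my $html_utf8 = "<p>R\xC3\xA9ductions jusqu'&agrave; -70%. "
              . "<p>R&eacute;ductions, offerts de 9 &euro; ";

my @chunks;    # decoded text segments, re-encoded as UTF-8 octets

my $p = HTML::Parser->new(
    api_version => 3,
    # utf8_mode is deliberately left off: the parser sees characters.
    handlers => [
        text => [ sub { push @chunks, encode('UTF-8', $_[1]) }, "self,dtext" ],
    ],
);

# Hand the parser characters instead of octets, then flush.
$p->parse(decode('UTF-8', $html_utf8));
$p->eof;

print "$_\n" for @chunks;
```

Because encode() is applied to character strings, entities in the
128..255 range and those above 255 go through the same path and both
come out as UTF-8 octets.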