On Mon, Oct 13, 2014 at 01:47:22PM -0400, Yuri Karaban via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=99456 >
>
> On Mon Oct 13 11:50:26 2014, ETHER wrote:
>
> > Why is this important? If the string does not actually contain any
> > non-ascii characters, the utf8 flag should not be relied upon to contain
> > anything meaningful to the end user.
>
> HTML::TokeParser decodes the HTML entity to the non-ASCII character 0xa0. When the string is Unicode, that means the U+00A0 code point (non-breaking space), but for raw octets 0xa0 does not have any particular meaning.
HTML::TokeParser shouldn't be using the is_utf8 flag to make any decisions.
It should treat every incoming string equally - either 0xa0 always means
non-breaking space, or it doesn't.
> It's not even a question of preserving the utf8 flag. HTML::TokeParser should set the utf8 flag if it has decoded HTML entities which do not map to ASCII (even if the input document was pure ASCII).
No, if it ever sets the utf8 flag, it should *always* set it, even if all
the characters fall in the ascii range. It's not correct to only set it if
some values are not ascii.
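
To illustrate why the flag is not a reliable signal, here is a minimal sketch
in plain Perl. It is not taken from HTML::TokeParser; the variable names are
made up for the example. It stores the same one-character string in both
internal representations and shows that Perl treats the two as the same string.

```perl
use strict;
use warnings;

# The same one-character string, stored two different ways internally.
my $bytes = "\xa0";      # stored as a single byte, utf8 flag off
my $wide  = "\xa0";
utf8::upgrade($wide);    # same character, now stored as UTF-8, utf8 flag on

printf "flag on \$bytes: %d, flag on \$wide: %d\n",
    utf8::is_utf8($bytes) ? 1 : 0,
    utf8::is_utf8($wide)  ? 1 : 0;

# At the Perl level the two strings are indistinguishable.
print "strings compare equal\n" if $bytes eq $wide;
print "lengths match\n"         if length($bytes) == length($wide);
```

The flag differs between the two variables, yet eq and length agree, so code
that branches on is_utf8 would handle two equal strings differently.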