Bug #99456 for HTML-Parser: HTML::TokeParser utf8 flag is not always preserved

Mon Oct 13 04:22:13 2014 YKAR [...] cpan.org - Ticket created

Subject:

HTML::TokeParser utf8 flag is not always preserved

Please look at the test case.

Subject:

preserve_utf8.t

use strict;
use warnings;
use Test::More;
use HTML::TokeParser;

my $s = "Hello World";
utf8::upgrade($s);
my $p = HTML::TokeParser->new(\$s);
ok(utf8::is_utf8($s), 'input is utf8');
my $t = $p->get_text;
ok(utf8::is_utf8($t), 'output is utf8');
done_testing();

Mon Oct 13 11:50:26 2014 ether [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #99456] HTML::TokeParser utf8 flag is not always preserved
Date:	Mon, 13 Oct 2014 08:50:08 -0700
To:	Yuri Karaban via RT <bug-HTML-Parser [...] rt.cpan.org>
From:	Karen Etheridge <ether [...] cpan.org>

Why is this important? If the string does not actually contain any non-ascii characters, the utf8 flag should not be relied upon to contain anything meaningful to the end user.

Mon Oct 13 11:50:26 2014 The RT System itself - Status changed from 'new' to 'open'

Mon Oct 13 13:47:21 2014 YKAR [...] cpan.org - Correspondence added

RT-Send-CC:

ether [...] cpan.org

On Mon Oct 13 11:50:26 2014, ETHER wrote:

> Why is this important? If the string does not actually contain any
> non-ascii characters, the utf8 flag should not be relied upon to contain
> anything meaningful to the end user.

HTML::TokeParser decodes the HTML entity &nbsp; to the non-ASCII character 0xa0. When the string is Unicode, that means the U+00A0 code point (non-breaking space), but as a raw octet 0xa0 does not have any particular meaning. It's not even a question of preserving the utf8 flag. HTML::TokeParser should set the utf8 flag if it has decoded HTML entities which do not map to ASCII (even if the input document was pure ASCII).

Mon Oct 13 14:01:46 2014 ether [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #99456] HTML::TokeParser utf8 flag is not always preserved
Date:	Mon, 13 Oct 2014 11:01:29 -0700
To:	Yuri Karaban via RT <bug-HTML-Parser [...] rt.cpan.org>
From:	Karen Etheridge <ether [...] cpan.org>

On Mon, Oct 13, 2014 at 01:47:22PM -0400, Yuri Karaban via RT wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=99456 >
>
> On Mon Oct 13 11:50:26 2014, ETHER wrote:
> > Why is this important? If the string does not actually contain any
> > non-ascii characters, the utf8 flag should not be relied upon to contain
> > anything meaningful to the end user.
>
> HTML::TokeParser decodes HTML entity &nbsp; to non ASCII character 0xa0.
> When string is Unicode it means the U+00A0 code point (non-breaking space),
> but for raw octets 0xa0 does not have any particular meaning.

HTML::TokeParser shouldn't be using the is_utf8 flag to make any decisions. It should treat every incoming string equally - either 0xa0 always means non-breaking space, or it doesn't.

> It's not even a question of preserving utf8 flag. HTML::TokeParser should
> set utf8 flag if it has decoded HTML entities which does not map to ASCII
> (even if input document was pure ASCII).

No, if it ever sets the utf8 flag, it should *always* set it, even if all the characters fall in the ascii range. It's not correct to only set it if some values are not ascii.

Mon Oct 13 15:13:03 2014 YKAR [...] cpan.org - Correspondence added

RT-Send-CC:

ether [...] cpan.org

On Mon Oct 13 14:01:46 2014, ETHER wrote:

> HTML::TokeParser shouldn't be using the is_utf8 flag to make any
> decisions.
> It should treat every incoming string equally - either 0xa0 always
> means
> non-breaking space, or it doesn't.

I'm not asking to treat input differently depending on the utf8 flag. I'm asking to set the utf8 flag on output.

> > It's not even a question of preserving utf8 flag. HTML::TokeParser
> > should set utf8 flag if it has decoded HTML entities which does not
> > map to ASCII (even if input document was pure ASCII).
>
> No, if it ever sets the utf8 flag, it should *always* set it, even if
> all
> the characters fall in the ascii range. It's not correct to only set
> it if
> some values are not ascii.

I'm all for unconditionally setting utf8 on output. Unfortunately, HTML::TokeParser sets the utf8 flag only if the output contains characters with code points greater than 255. Example:

use feature 'say';
use HTML::TokeParser;

my $p = HTML::TokeParser->new(\'&trade;'); # U+2122
say '&trade; results in utf8' if utf8::is_utf8($p->get_text);
$p = HTML::TokeParser->new(\'&copy;'); # U+00A9
say "&copy; doesn't result in utf8" unless utf8::is_utf8($p->get_text);

Tue Oct 14 16:10:33 2014 GAAS [...] cpan.org - Correspondence added

This is done on purpose. The parser selects the most compact way to represent the strings internally. You should not attach any semantic meaning to the internal UTF8 flag.
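
The point about the internal flag can be illustrated with core Perl alone. The following is a minimal sketch, not taken from the ticket: it builds the same one-character string twice, once in the compact byte representation and once upgraded, and checks that only the flag differs. The variable names are illustrative.

```perl
use strict;
use warnings;

# Two copies of the same one-character string, U+00A9 (copyright sign).
my $compact  = "\xa9";      # stored internally as a single byte
my $upgraded = "\xa9";
utf8::upgrade($upgraded);   # same character, now stored internally as UTF-8

# The internal flag differs between the two copies...
my $flags_differ = utf8::is_utf8($upgraded) && !utf8::is_utf8($compact);

# ...but at the Perl level they are the same string.
my $strings_equal = $compact eq $upgraded;
my $same_length   = length($compact) == length($upgraded);

print "flags differ:  ", ($flags_differ  ? "yes" : "no"), "\n";
print "strings equal: ", ($strings_equal ? "yes" : "no"), "\n";
print "same length:   ", ($same_length   ? "yes" : "no"), "\n";
```

Code that compares, measures, or matches these strings sees no difference; only code that reads the raw internal buffer (as some XS modules do) can be affected by the representation.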

Wed Oct 15 04:56:15 2014 YKAR [...] cpan.org - Correspondence added

RT-Send-CC:

ether [...] cpan.org

On Tue Oct 14 16:10:33 2014, GAAS wrote:

> This is done on purpose. The parser selects the most compact
> way to represent the strings internally. You should not put
> any semantic meaning of the internal UTF8-flag.

You are both right. I'm sorry for taking your time. I wrongly identified the problem; the offending module is DBD::mysql.

mysql_enable_utf8 works in one direction only: it decodes data on the way from MySQL to the client with sv_utf8_decode, but has no effect on data going from the client to MySQL. It just uses SvPV when binding parameters, so a string that is internally encoded as UTF-8 is sent correctly, but an octet string is sent as raw octets.

Marc Lehmann has opened a ticket regarding this problem: https://rt.cpan.org/Ticket/Display.html?id=87428

Unfortunately, this problem does not have a simple solution that avoids breaking existing code. Automatically upgrading octet strings to utf8 strings would break code that passes already-encoded utf8 strings. A maintainer has suggested a new API that would encode/decode data in both directions without breaking other code, but it is still not implemented. For now, the safe but ugly way is to always call utf8::encode when binding parameters.
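
A minimal sketch of that workaround, assuming the parameters are character strings that have not already been encoded. The helper name utf8_encoded_params is hypothetical and is not part of DBI or DBD::mysql; the commented-out execute call only suggests where it might be used.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DBI or DBD::mysql): return UTF-8 encoded
# copies of the given parameters, leaving the caller's strings untouched.
sub utf8_encoded_params {
    my @copies = @_;
    for my $param (@copies) {
        utf8::encode($param) if defined $param;   # characters -> UTF-8 octets
    }
    return @copies;
}

# The same character, U+00A9, in both internal representations.
my $compact  = "\xa9";
my $upgraded = "\xa9";
utf8::upgrade($upgraded);

my ($from_compact, $from_upgraded) = utf8_encoded_params($compact, $upgraded);

# Both now hold the same two octets, whatever the original internal form was.
printf "%vx %vx\n", $from_compact, $from_upgraded;

# With a real statement handle this might be used as:
#   $sth->execute(utf8_encoded_params(@values));
```

Encoding explicitly makes the octets sent to the server independent of the internal flag. As noted above, applying this to strings that are already UTF-8 encoded would double-encode them.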

Tue Jan 19 11:54:50 2016 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'