Bug #73751 for HTML-Parser: [HTML::Entities]the char encode of the decoded

Wed Jan 04 16:31:48 2012 alvin.rxg [...] gmail.com - Ticket created

Subject:	[HTML::Entities]the char encode of the decoded
Date:	Wed, 4 Jan 2012 22:31:40 +0100
To:	bug-HTML-Parser [...] rt.cpan.org
From:	alvin Ren <alvin.rxg [...] gmail.com>

Hello,

here I report a bug in HTML::Entities.

First of all, I didn't test all of the values... It just happened to me that the decoded HTML entity "&uuml;" cannot be shown correctly in UTF-8.

`perldoc HTML::Entities` says
------------------------------------------------------------
decode_entities( $string, ... )

This routine replaces HTML entities found in the $string with the corresponding Unicode character.
------------------------------------------------------------
and the returned character is in Latin-1 or ISO-8859, which cannot be shown correctly in UTF-8.

According to the file /usr/lib/perl5/HTML/Entities.pm
------------------------------------------------------------
line 163: # PUBLIC ISO 8879-1986//ENTITIES Added Latin 1//EN//HTML
------------------------------------------------------------
and with

decode("iso-8859-1", decode_entities("&uuml;"))

the returned value is right in UTF-8.

Please correct the problem, if I am right :)

Environment: Debian 6.0.3, perl 5.10.1, and my code file is in UTF-8.

Regards,
alvin

Mon May 07 05:40:19 2012 daxim [...] cpan.org - Correspondence added

The code is congruent with the documentation, `decode_entities` does indeed return characters.

$ perl -mHTML::Entities -E'say HTML::Entities->VERSION'
3.69
$ perl -MHTML::Entities=decode_entities -Mcharnames=:full -e'print "\N{LATIN SMALL LETTER U WITH DIAERESIS}" eq decode_entities "&uuml;"'
1

It is wrong to look at the internal representation of the string (because that's Perl's business, not the user's) and to apply `Encode::decode` to it (because these are already characters). The bug stems from your assumption that `decode_entities` returns octets and/or from (possibly unwittingly) treating the return value as octets later on, so this is purely about character semantics (which unfortunately are invisible in Perl).

Since the report jumped to conclusions, it would be nice if you appended a code example of exactly how it »happened to [you] that a decoded html code of "&uuml;" cannot be show correctly in utf8«. Perhaps this can be worked into a documentation improvement where appropriate, or perhaps it reveals a genuine bug in a different module.

PS: I am not a maintainer of this distro, therefore keeping this bug in state "opened".
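The character/octet distinction described above can be sketched as follows. This is a minimal illustration, assuming the entity in the report was `&uuml;`; the variable names are made up for the example. It shows that the decoded value is a single character (U+00FC) and that octets only appear once an encoding is chosen explicitly.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(decode_entities);

# decode_entities returns a string of characters (code points), not octets.
# Assumption: the entity from the report was "&uuml;".
my $chars = decode_entities("&uuml;");
printf "characters: %d, code point: U+%04X\n", length($chars), ord($chars);

# Octets come into existence only when an encoding is applied explicitly.
my $utf8   = encode("UTF-8",      $chars);
my $latin1 = encode("ISO-8859-1", $chars);

# Hex dump of the octets produced by each encoding.
printf "UTF-8 octets:      %s\n", unpack("H*", $utf8);
printf "ISO-8859-1 octets: %s\n", unpack("H*", $latin1);
```

The same character yields two octets in UTF-8 and one in ISO-8859-1, so the question of "which encoding" only arises on output, not in the value `decode_entities` returns.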

Mon May 07 05:40:20 2012 The RT System itself - Status changed from 'new' to 'open'

Sun May 13 08:32:27 2012 GAAS [...] cpan.org - Correspondence added

If you want UTF-8 you need to decode the string as UTF-8:

$ perl -MEncode -MHTML::Entities -le 'print encode("UTF-8", decode_entities("&uuml;"))'

ü

Sun May 13 08:32:27 2012 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'

Sun May 13 08:34:29 2012 GAAS [...] cpan.org - Correspondence added


> If you want UTF-8 you need to decode the string as UTF-8:

I meant "encode" here ;-(
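An alternative to calling `encode` by hand is to put an encoding layer on the output handle. The sketch below is an assumed reproduction of the reporter's symptom, not something confirmed in the ticket: it writes the decoded character once through a handle without a layer and once through a `:encoding(UTF-8)` layer. In-memory handles are used only so the resulting octets can be inspected; `$raw_out` and `$utf8_out` are illustrative names.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Assumption: the entity from the report was "&uuml;" (one character, U+00FC).
my $text = decode_entities("&uuml;");

# No encoding layer: Perl writes the character as the single octet 0xFC,
# which is not valid UTF-8 and shows up as garbage on a UTF-8 terminal.
open(my $raw, ">", \my $raw_out) or die "open: $!";
print {$raw} $text;
close $raw;

# UTF-8 layer: the handle encodes characters to UTF-8 on output.
open(my $enc, ">:encoding(UTF-8)", \my $utf8_out) or die "open: $!";
print {$enc} $text;
close $enc;

printf "no layer:    %s\n", unpack("H*", $raw_out);
printf "UTF-8 layer: %s\n", unpack("H*", $utf8_out);
```

For a real terminal, the equivalent would be `binmode(STDOUT, ":encoding(UTF-8)")` before printing, or the `use open qw/:std :utf8/` pragma that appears later in this ticket.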

Sun May 13 08:34:59 2012 The RT System itself - Status changed from 'rejected' to 'open'

Tue Jan 08 08:05:48 2013 victor [...] vsespb.ru - Correspondence added

From:	victor [...] vsespb.ru

So, sometimes it returns a correct (UTF-8) character string:

$ perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use HTML::Entities; $str = "&euro;&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0xd67b78) at 0xd95220
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
  CUR = 5
  LEN = 16

Sometimes an ISO-8859-1 BYTE string:

$ perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x12fcb78) at 0x132a200
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x131ab50 "\240"\0
  CUR = 1
  LEN = 8

I think that's a bug.

Tue Jan 08 09:27:23 2013 victor [...] vsespb.ru - Correspondence added

From:	victor [...] vsespb.ru

Hm.

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str); print Dump($str)'
SV = PV(0xc7fb78) at 0xca35b0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xc9db30 "\240"\0
  CUR = 1
  LEN = 8

(byte string, ISO-8859-1, correct)

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x112bb78) at 0x114f5b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1149b30 "\342\202\254"\0 [UTF8 "\x{20ac}"]
  CUR = 3
  LEN = 8

(UTF-8, correct)

$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str1 = "&nbsp;"; HTML::Entities::decode_entities( $str1); $str2 = "&euro;"; HTML::Entities::decode_entities( $str2); print Dump($str1.$str2)'
SV = PV(0x12a6b58) at 0x12ca588
  REFCNT = 1
  FLAGS = (PADTMP,POK,pPOK,UTF8)
  PV = 0x12d3eb0 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
  CUR = 5
  LEN = 8

(UTF-8, correct)

$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;&euro;"; HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x1ccbb78) at 0x1cef5b0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x1ce9b30 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
  CUR = 5
  LEN = 16

(UTF-8, correct)

It looks correct: when we concatenate a character string containing wide characters with a byte string, the byte string is treated as ISO-8859-1.

> http://perldoc.perl.org/perlunifaq.html
> What if I don't decode?
> Whenever your encoded, binary string is used together with a text string, Perl will assume that your binary string was encoded with ISO-8859-1, also known as latin-1

So it seems the internal representation of characters/bytes is correct in all cases and compatible with text processing.

However, when you parse third-party HTML you can expect Unicode there, so it would be good to have a flag/pragma which forces HTML::Entities to always return UTF-8 characters, using utf8::upgrade, to avoid The "Unicode Bug" (http://perldoc.perl.org/perlunicode.html).
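The `utf8::upgrade` suggestion can be illustrated with a short sketch that uses only core Perl. As an assumption for the example, the literal "\xA0" stands in for the result of decoding `&nbsp;`, since the Devel::Peek dumps above show that value stored without the UTF8 flag. The sketch shows that upgrading changes only the internal storage, not the string value, and that regex character classes such as `\s` can behave differently for the two forms when `unicode_strings` is not in effect.

```perl
use strict;
use warnings;

# Assumption: "\xA0" stands in for decode_entities("&nbsp;"),
# i.e. U+00A0 stored internally without the UTF8 flag.
my $nbsp = "\xA0";

my $before_flag  = utf8::is_utf8($nbsp) ? 1 : 0;
my $before_match = ($nbsp =~ /\s/)      ? 1 : 0;

# utf8::upgrade changes the internal storage only; the characters stay the same.
my $upgraded = $nbsp;
utf8::upgrade($upgraded);

my $after_flag  = utf8::is_utf8($upgraded) ? 1 : 0;
my $after_match = ($upgraded =~ /\s/)      ? 1 : 0;

print "same string value:    ", ($nbsp eq $upgraded ? "yes" : "no"), "\n";
print "UTF8 flag before/after: $before_flag/$after_flag\n";
print "matches \\s before/after: $before_match/$after_match\n";
```

The two strings compare equal, which is the sense in which the internal representation is Perl's business; the differing `\s` result is the "Unicode Bug" that the proposed flag would be meant to sidestep.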

Tue Jan 19 11:56:39 2016 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'