Hm.
$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;";
HTML::Entities::decode_entities( $str); print Dump($str)'
SV = PV(0xc7fb78) at 0xca35b0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xc9db30 "\240"\0
CUR = 1
LEN = 8
(byte string, ISO-8859-1, correct)
$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&euro;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x112bb78) at 0x114f5b0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1149b30 "\342\202\254"\0 [UTF8 "\x{20ac}"]
CUR = 3
LEN = 8
(UTF-8, correct)
$ perl -e 'use Encode; use Devel::Peek; use HTML::Entities; $str1 =
"&nbsp;"; HTML::Entities::decode_entities( $str1); $str2 = "&euro;";
HTML::Entities::decode_entities( $str2); print Dump($str1.$str2)'
SV = PV(0x12a6b58) at 0x12ca588
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK,UTF8)
PV = 0x12d3eb0 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
CUR = 5
LEN = 8
(UTF-8, correct)
$ perl -e 'use Devel::Peek; use HTML::Entities; $str = "&nbsp;&euro;";
HTML::Entities::decode_entities( $str ); print Dump($str)'
SV = PV(0x1ccbb78) at 0x1cef5b0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x1ce9b30 "\302\240\342\202\254"\0 [UTF8 "\x{a0}\x{20ac}"]
CUR = 5
LEN = 16
(UTF-8, correct)
It looks correct: when we concatenate a character string containing wide
characters with a byte string, the byte string is treated as ISO-8859-1.
> ...string, Perl will assume that your binary string was encoded with
> ISO-8859-1, also known as latin-1
So the internal representation of characters/bytes seems to be correct in
all cases and compatible with text processing.
However, when you parse third-party HTML you can expect Unicode there,
so it would be good to have a flag/pragma that forces HTML::Entities to
always return UTF-8-flagged character strings, to avoid
the "Unicode Bug"
http://perldoc.perl.org/perlunicode.html
e.g. by using utf8::upgrade
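
Until such a flag exists, something along these lines could be done on the caller's side. This is only a rough sketch: `decode_entities_upgraded` is a hypothetical helper name, not part of HTML::Entities, and it assumes that upgrading the internal representation after decoding is what is wanted.

```perl
use strict;
use warnings;
use HTML::Entities ();

# Hypothetical helper (not part of HTML::Entities): decode entities, then
# force the UTF-8 internal representation so that later operations see the
# same semantics regardless of which entities were present in the input.
sub decode_entities_upgraded {
    my ($html) = @_;
    # Called in non-void context, decode_entities returns the decoded string.
    my $text = HTML::Entities::decode_entities($html);
    # Changes only the internal encoding; the characters stay the same.
    utf8::upgrade($text);
    return $text;
}

for my $html ('&nbsp;', '&euro;', '&nbsp;&euro;') {
    my $plain    = HTML::Entities::decode_entities($html);
    my $upgraded = decode_entities_upgraded($html);
    printf "%-14s flag before=%d after=%d same_chars=%d\n",
        $html,
        utf8::is_utf8($plain)    ? 1 : 0,
        utf8::is_utf8($upgraded) ? 1 : 0,
        $plain eq $upgraded      ? 1 : 0;
}
```

The strings compare equal with `eq` before and after the upgrade; only the UTF8 flag differs, which is what matters for the operations affected by the "Unicode Bug".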
On Tue Jan 08 17:05:48 2013, vsespb wrote:
> So, sometimes it returns correct (UTF-8) character string
>
> perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
> HTML::Entities; $str = "&euro;&nbsp;";
> HTML::Entities::decode_entities( $str ); print Dump($str)'
> SV = PV(0xd67b78) at 0xd95220
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0xd85b60 "\342\202\254\302\240"\0 [UTF8 "\x{20ac}\x{a0}"]
> CUR = 5
> LEN = 16
>
> Sometimes ISO-8859-1 BYTE string
>
> perl -e 'use open qw/:std :utf8/; use Encode; use Devel::Peek; use
> HTML::Entities; $str = "&nbsp;"; HTML::Entities::decode_entities( $str
> ); print Dump($str)'
> SV = PV(0x12fcb78) at 0x132a200
> REFCNT = 1
> FLAGS = (POK,pPOK)
> PV = 0x131ab50 "\240"\0
> CUR = 1
> LEN = 8
>
> I think that's a bug.
>
> On Mon May 07 13:40:19 2012, DAXIM wrote:
> > The code is congruent with the documentation, `decode_entities` does
> > indeed return characters.
> >
> > $ perl -mHTML::Entities -E'say HTML::Entities->VERSION'
> > 3.69
> > $ perl -MHTML::Entities=decode_entities -Mcharnames=:full -e'print
> > "\N{LATIN SMALL LETTER U WITH DIAERESIS}" eq decode_entities "&uuml;"'
> > 1
> >
> > It is wrong to look at the internal representation of the string
> > (because that's Perl's business, not the user's) and `Encode::decode`
> > (because these are already characters). The bug stems from your
> > assumption that `decode_entities` returns octets and/or (possibly
> > unwittingly) treating the return value as octets later on, so this is
> > purely about character semantics (which unfortunately is invisible in
> > Perl).
> >
> > Since the report jumped to conclusions, it would be nice if you append
> > some code example of exactly how it »happened to [you] that a decoded
> > html code of "&uuml;" cannot be show correctly in utf8«. Perhaps this
> > can be worked into a documentation improvement where appropriate, or
> > perhaps it reveals a genuine bug in a different module.
> >
> > PS: I am not a maintainer of this distro, therefore keeping this bug in
> > state "opened".
>