Bug #71533 for XML-Fast: Suffers from "The Unicode Bug"

Fri Oct 07 15:41:33 2011 IKEGAMI [...] cpan.org - Ticket created

XML::Fast suffers from "The Unicode Bug", which is to say it is sensitive to how the XML string is stored internally when it shouldn't be.

    $ perl -MEncode -MXML::Fast -e'
       $xml = encode_utf8(qq{<root>\xB2</root>});
       utf8::downgrade( $xml_d = $xml );
       utf8::upgrade( $xml_u = $xml );
       die if xml2hash($xml_d)->{root} ne xml2hash($xml_u)->{root};
    '
    Died at -e line 5.

Fri Oct 07 15:42:32 2011 IKEGAMI [...] cpan.org - Subject changed from (no value) to 'Suffers from "The Unicode Bug"'

Sat Oct 15 11:06:35 2011 MONS [...] cpan.org - Status changed from 'new' to 'open'

Sat Oct 15 11:06:40 2011 MONS [...] cpan.org - Taken

Sat Oct 15 16:46:33 2011 MONS [...] cpan.org - Correspondence added

This is not a bug in XML::Fast.
Let me explain.
The "Unicode Bug" is an unclear definition of behavior in perl itself.

XML::Fast works well and the same way with UTF8 flagged and unflagged SVs. It treats any SV just as a byte buffer.

ex:

    use Encode;
    use XML::Fast;
    my $sv = "<root>\x{456}</root>"; # String with UTF8 flag
    my $x1 = xml2hash($sv);
    Encode::_utf8_off($sv);
    my $x2 = xml2hash($sv);
    $x1->{root} eq $x2->{root} or die;

The problem is in your misunderstanding of utf8::upgrade role

    use Devel::Peek;
    my $sv = "\x{b2}";
    Dump($sv);
    utf8::upgrade($sv);
    Dump($sv);

    SV = PV(0x100801070) at 0x100826d28
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100201a80 "\262"\0
      CUR = 1
      LEN = 16
    SV = PV(0x100801070) at 0x100826d28
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100201a80 "\302\262"\0 [UTF8 "\x{b2}"]
      CUR = 2
      LEN = 16

Please note the difference in PV value.
In first case it is one byte \262, in second it is correct utf-8 sequence of 2 bytes \302\262, which represents unicode symbol U+00B2.

\262 is not a valid utf-8 sequence. So the xml parser won't parse it into U+00B2.

From perldoc utf8:

    * utf8::upgrade($string)
      Converts in-place the internal representation of the string from an
      octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-X ...

So utf8::upgrade is an equivalent of Encode::decode('latin1', $sv)

    use Devel::Peek;
    use Encode;
    my $sv = "\x{b2}";
    Dump($sv);
    $sv = Encode::decode( latin1 => $sv);
    Dump($sv);

    SV = PV(0x100801070) at 0x100826d58
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100201a80 "\262"\0
      CUR = 1
      LEN = 16
    SV = PV(0x100801070) at 0x100826d58
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100283680 "\302\262"\0 [UTF8 "\x{b2}"]
      CUR = 2
      LEN = 16

If you want \262 to be a unicode U+00B2 ("²"), your xml encoding should be latin-1, not utf-8.

    perl -MEncode -e 'decode("UTF-8", "\xb2",Encode::FB_CROAK)'
    utf8 "\xB2" does not map to Unicode at /home/mons/lib/perl5/5.13.6/darwin-2level/Encode.pm line 174.

Sat Oct 15 16:46:34 2011 MONS [...] cpan.org - Status changed from 'open' to 'rejected'

Sat Oct 15 16:48:52 2011 MONS [...] cpan.org - Reference by ticket #71532 added

Sat Oct 15 18:37:52 2011 IKEGAMI [...] cpan.org - Correspondence added

On Sat Oct 15 16:46:33 2011, MONS wrote:

> This is not a bug in XML::Fast.
> Let me explain.
> The "Unicode Bug" is an unclear definition of behavior in perl itself.

The "Unicode Bug" refers to programs that behave differently based on the internal format of a string scalar.

> XML::Fast works well and the same way with UTF8 flagged and unflagged
> SVs. It treats any SV just as a byte buffer.

I showed otherwise. It uses the internal encoding of the string rather than treating the string as a byte buffer.

> The problem is in your misunderstanding of utf8::upgrade role

upgrade and downgrade change how a string is stored, not the content of the scalar. Since the content of the two scalars is the same, XML::Fast should return the same result for both.

> Please note the difference in PV value.
> In first case it is one byte \262, in second it is correct utf-8
> sequence of 2 bytes \302\262,
> which represents unicode symbol U+00B2

Look again. In both cases, the scalar contains a string of length one: B2. Treating these two scalars differently is the very definition of the Unicode Bug. Sorry, but it's you that misunderstands.

I presume you use SvPV, in which case the fix is to use SvPVbyte.

Sat Oct 15 18:37:53 2011 The RT System itself - Status changed from 'rejected' to 'open'

Sat Oct 15 18:47:25 2011 IKEGAMI [...] cpan.org - Correspondence added

On Sat Oct 15 16:46:33 2011, MONS wrote:

> \262 is not a valid utf-8 sequence.

My test didn't use \262, it used \302\262

Sun Oct 16 05:35:31 2011 MONS [...] cpan.org - Correspondence added

On Sat Oct 15 18:47:25 2011, ikegami wrote:

> On Sat Oct 15 16:46:33 2011, MONS wrote:

> > \262 is not a valid utf-8 sequence.

> My test didn't use \262, it used \302\262

I show you: it is incorrect to do utf8::upgrade on either \262 or on \302\262:

For \302\262 the byte buffer (PV) also differs.

    perl -MEncode -MDevel::Peek -E 'Dump my $sv = encode_utf8 "\xb2"; utf8::downgrade $sv; Dump $sv; utf8::upgrade $sv; Dump $sv'

    SV = PV(0x1008011a0) at 0x100826da0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100209120 "\302\262"\0
      CUR = 2
      LEN = 16
    SV = PV(0x1008011a0) at 0x100826da0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100209120 "\302\262"\0
      CUR = 2
      LEN = 16
    SV = PV(0x1008011a0) at 0x100826da0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100209120 "\303\202\302\262"\0 [UTF8 "\x{c2}\x{b2}"]
      CUR = 4
      LEN = 16

Sun Oct 16 05:44:11 2011 MONS [...] cpan.org - Correspondence added

On Sat Oct 15 18:37:52 2011, ikegami wrote:

> On Sat Oct 15 16:46:33 2011, MONS wrote:

> > This is not a bug in XML::Fast.
> > Let me explain.
> > The "Unicode Bug" is an unclear definition of behavior in perl itself.
>
> The "Unicode Bug" refers to programs that behave differently based on
> the internal format of a string scalar.
>
> > XML::Fast works well and the same way with UTF8 flagged and unflagged
> > SVs. It treats any SV just as a byte buffer.
>
> I showed otherwise. It uses the internal encoding of the string rather
> than treating the string as a byte buffer.
>
> > The problem is in your misunderstanding of utf8::upgrade role
>
> upgrade and downgrade change how a string is stored, not the content
> of the scalar. Since the content of the two scalars is the same,
> XML::Fast should return the same result for both.
>
> > Please note the difference in PV value.
> > In first case it is one byte \262, in second it is correct utf-8
> > sequence of 2 bytes \302\262,
> > which represents unicode symbol U+00B2
>
> Look again. In both cases, the scalar contains a string of length one: B2.

No!
You pass different byte buffers to xml parser

    perl -MEncode -e' use bytes ();
       $xml = encode_utf8(qq{<root>\xB2</root>});
       utf8::downgrade( $xml_d = $xml );
       utf8::upgrade( $xml_u = $xml );
       die if bytes::length($xml_d) != bytes::length($xml_u);
    '
    Died at -e line 5.

One string is of bytes length 2, another of length 4

> Treating these two scalars differently is the very definition of the
> Unicode Bug. Sorry, but it's you that misunderstands.
>
> I presume you use SvPV, in which case the fix is to use SvPVbyte.

SvPVbyte, SvPVutf8, and SvPV return the same char* buffer: 2 bytes on the first scalar (\302\262) and 4 bytes on the second (\303\202\302\262). The first is a sequence of one unicode char, the second is a sequence of 2 unicode chars.

I see: you want me to do an implicit upgrade or downgrade on passed strings.
But I won't do it.
It will give wrong behavior on correct utf strings and dramatically slow down parsing speed, because it adds one more string pass during string recoding.

What you pass is what I parse. It is fully the job of the module user to compose a correct byte buffer.

Sun Oct 16 05:44:12 2011 MONS [...] cpan.org - Status changed from 'open' to 'rejected'

Sun Oct 16 13:03:47 2011 MONS [...] cpan.org - Correspondence added

If you want, we may talk via skype or icq about this problem
skype: inthrax
icq: 99779956

Sun Oct 16 13:03:48 2011 The RT System itself - Status changed from 'rejected' to 'open'

Sun Oct 16 17:47:06 2011 ikegami [...] adaelis.com - Correspondence added

Subject:	Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date:	Sun, 16 Oct 2011 17:46:55 -0400
To:	bug-XML-Fast [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Sun, Oct 16, 2011 at 5:35 AM, Mons Anderson via RT <bug-XML-Fast@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=71533 >
>
> On Sat Oct 15 18:47:25 2011, ikegami wrote:
> > On Sat Oct 15 16:46:33 2011, MONS wrote:
> > > \262 is not a valid utf-8 sequence.
> >
> > My test didn't use \262, it used \302\262
>
> I show you: it is incorrect to do utf8::upgrade on either \262 or on
> \302\262:

It is *never* incorrect to upgrade.

> For \302\262 the byte buffer (PV) also differs.

Yes, the internal encoding varies. Keyword: internal. If one wants to treat the string as bytes, one should use SvPVbyte.

Sun Oct 16 17:58:54 2011 ikegami [...] adaelis.com - Correspondence added

CC:	IKEGAMI [...] cpan.org
Subject:	Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date:	Sun, 16 Oct 2011 17:58:45 -0400
To:	bug-XML-Fast [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Sun, Oct 16, 2011 at 5:44 AM, Mons Anderson via RT <bug-XML-Fast@rt.cpan.org> wrote:

> > Look again. In both cases, the scalar contains a string of length one: B2.

> No!
> You pass different byte buffers to xml parser
> perl -MEncode -e' use bytes ();
>    $xml = encode_utf8(qq{<root>\xB2</root>});
>    utf8::downgrade( $xml_d = $xml );
>    utf8::upgrade( $xml_u = $xml );
>    die if bytes::length($xml_d) != bytes::length($xml_u);
> '
> Died at -e line 5.
>
> One string is of bytes length 2, another of length 4

Your proof relies on a module that's been deprecated because it suffers from the Unicode Bug! As you can see below, upgrade and downgrade do not change the content of the scalar:

    $ perl -E'
       $x = qq{\xB2};
       utf8::downgrade( $x_d = $x );  say length $x_d;
       utf8::upgrade( $x_u = $x );  say length $x_u;
       say $x_d eq $x_u ? "same" : "diff";
    '
    1
    1
    same

    $ perl -MEncode -E'
       $xml = encode_utf8(qq{<root>\xB2</root>});
       utf8::downgrade( $xml_d = $xml );  say length $xml_d;
       utf8::upgrade( $xml_u = $xml );  say length $xml_u;
       say $xml_d eq $xml_u ? "same" : "diff";
    '
    15
    15
    same

It just changes how it's stored. Any code that relies on that, by definition, suffers from The Unicode Bug we've been fixing everywhere possible.

> I see: you want me to do an implicit upgrade or downgrade on passed strings.
> But I won't do it.
> It will give wrong behavior on correct utf strings and dramatically slow
> down parsing speed

Both those claims are false.

1) It will never do the wrong behaviour. Quite the opposite: it is currently exhibiting the wrong behaviour, and this will fix it.

2) It will not slow down parsing at all. For those who pass a downgraded string, no work needs to be done. For those passing an upgraded string, it's currently not working for them at all.

- Eric
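
To make the storage-versus-content distinction concrete at the Perl level, here is a minimal sketch. It is not code from XML::Fast: the helper name octets_of is hypothetical, and it only assumes the core utf8:: functions and Encode::encode_utf8. It downgrades a copy of the string, so a byte-oriented consumer would receive the same octets whichever way perl happened to store the input.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not part of XML::Fast): return a copy of the string
# stored as plain octets, regardless of how the caller's copy is stored.
# utf8::downgrade dies if the string holds a character above 0xFF,
# i.e. if it is not a byte string at all.
sub octets_of {
    my ($s) = @_;           # work on a copy, leave the caller's scalar alone
    utf8::downgrade($s);
    return $s;
}

my $xml = encode_utf8("<root>\xB2</root>");

my $xml_d = $xml;
utf8::downgrade($xml_d);    # stored as one byte per character
my $xml_u = $xml;
utf8::upgrade($xml_u);      # same characters, stored as UTF-8 internally

printf "downgraded copy: UTF8 flag %s\n", utf8::is_utf8($xml_d) ? "on" : "off";
printf "upgraded copy:   UTF8 flag %s\n", utf8::is_utf8($xml_u) ? "on" : "off";

# After octets_of() both copies have the same storage and the same content.
my $octets = octets_of($xml_u);
printf "after octets_of: UTF8 flag %s, length %d\n",
    utf8::is_utf8($octets) ? "on" : "off", length $octets;
print octets_of($xml_d) eq octets_of($xml_u) ? "same\n" : "diff\n";
```

A caller could pass octets_of($xml) to a byte-oriented parser as a workaround; for a string that is already downgraded, the only cost is the copy.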

Fri Oct 21 03:33:15 2011 VOVKASM [...] cpan.org - Correspondence added

On Fri Oct 07 15:41:33 2011, ikegami wrote:

> XML::Fast suffers from "The Unicode Bug", which is to say it is
> sensitive to how the XML string is stored internally when it shouldn't be.

IMHO, an XML parser should simply croak when the input is an upgraded string, because the XML encoding can be determined only by the parser itself (see http://www.w3.org/TR/2008/REC-xml-20081126/#charsets and http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding).

Fri Oct 21 08:57:06 2011 ikegami [...] adaelis.com - Correspondence added

CC:	IKEGAMI [...] cpan.org
Subject:	Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date:	Fri, 21 Oct 2011 08:56:54 -0400
To:	bug-XML-Fast [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Fri, Oct 21, 2011 at 3:33 AM, Vladimir Timofeev via RT <bug-XML-Fast@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=71533 >
>
> On Fri Oct 07 15:41:33 2011, ikegami wrote:
> > XML::Fast suffers from "The Unicode Bug", which is to say it is
> > sensitive to how the XML string is stored internally when it shouldn't be.
>
> IMHO, an XML parser should simply croak when the input is an upgraded
> string, because the XML encoding can be determined only by the parser
> itself (see http://www.w3.org/TR/2008/REC-xml-20081126/#charsets and
> http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding).

I did not suggest that XML-Fast should accept decoded XML, so I don't see how this comment is relevant.

Wed Nov 20 18:11:01 2013 victor [...] vsespb.ru - Correspondence added

+1 for ikegami - this is a bug.

On Fri Oct 07 23:41:33 2011, ikegami wrote:

> XML::Fast suffers from "The Unicode Bug", which is to say it is
> sensitive to how the XML string is stored internally when it shouldn't be.
>
> $ perl -MEncode -MXML::Fast -e'
>    $xml = encode_utf8(qq{<root>\xB2</root>});
>    utf8::downgrade( $xml_d = $xml );
>    utf8::upgrade( $xml_u = $xml );
>    die if xml2hash($xml_d)->{root} ne xml2hash($xml_u)->{root};
> '
> Died at -e line 5.

Wed Nov 20 19:11:01 2013 ikegami [...] adaelis.com - Correspondence added

CC:	IKEGAMI [...] cpan.org
Subject:	Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date:	Wed, 20 Nov 2013 19:10:50 -0500
To:	bug-XML-Fast [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

The fix is simple:

     SV*
    -_xml2hash(xml,conf)
    -        char *xml;
    +_xml2hash(xml_sv,conf)
    +        SV *xml_sv;
             HV *conf;
         PROTOTYPE: $$
         CODE:
    +        SvGETMAGIC(xml_sv);
    +        char *xml = SvPVbyte_nolen(xml_sv);
    +
             SV * RV;

On Wed, Nov 20, 2013 at 6:11 PM, Victor Efimov via RT <bug-XML-Fast@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=71533 >
>
> +1 for ikegami - this is a bug.
>
> On Fri Oct 07 23:41:33 2011, ikegami wrote:
> > XML::Fast suffers from "The Unicode Bug", which is to say it is
> > sensitive to how the XML string is stored internally when it shouldn't be.
> >
> > $ perl -MEncode -MXML::Fast -e'
> >    $xml = encode_utf8(qq{<root>\xB2</root>});
> >    utf8::downgrade( $xml_d = $xml );
> >    utf8::upgrade( $xml_u = $xml );
> >    die if xml2hash($xml_d)->{root} ne xml2hash($xml_u)->{root};
> > '
> > Died at -e line 5.

Fri Sep 29 22:53:09 2017 IKEGAMI [...] cpan.org - Status changed from 'open' to 'resolved'

Fri Sep 29 22:53:09 2017 IKEGAMI [...] cpan.org - Fixed in 0.16 added