
This queue is for tickets about the XML-Fast CPAN distribution.

Report information
The Basics
Id: 71533
Status: resolved
Priority: 0/
Queue: XML-Fast

People
Owner: MONS [...] cpan.org
Requestors: IKEGAMI [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 0.16



XML::Fast suffers from "The Unicode Bug", which is to say it is sensitive to how the XML string is stored internally when it shouldn't be.

    $ perl -MEncode -MXML::Fast -e'
       $xml = encode_utf8(qq{<root>\xB2</root>});
       utf8::downgrade( $xml_d = $xml );
       utf8::upgrade( $xml_u = $xml );
       die if xml2hash($xml_d)->{root} ne xml2hash($xml_u)->{root};
    '
    Died at -e line 5.
This is not a bug in XML::Fast. Let me explain.

The "Unicode Bug" is an unclear definition of behavior in perl itself. XML::Fast works well and the same way with UTF8 flagged and unflagged SVs. It treats any SV just as a byte buffer. For example:

    use Encode;
    use XML::Fast;
    my $sv = "<root>\x{456}</root>"; # String with UTF8 flag
    my $x1 = xml2hash($sv);
    Encode::_utf8_off($sv);
    my $x2 = xml2hash($sv);
    $x1->{root} eq $x2->{root} or die

The problem is your misunderstanding of the role of utf8::upgrade:

    use Devel::Peek;
    my $sv = "\x{b2}";
    Dump($sv);
    utf8::upgrade($sv);
    Dump($sv);

    SV = PV(0x100801070) at 0x100826d28
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100201a80 "\262"\0
      CUR = 1
      LEN = 16
    SV = PV(0x100801070) at 0x100826d28
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100201a80 "\302\262"\0 [UTF8 "\x{b2}"]
      CUR = 2
      LEN = 16

Please note the difference in the PV value. In the first case it is the single byte \262; in the second it is the correct two-byte UTF-8 sequence \302\262, which represents the Unicode symbol U+00B2.

\262 is not a valid UTF-8 sequence, so the XML parser won't parse it into U+00B2.

From perldoc utf8:

    * utf8::upgrade($string)
      Converts in-place the internal representation of the string from an
      octet sequence in the native encoding (Latin-1 or EBCDIC) to UTF-X ...

So utf8::upgrade is equivalent to Encode::decode('latin1', $sv):

    use Devel::Peek;
    use Encode;
    my $sv = "\x{b2}";
    Dump($sv);
    $sv = Encode::decode( latin1 => $sv);
    Dump($sv);

    SV = PV(0x100801070) at 0x100826d58
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100201a80 "\262"\0
      CUR = 1
      LEN = 16
    SV = PV(0x100801070) at 0x100826d58
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100283680 "\302\262"\0 [UTF8 "\x{b2}"]
      CUR = 2
      LEN = 16

If you want \262 to be Unicode U+00B2 ("²"), your XML encoding should be latin-1, not utf-8.

    perl -MEncode -e 'decode("UTF-8", "\xb2",Encode::FB_CROAK)'
    utf8 "\xB2" does not map to Unicode at /home/mons/lib/perl5/5.13.6/darwin-2level/Encode.pm line 174.
On Sat Oct 15 16:46:33 2011, MONS wrote:
> This is not a bug in XML::Fast.
> Let me explain.
> The "Unicode Bug" is an unclear definition of behavior in perl itself.

The "Unicode Bug" refers to programs that behave differently based on the internal format of a string scalar.

> XML::Fast works well and the same way with UTF8 flagged and unflagged
> SVs. It treats any SV just as a byte buffer.

I showed otherwise. It uses the internal encoding of the string rather than treating the string as a byte buffer.

> The problem is your misunderstanding of the role of utf8::upgrade

upgrade and downgrade change how a string is stored, not the content of the scalar. Since the content of the two scalars is the same, XML::Fast should return the same result for both.

> Please note the difference in the PV value.
> In the first case it is the single byte \262; in the second it is the
> correct two-byte UTF-8 sequence \302\262,
> which represents the Unicode symbol U+00B2

Look again. In both cases, the scalar contains a string of length one: B2. Treating these two scalars differently is the very definition of the Unicode Bug. Sorry, but it's you that misunderstands.

I presume you use SvPV, in which case the fix is to use SvPVbyte.
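To illustrate the SvPVbyte suggestion above, here is a minimal pure-Perl sketch of what "read the scalar as bytes" means. The helper name `bytes_of` is hypothetical and not part of XML::Fast; it only mimics, at the Perl level, what SvPVbyte does in XS: downgrade a copy of the string, and fail if the string holds characters above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::Fast: a pure-Perl analogue of
# reading a scalar with SvPVbyte. It returns the string's characters as
# bytes, however perl happens to store the scalar internally.
sub bytes_of {
    my ($string) = @_;    # work on a copy; the caller's scalar is untouched
    # With a true second argument, utf8::downgrade returns false instead
    # of dying when a character above 0xFF makes downgrading impossible.
    utf8::downgrade($string, 1)
        or die "Wide character in input; expected encoded bytes\n";
    return $string;
}

# The ticket's test data: UTF-8 encoded XML, stored in both internal formats.
my $xml = "<root>\xC2\xB2</root>";
my $xml_d = $xml; utf8::downgrade($xml_d);
my $xml_u = $xml; utf8::upgrade($xml_u);

for my $input ($xml_d, $xml_u) {
    my $bytes = bytes_of($input);
    printf "flag in: %d, flag out: %d, bytes: %d\n",
        utf8::is_utf8($input) ? 1 : 0,
        utf8::is_utf8($bytes) ? 1 : 0,
        length $bytes;
}
```

Both inputs come out as the same 15 unflagged bytes, so a parser that reads its input this way cannot tell the two storage formats apart.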
On Sat Oct 15 16:46:33 2011, MONS wrote:
> \262 is not a valid utf-8 sequence.
My test didn't use \262, it used \302\262
On Sat Oct 15 18:47:25 2011, ikegami wrote:
> On Sat Oct 15 16:46:33 2011, MONS wrote:
> > \262 is not a valid utf-8 sequence.
>
> My test didn't use \262, it used \302\262

Let me show you: it is incorrect to call utf8::upgrade on either \262 or \302\262. For \302\262 the byte buffer (PV) also differs.

    perl -MEncode -MDevel::Peek -E 'Dump my $sv = encode_utf8 "\xb2"; utf8::downgrade $sv; Dump $sv; utf8::upgrade $sv; Dump $sv'

    SV = PV(0x1008011a0) at 0x100826da0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100209120 "\302\262"\0
      CUR = 2
      LEN = 16
    SV = PV(0x1008011a0) at 0x100826da0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK)
      PV = 0x100209120 "\302\262"\0
      CUR = 2
      LEN = 16
    SV = PV(0x1008011a0) at 0x100826da0
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100209120 "\303\202\302\262"\0 [UTF8 "\x{c2}\x{b2}"]
      CUR = 4
      LEN = 16
On Sat Oct 15 18:37:52 2011, ikegami wrote:
> On Sat Oct 15 16:46:33 2011, MONS wrote:
> > This is not a bug in XML::Fast.
> > Let me explain.
> > The "Unicode Bug" is an unclear definition of behavior in perl itself.
>
> The "Unicode Bug" refers to programs that behave differently based on
> the internal format of a string scalar.
>
> > XML::Fast works well and the same way with UTF8 flagged and unflagged
> > SVs. It treats any SV just as a byte buffer.
>
> I showed otherwise. It uses the internal encoding of the string rather
> than treating the string as a byte buffer.
>
> > The problem is your misunderstanding of the role of utf8::upgrade
>
> upgrade and downgrade change how a string is stored, not the content
> of the scalar. Since the content of the two scalars is the same,
> XML::Fast should return the same result for both.
>
> > Please note the difference in the PV value.
> > In the first case it is the single byte \262; in the second it is the
> > correct two-byte UTF-8 sequence \302\262,
> > which represents the Unicode symbol U+00B2
>
> Look again. In both cases, the scalar contains a string of length one: B2.
No! You pass different byte buffers to the XML parser:

    perl -MEncode -e' use bytes ();
    $xml = encode_utf8(qq{<root>\xB2</root>});
    utf8::downgrade( $xml_d = $xml );
    utf8::upgrade( $xml_u = $xml );
    die if bytes::length($xml_d) != bytes::length($xml_u);
    '
    Died at -e line 5.

One string has byte length 2, the other has byte length 4.

> Treating these two scalars differently is the very definition of the
> Unicode Bug. Sorry, but it's you that misunderstands.
>
> I presume you use SvPV, in which case the fix is to use SvPVbyte.

SvPVbyte, SvPVutf8, and SvPV return the same char* buffer: 2 bytes for the first scalar (\302\262) and 4 bytes for the second (\303\202\302\262). The first is a sequence of one Unicode character; the second is a sequence of two Unicode characters.

I see: you want me to do an implicit upgrade or downgrade on passed strings. But I won't do it. It would give wrong behavior on correct UTF strings and dramatically slow down parsing, because it adds one more pass over the string to recode it.

What you pass is what I parse. It is entirely the job of the module user to compose a correct byte buffer.
If you want, we may talk via skype or icq about this problem skype: inthrax icq: 99779956
Subject: Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date: Sun, 16 Oct 2011 17:46:55 -0400
To: bug-XML-Fast [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
On Sun, Oct 16, 2011 at 5:35 AM, Mons Anderson via RT <bug-XML-Fast@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=71533 >
>
> On Sat Oct 15 18:47:25 2011, ikegami wrote:
> > On Sat Oct 15 16:46:33 2011, MONS wrote:
> > > \262 is not a valid utf-8 sequence.
> >
> > My test didn't use \262, it used \302\262
>
> I show you: it is incorrect to do utf8::upgrade on either \262 or on
> \302\262:

It is *never* incorrect to upgrade.

> For \302\262 the byte buffer (PV) also differs.

Yes, the internal encoding varies. Keyword: internal. If you want to treat the string as bytes, you should use SvPVbyte.
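The distinction between a string's content and its internal buffer can be made concrete with a short sketch. The helper `internal_buffer` below is hypothetical and written only for illustration; it reconstructs the bytes perl keeps in the scalar's PV (the value Devel::Peek prints) using nothing but the built-in utf8:: functions, on the assumption that an upgraded scalar is stored as the UTF-8 encoding of its characters.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: reconstruct the bytes perl holds
# in the scalar's internal buffer (what Devel::Peek shows as PV).
sub internal_buffer {
    my ($string) = @_;    # copy, so the caller's scalar is not modified
    # An upgraded string is stored as the UTF-8 encoding of its characters;
    # a downgraded string is stored as one byte per character.
    utf8::encode($string) if utf8::is_utf8($string);
    return $string;
}

my $content = "\xC2\xB2";    # the UTF-8 encoding of U+00B2, as two bytes
my $down = $content; utf8::downgrade($down);
my $up   = $content; utf8::upgrade($up);

printf "length:   %d vs %d\n", length($down), length($up);
printf "equal:    %s\n", $down eq $up ? "yes" : "no";
printf "internal: %d vs %d bytes\n",
    length(internal_buffer($down)), length(internal_buffer($up));
```

The two scalars have the same length (2) and compare equal, while their internal buffers are 2 and 4 bytes long. That matches both the Devel::Peek dumps and the `length`/`eq` results shown in this ticket: the buffers differ, the strings do not.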
CC: IKEGAMI [...] cpan.org
Subject: Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date: Sun, 16 Oct 2011 17:58:45 -0400
To: bug-XML-Fast [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
On Sun, Oct 16, 2011 at 5:44 AM, Mons Anderson via RT <bug-XML-Fast@rt.cpan.org> wrote:
> > Look again. In both case, the scalar contains a string of length one: B2.
>
> No!
> You pass different byte buffers to xml parser
> perl -MEncode -e' use bytes ();
> $xml = encode_utf8(qq{<root>\xB2</root>});
> utf8::downgrade( $xml_d = $xml );
> utf8::upgrade( $xml_u = $xml );
> die if bytes::length($xml_d) != bytes::length($xml_u);
> '
> Died at -e line 5.
>
> One string is of bytes length 2, another of length 4

Your proof relies on a module that's been deprecated because it suffers from the Unicode Bug! As you can see below, upgrade and downgrade do not change the content of the scalar:

    $ perl -E'
       $x = qq{\xB2};
       utf8::downgrade( $x_d = $x ); say length $x_d;
       utf8::upgrade( $x_u = $x ); say length $x_u;
       say $x_d eq $x_u ? "same" : "diff";
    '
    1
    1
    same

    $ perl -MEncode -E'
       $xml = encode_utf8(qq{<root>\xB2</root>});
       utf8::downgrade( $xml_d = $xml ); say length $xml_d;
       utf8::upgrade( $xml_u = $xml ); say length $xml_u;
       say $xml_d eq $xml_u ? "same" : "diff";
    '
    15
    15
    same

They just change how it's stored. Any code that relies on that, by definition, suffers from The Unicode Bug we've been fixing everywhere possible.

> I see: you want I do an implicit upgrade or downgrade on passed strings.
> But I won't do it.
> It will give wrong behavior on correct utf strings and dramatically slow
> down parsing speed

Both those claims are false.

1) It will never produce the wrong behaviour. Quite the opposite: the module is currently exhibiting the wrong behaviour, and this will fix it.

2) It will not slow down parsing at all. For those who pass a downgraded string, no work needs to be done. For those who pass an upgraded string, it currently doesn't work at all.

- Eric
On Fri Oct 07 15:41:33 2011, ikegami wrote:
> XML::Fast suffers from "The Unicode Bug", which is to say it is
> sensitive to how the XML string is stored internally when it shouldn't
> be.

IMHO, an XML parser should simply croak when the input is an upgraded string, because the XML encoding can be determined only by the parser itself (see http://www.w3.org/TR/2008/REC-xml-20081126/#charsets and http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding).
CC: IKEGAMI [...] cpan.org
Subject: Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date: Fri, 21 Oct 2011 08:56:54 -0400
To: bug-XML-Fast [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
On Fri, Oct 21, 2011 at 3:33 AM, Vladimir Timofeev via RT <bug-XML-Fast@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=71533 >
>
> On Fri Oct 07 15:41:33 2011, ikegami wrote:
> > XML::Fast suffers from "The Unicode Bug", which is to say it is
> > sensitive to how the XML string is stored internally when it shouldn't
> > be.
>
> IMHO, xml parser should simple croak when input is upgraded string.
> Because XML encoding can be determined only by parser itself. (see
> http://www.w3.org/TR/2008/REC-xml-20081126/#charsets and
> http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding)

I did not suggest that XML-Fast should accept decoded XML, so I don't see how this comment is relevant.
+1 for ikegami - this is a bug.

On Fri Oct 07 23:41:33 2011, ikegami wrote:
> XML::Fast suffers from "The Unicode Bug", which is to say it is
> sensitive to how the XML string is stored internally when it shouldn't be.
>
> $ perl -MEncode -MXML::Fast -e'
>    $xml = encode_utf8(qq{<root>\xB2</root>});
>    utf8::downgrade( $xml_d = $xml );
>    utf8::upgrade( $xml_u = $xml );
>    die if xml2hash($xml_d)->{root} ne xml2hash($xml_u)->{root};
> '
> Died at -e line 5.
CC: IKEGAMI [...] cpan.org
Subject: Re: [rt.cpan.org #71533] Suffers from "The Unicode Bug"
Date: Wed, 20 Nov 2013 19:10:50 -0500
To: bug-XML-Fast [...] rt.cpan.org
From: Eric Brine <ikegami [...] adaelis.com>
The fix is simple:

     SV*
    -_xml2hash(xml,conf)
    -        char *xml;
    +_xml2hash(xml_sv,conf)
    +        SV *xml_sv;
             HV *conf;
         PROTOTYPE: $$
         CODE:
    +        SvGETMAGIC(xml_sv);
    +        char *xml = SvPVbyte_nolen(xml_sv);
    +
             SV * RV;

On Wed, Nov 20, 2013 at 6:11 PM, Victor Efimov via RT <bug-XML-Fast@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=71533 >
>
> +1 for ikegami - this is a bug.
>
> On Fri Oct 07 23:41:33 2011, ikegami wrote:
> > XML::Fast suffers from "The Unicode Bug", which is to say it is
> > sensitive to how the XML string is stored internally when it shouldn't
> > be.
> >
> > $ perl -MEncode -MXML::Fast -e'
> >    $xml = encode_utf8(qq{<root>\xB2</root>});
> >    utf8::downgrade( $xml_d = $xml );
> >    utf8::upgrade( $xml_u = $xml );
> >    die if xml2hash($xml_d)->{root} ne xml2hash($xml_u)->{root};
> > '
> > Died at -e line 5.