On Sat Oct 15 18:37:52 2011, ikegami wrote:
> On Sat Oct 15 16:46:33 2011, MONS wrote:
> > This is not a bug in XML::Fast.
> > Let me explain.
> > The "Unicode Bug" is an unclear definition of behavior in perl itself.
>
> The "Unicode Bug" refers to programs that behave differently based on
> the internal format of a string scalar.
>
> > XML::Fast works well and the same way with UTF8 flagged and unflagged
> > SVs. It treats any SV just as a byte buffer.
>
> I showed otherwise. It uses the internal encoding of the string rather
> than treating the string as a byte buffer.
>
> > The problem is in your misunderstanding of the role of utf8::upgrade
>
> upgrade and downgrade change how a string is stored, not the content
> of the scalar. Since the content of the two scalars is the same,
> XML::Fast should return the same result for both.
>
> > Please note the difference in PV value.
> > In the first case it is one byte, \262; in the second it is the correct UTF-8
> > sequence of 2 bytes, \302\262,
> > which represents the Unicode symbol U+00B2
>
> Look again. In both cases, the scalar contains a string of length one: B2.
No!
You pass different byte buffers to the XML parser:
perl -MEncode -e' use bytes ();
$xml = encode_utf8(qq{<root>\xB2</root>});
utf8::downgrade( $xml_d = $xml );
utf8::upgrade( $xml_u = $xml );
die if bytes::length($xml_d) != bytes::length($xml_u);
'
Died at -e line 5.
The non-ASCII character takes 2 bytes in one string and 4 bytes in the other.
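To make the difference visible byte by byte, here is a minimal sketch that dumps the internal buffer of each scalar. It uses only the single character \xB2 rather than the whole document, and internal_bytes is a hypothetical helper written for this illustration, not part of XML::Fast or Encode.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: hex dump of the bytes perl stores internally for a scalar.
# For a UTF8-flagged scalar the internal buffer is the UTF-8 encoding of its
# characters, so encoding a copy reproduces that buffer; an unflagged scalar
# already holds its bytes as-is.
sub internal_bytes {
    my ($sv) = @_;
    my $copy = $sv;
    utf8::encode($copy) if utf8::is_utf8($copy);
    return join ' ', map { sprintf '%02X', ord } split //, $copy;
}

my $xml = encode_utf8("\xB2");          # UTF-8 bytes of U+00B2

my $xml_d = $xml;
utf8::downgrade($xml_d);                # stored as plain bytes

my $xml_u = $xml;
utf8::upgrade($xml_u);                  # each byte re-stored as a character

print internal_bytes($xml_d), "\n";     # C2 B2
print internal_bytes($xml_u), "\n";     # C3 82 C2 B2
```

C2 B2 is \302\262 in octal, and C3 82 C2 B2 is \303\202\302\262.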
> Treating these two scalars differently is the very definition of the
> Unicode Bug. Sorry, but it's you that misunderstands.
>
> I presume you use SvPV, in which case the fix is to use SvPVbyte.
SvPVbyte, SvPVutf8, and SvPV return the same char* buffer: 2 bytes for the first scalar
(\302\262) and 4 bytes for the second (\303\202\302\262).
The first is a sequence of one Unicode character; the second is a sequence of two Unicode characters.
I see: you want me to do an implicit upgrade or downgrade on the strings passed in.
But I won't do that.
It would give wrong behavior on correct UTF-8 strings and dramatically slow down parsing,
because recoding adds one more pass over the string.
What you pass is what I parse.
Composing a correct byte buffer is entirely the job of the module's user.
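One way a caller might compose such a buffer is sketched below, assuming the document starts out as a Perl character string. to_xml_bytes is a hypothetical helper name used only for this example; the result is an unflagged UTF-8 byte string that can be handed to the parser.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into the UTF-8 byte buffer
# that will be passed to the parser.
sub to_xml_bytes {
    my ($chars) = @_;
    return encode('UTF-8', $chars);
}

my $doc = "<root>\x{B2}</root>";        # character string, U+00B2 inside
my $buf = to_xml_bytes($doc);           # U+00B2 becomes the two bytes C2 B2

printf "%d bytes, UTF8 flag: %s\n",
    length($buf), utf8::is_utf8($buf) ? 'on' : 'off';
```

Encoding once, at the point where the document is assembled, keeps any recoding cost on the caller's side rather than in the parser.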