Bug #26588 for XML-SAX: charset decoding is broken

Mon Apr 23 16:53:30 2007 ntyni [...] iki.fi - Ticket created

Subject:

charset decoding is broken

On 0.15, when reading from a stream, XML::SAX::PurePerl does not decode the first 4096 bytes into the Perl internal UTF-8 representation, although it sets the filehandle PerlIO encoding to UTF-8. This is a regression from 0.12. The problem is that the Unicode version of XML::SAX::PurePerl::Reader::switch_encoding_string() uses Encode::from_to(), which does not set the Perl internal UTF-8 flag. Replacing this with eg. Encode::decode() fixes the bug. This affects those bytes that are first read into the buffer before setting the PerlIO encoding. With the fix to SAX/PurePerl/Reader/UnicodeExt.pm, there's a test failure from t/14encoding.t. It turns out that there are bugs in XML::SAX::PurePerl::Productions : the $NameChar regexp shouldn't use $Letter, since that contains beginning and end anchors (^ and $). In fact, it looks like the $Letter production is unused now and $NameChar shouldn't have any anchors either. (It also looks like the binding of the anchors is broken, since /^a|b$/ means (/^a/ || /b$/), not /^(a|b)$/.) I'm attaching a proposed patch that adds a testcase for these issues and fixes them for me. The tests pass for me on 0.12 and fail on 0.15. I haven't tested on an old non-Unicode Perl; this is on Perl 5.8.8 on Debian Etch (4.0). I'm a bit uneasy that switch_encoding_string() can't be called twice now without a fatal error, but I'm not sure what is the best thing to do. Maybe just make it a no-op if the new charset is UTF-8 and Encode::is_utf8 is set? I suppose it has never worked if the charset is not UTF-8 on the second call.... FWIW, this issue has caused Debian bug #405186, http://bugs.debian.org/405186 . Please let me know if you need more information; I'll be happy to help in any way I can. Cheers, -- Niko Tyni ntyni@iki.fi

Subject:

XML-SAX-patch-new

Download XML-SAX-patch-new
application/octet-stream 4k

Message body not shown because it is not plain text.

Mon Apr 23 17:42:13 2007 grant [...] mclean.net.nz - Correspondence added

Subject:	Re: [rt.cpan.org #26588] charset decoding is broken
Date:	Tue, 24 Apr 2007 09:41:43 +1200
To:	bug-XML-SAX [...] rt.cpan.org
From:	Grant McLean <grant [...] mclean.net.nz>

Hi Thanks for the comprehensive report and patch. I hope to integrate it and get a release out 'soon' :-) One thing that intrigues me is that the Debian bug you referred to was a build failure in the package docbook2x. That package sepcifies a dependency on libxml-sax-expat-perl. If that package is installed, then the XML::SAX::PurePerl module shouldn't be being used at all. I'd recommend only using the PurePerl parser in situations where building expat of libxml2 libraries is problematic - that is certainly not the case with Debian. The dependencies do seem to be sane, I just don't understand why the PurePerl module was being used. Or did the dependency get changed to libxml-sax-expat-perl *after* the bug report? Notwithstanding my confusion above, I am interested in tracking down and fixing bugs in XML::SAX::PurePerl, so thanks again. Regards Grant On Mon, 2007-04-23 at 16:53 -0400, ntyni@iki.fi via RT wrote: Show quoted text

> Mon Apr 23 16:53:30 2007: Request 26588 was acted upon. > Transaction: Ticket created by ntyni@iki.fi > Queue: XML-SAX > Subject: charset decoding is broken > Broken in: 0.15 > Severity: Important > Owner: Nobody > Requestors: ntyni@iki.fi > Status: new > Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=26588 > > > > On 0.15, when reading from a stream, XML::SAX::PurePerl does not decode > the first 4096 bytes into the Perl internal UTF-8 representation, > although it sets the filehandle PerlIO encoding to UTF-8. > > This is a regression from 0.12. The problem is that the Unicode version > of XML::SAX::PurePerl::Reader::switch_encoding_string() uses > Encode::from_to(), which does not set the Perl internal UTF-8 flag. > Replacing this with eg. Encode::decode() fixes the bug. This affects > those bytes that are first read into the buffer before setting the > PerlIO encoding. > > With the fix to SAX/PurePerl/Reader/UnicodeExt.pm, there's a test > failure from t/14encoding.t. It turns out that there are bugs in > XML::SAX::PurePerl::Productions : the $NameChar regexp shouldn't use > $Letter, since that contains beginning and end anchors (^ and $). In > fact, it looks like the $Letter production is unused now and $NameChar > shouldn't have any anchors either. (It also looks like the binding of > the anchors is broken, since /^a|b$/ means (/^a/ || /b$/), not /^(a|b)$/.) > > I'm attaching a proposed patch that adds a testcase for these issues and > fixes them for me. The tests pass for me on 0.12 and fail on 0.15. I > haven't tested on an old non-Unicode Perl; this is on Perl 5.8.8 on > Debian Etch (4.0). > > I'm a bit uneasy that switch_encoding_string() can't be called twice now > without a fatal error, but I'm not sure what is the best thing to do. > Maybe just make it a no-op if the new charset is UTF-8 and > Encode::is_utf8 is set? I suppose it has never worked if the charset is > not UTF-8 on the second call.... > > FWIW, this issue has caused Debian bug #405186, > http://bugs.debian.org/405186 . > > Please let me know if you need more information; I'll be happy to help > in any way I can. > > Cheers,

Mon Apr 23 17:42:29 2007 The RT System itself - Status changed from 'new' to 'open'

Mon Apr 23 18:00:36 2007 ntyni [...] iki.fi - Correspondence added

Subject:	Re: [rt.cpan.org #26588] charset decoding is broken
Date:	Tue, 24 Apr 2007 01:00:23 +0300
To:	Grant McLean via RT <bug-XML-SAX [...] rt.cpan.org>
From:	Niko Tyni <ntyni [...] iki.fi>

On Mon, Apr 23, 2007 at 05:42:30PM -0400, Grant McLean via RT wrote: Show quoted text

> One thing that intrigues me is that the Debian bug you referred to was a > build failure in the package docbook2x. That package sepcifies a > dependency on libxml-sax-expat-perl. If that package is installed, then > the XML::SAX::PurePerl module shouldn't be being used at all. > > I'd recommend only using the PurePerl parser in situations where > building expat of libxml2 libraries is problematic - that is certainly > not the case with Debian. The dependencies do seem to be sane, I just > don't understand why the PurePerl module was being used. Or did the > dependency get changed to libxml-sax-expat-perl *after* the bug report?

It looks like docbook2x depends on libxml-sax-perl for building the package ('Build-Depends'), and libxml-sax-expat-perl for actually using it. See http://packages.debian.org/unstable/source/docbook2x (or 'apt-cache showsrc docbook2x' if you are on Debian yourself.) I agree that this is confusing. I'm not very familiar with docbook2x myself, I just tracked the problem into this decoding difference. I don't know why they don't use libxml-sax-expat-perl for building the package too. Probably because they assume libxml-sax-perl is simpler because its name is shorter :) Show quoted text

> Notwithstanding my confusion above, I am interested in tracking down and > fixing bugs in XML::SAX::PurePerl, so thanks again.

Thank you for maintaining the module! Cheers, -- Niko Tyni ntyni@iki.fi

Tue Feb 12 14:58:19 2008 ntyni [...] iki.fi - Correspondence added

From:

ntyni [...] iki.fi

On Mon Apr 23 17:42:13 2007, grant@mclean.net.nz wrote: Show quoted text

> One thing that intrigues me is that the Debian bug you referred to was > a > build failure in the package docbook2x. That package sepcifies a > dependency on libxml-sax-expat-perl. If that package is installed, > then > the XML::SAX::PurePerl module shouldn't be being used at all. > > I'd recommend only using the PurePerl parser in situations where > building expat of libxml2 libraries is problematic - that is certainly > not the case with Debian. The dependencies do seem to be sane, I just > don't understand why the PurePerl module was being used. Or did the > dependency get changed to libxml-sax-expat-perl *after* the bug > report?

Sorry it took me so long to get back to this, but the real reason why the PurePerl parser was used was a bug in the Debian-specific ParserDetails.ini handling. See http://bugs.debian.org/430118 . I have since assumed the maintainership of this module in Debian (together with others in the Debian Perl Group), and this should now be finally fixed so that all other SAX parsers will now override the PurePerl one. You can find some more information on how the ParserDetails.ini file is handled in Debian at eg. http://svn.debian.org/wsvn/pkg-perl/trunk/libxml-sax-perl/debian/libxml-sax-perl.README.Debian?op=file&rev=0&sc=0 I'd be happy if we could find some way to converge on a common way of doing this, but obviously backward compatibility is going to hinder this for both of us. Still, if you'd like to discuss this more (possibly in another ticket or direct email, preferably via pkg-perl-maintainers@lists.alioth.debian.org), I'd be glad to help in any way I can. Again, thanks for your work on XML::SAX. Cheers, -- Niko Tyni ntyni@debian.org

Wed Aug 06 06:35:17 2008 grantm [...] cpan.org - Correspondence added

Thanks for your patch and detailed analysis (and also for your patience). I have integrated the patch in the recent 0.96 release of XML::SAX. Expect further releases as I work my way through the RT queue. Regards Grant

Wed Aug 06 06:35:20 2008 grantm [...] cpan.org - Status changed from 'open' to 'resolved'