Subject: PurePerl parser rejects DTD with unhelpful error message
Date: Tue, 8 Dec 2009 15:43:42 +0000
To: bug-XML-SAX [...] rt.cpan.org
From: mca+cpanrt [...] sanger.ac.uk
I had this problem with XML-SAX-0.96.
"perl-5.8.8 -v" reports
This is perl, v5.8.8 built for x86_64-linux-thread-multi
[...]
"uname -a" reports
Linux seq1a 2.6.22.19-lustre-1.6.7.1 #2 SMP Fri Apr 17 17:52:49 BST 2009 x86_64 GNU/Linux
It's a Debian 4.0 machine, but the Perl is built and installed by our
systems group, and extra modules are provided by the pathogen analysis group.
I believe I can demonstrate the problem cleanly... but the issue is
muddied at our end by having two XML::SAX installations with
independent ParserDetails.ini files, one of which didn't show the
problem because it defaulted to XML::LibXML::SAX.
The problem is provoked in the PurePerl parser by a DTD declaration such as
<!ELEMENT superscaffold (scaffold, (superbridge+,scaffold)*) >
The error message is
choice/seq contains no opening bracket [Ln: 12, Col: 125326784]
which we found unhelpful outside the context of parsing the DTD of an
XML document. Also, the column number doesn't make any sense to me; I
didn't investigate that any further.
A workaround is to re-write it as
<!ELEMENT superscaffold (scaffold, ((superbridge)+,scaffold)*) >
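The rewrite can be applied mechanically. What follows is a minimal sketch, not a tested fix: it uses the same substitution that appears (commented out) in the example script, wrapped in a hypothetical helper named bracket_plus_names. It assumes element names consist only of word characters (\w), so names containing "-", "." or ":" would need a wider pattern, and it should be run on the DTD text only, since it would also rewrite any "word+" in document content.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::SAX): wrap every bare "name+"
# in parentheses, so that the "+" follows a bracketed group instead
# of a plain element name.
sub bracket_plus_names {
    my ($dtd) = @_;
    $dtd =~ s{\b(\w+)\+}{($1)+}g;
    return $dtd;
}

my $decl = '<!ELEMENT superscaffold (scaffold, (superbridge+,scaffold)*) >';
print bracket_plus_names($decl), "\n";
# prints: <!ELEMENT superscaffold (scaffold, ((superbridge)+,scaffold)*) >
```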
Quick summary of versions,
all using XML::SAX::ParserFactory v1.01
XML::SAX::PurePerl v0.96 fails; v0.90 and v0.92 work OK
XML::LibXML::SAX v1.69, the W3C validator and some Java XML parser
all agree that the original document is valid
I'm sorry I haven't found a reference to the relevant piece of DTD
(E)BNF, or worked out a patch. I have included a short example that
provokes the problem, inline below.
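For reference, the element-content grammar appears to be productions [47] to [50] of the XML 1.0 specification, in which an occurrence indicator ("?", "*" or "+") may follow a bare Name as well as a bracketed group. The following is a rough sketch of those productions as a small recursive-descent checker, offered only as an illustration: the sub names (parse_group, parse_cp, is_valid_children) are made up for this example, Name is simplified to ASCII, and only element content is handled (not EMPTY, ANY or mixed content).

```perl
use strict;
use warnings;

# Productions [47]-[50] of the XML 1.0 specification:
#   children ::= (choice | seq) ('?' | '*' | '+')?
#   cp       ::= (Name | choice | seq) ('?' | '*' | '+')?
#   choice   ::= '(' S? cp ( S? '|' S? cp )+ S? ')'
#   seq      ::= '(' S? cp ( S? ',' S? cp )* S? ')'
# Assumption: Name is restricted to ASCII here; the real production is wider.
my $NAME = qr/[A-Za-z_:][A-Za-z0-9_:.\-]*/;

# Parse one bracketed group (choice or seq) at the current pos().
sub parse_group {
    my ($ref) = @_;
    $$ref =~ /\G\(\s*/gc or return 0;
    parse_cp($ref) or return 0;
    my $sep;    # ',' or '|', fixed by the first separator seen
    while ($$ref =~ /\G\s*([,|])\s*/gc) {
        $sep = $1 unless defined $sep;
        return 0 if $1 ne $sep;    # ',' and '|' may not be mixed in one group
        parse_cp($ref) or return 0;
    }
    return $$ref =~ /\G\s*\)/gc ? 1 : 0;
}

# Parse one content particle: a Name or a group, then an optional indicator.
sub parse_cp {
    my ($ref) = @_;
    ($$ref =~ /\G$NAME/gc or parse_group($ref)) or return 0;
    $$ref =~ /\G[?*+]/gc;    # optional occurrence indicator
    return 1;
}

# True if the whole string is a well-formed "children" content model.
sub is_valid_children {
    my ($model) = @_;
    pos($model) = 0;
    parse_group(\$model) or return 0;
    return $model =~ /\G[?*+]?\s*\z/gc ? 1 : 0;
}

for my $model ('(scaffold, (superbridge+,scaffold)*)',
               '(scaffold, ((superbridge)+,scaffold)*)') {
    print is_valid_children($model) ? "ok      " : "invalid ", $model, "\n";
}
```

Under this reading, both the original declaration and the workaround form are accepted, which agrees with the other parsers listed above.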
To fix my own code, I merely insist on using XML::LibXML.
I hope the bug report is useful,
--
Matthew
#! /software/bin/perl
#
# (That's the non-OS Perl instance supported by our sysads;
# /usr/bin/perl has no XML parser, DBI etc. installed. Local software
# is thus decoupled from OS upgrades.)
# This is a minimal SAX handler class, it never sees action
package NulHandl;
use base 'XML::SAX::Base';
package main;
use strict;
use warnings;
use YAML 'Dump';
use XML::SAX::ParserFactory;
sub main {
    # Dictate the parser
    $XML::SAX::ParserPackage = "XML::SAX::PurePerl";
    # $XML::SAX::ParserPackage = "XML::LibXML::SAX";
    # Set up
    my $xml = join "", <DATA>;
    my $xh = NulHandl->new;
    my $sax = XML::SAX::ParserFactory->parser(Handler => $xh);
    # It chokes on "+" in ChoiceOrSeq. We can fix it either of these
    # ways,
    # $xml =~ s{\b(\w+)\+}{($1)+}g; # bracket the element
    # $xml =~ s{\b(\w+)\+}{$1}g; # remove the plus
    # Show some info
    my %info =
        (xml_length => length($xml),
         '%INC' => \%INC,
         versions => { "XML::SAX" => XML::SAX->VERSION,
                       "XML::SAX::PurePerl" => XML::SAX::PurePerl->VERSION,
                       "XML::LibXML" => XML::LibXML->VERSION,
                       perl => $] },
         '$sax' => $sax);
    print Dump(\%info);
    # Make it go BANG
    my $eod = $sax->parse_string($xml);
    print "\n** finished without error **\n";
}
main();
__DATA__
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE assembly [
<!ELEMENT assembly (superscaffold*) >
<!ATTLIST assembly
instance CDATA #REQUIRED
organism CDATA #REQUIRED
date CDATA #REQUIRED
>
<!ELEMENT superscaffold (scaffold, (superbridge+,scaffold)*) >
<!ATTLIST superscaffold
id CDATA #REQUIRED
size CDATA #REQUIRED
>
<!ELEMENT scaffold (contig, (gap,contig)*) >
<!ATTLIST scaffold
id CDATA #REQUIRED
sense (F|R) #REQUIRED
>
<!ELEMENT contig EMPTY>
<!ATTLIST contig
id CDATA #REQUIRED
name CDATA #IMPLIED
size CDATA #REQUIRED
project CDATA #REQUIRED
sense (F|R) #REQUIRED
>
<!ELEMENT gap (bridge+)>
<!ATTLIST gap
size CDATA #REQUIRED
>
<!ELEMENT bridge (link+)>
<!ATTLIST bridge
template CDATA #REQUIRED
name CDATA #IMPLIED
silow CDATA #REQUIRED
sihigh CDATA #REQUIRED
gapsize CDATA #REQUIRED
>
<!ELEMENT superbridge (link+)>
<!ATTLIST superbridge
template CDATA #REQUIRED
name CDATA #IMPLIED
silow CDATA #REQUIRED
sihigh CDATA #REQUIRED
>
<!ELEMENT link EMPTY>
<!ATTLIST link
contig CDATA #REQUIRED
read CDATA #REQUIRED
cstart CDATA #REQUIRED
cfinish CDATA #REQUIRED
sense (F|R) #REQUIRED
>
]>
<!-- this data is truncated and redacted because it is not the cause of the problem -->
<assembly instance="pathogen" organism="FOO" date="2009-09-23 14:38:27" >
<superscaffold id="1" size="1538161" >
<scaffold id="1" sense="F" >
<contig id="1480" size="1223160" project="1" sense="F" />
</scaffold>
</superscaffold>
</assembly>
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.