Subject: XML::Parser BUG: XML::Parser does not parse big XML files correctly
Date: Tue, 31 Jul 2007 15:02:58 +0200
To: bug-XML-Parser [...] rt.cpan.org
From: Ondrej Sluciak <Ondrej.Sluciak [...] sitronicsts.sk>
It looks like XML::Parser cannot correctly parse XML files larger than
32 kB. It seems to have a buffer of 32,768 bytes which, when full, is
cleared and reused without regard to where in the XML file the parser
currently is. So if the 32,768th byte (or any N x 32,768th byte) of the
XML file falls in the middle of a string (i.e. the text between two
tags), the string is split into two strings, which show up as two
separate $xp->original_string values.
example:
<?xml...
...
...<tag1>expialidotion</tag1>
...<tag2>something</tag2>
..
where the "t" in "something" is the 32,768th byte of the processed XML
file, and using a Char handler such as:
sub char_hndl {
    my ($xp, $data, $oposum) = @_;
    if (($xp->current_byte > 32400 && $xp->current_byte < 32900) ||
        ($xp->current_byte > 65400 && $xp->current_byte < 65900)) {
        print "Str: " . $xp->original_string . "\n";
    }
}
I got this output:
...
Str: expialidotion
Str: somet
Str: ing
..
This is a problem, because I use "$xp->current_element" further as
input to a database, so instead of "something" I get just "ing" in the
DB. My current workaround is this: if $xp->current_byte is "near" one
of the "wrong" bytes (N x 32 kB), I manually concatenate the two
strings "somet" and "ing", save the result in a new string, and only
then send it to the database. Although it works, it is not a very nice
solution.
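For comparison, here is a minimal sketch of a workaround that does not
depend on byte offsets. It assumes that the Char handler may be called
more than once for a single run of text (the XML::Parser documentation
describes this as possible), so the pieces are collected and only used
once the End handler fires. The names start_hndl, end_hndl, $text and
@records are made up for this illustration, and the push into @records
stands in for the real database insert. It only covers elements that
contain plain text, not mixed content.

```perl
use strict;
use warnings;
use XML::Parser;

my $text = '';   # character data collected for the current element
my @records;     # [element name, complete text]; stand-in for the DB

sub start_hndl {
    my ($xp, $element) = @_;
    $text = '';              # new element: forget text seen so far
}

sub char_hndl {
    my ($xp, $data) = @_;
    $text .= $data;          # may run several times for one string
}

sub end_hndl {
    my ($xp, $element) = @_;
    # The element is closed, so $text now holds the whole string.
    push @records, [$element, $text] if $text =~ /\S/;
    $text = '';
}

my $parser = XML::Parser->new(
    Handlers => {
        Start => \&start_hndl,
        Char  => \&char_hndl,
        End   => \&end_hndl,
    },
);

$parser->parse(
    '<root><tag1>expialidotion</tag1><tag2>something</tag2></root>'
);

print "$_->[0]: $_->[1]\n" for @records;
```

With this approach it should not matter how many pieces the text
arrives in, because nothing is stored until the closing tag is seen.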
Otherwise, if the (N x) 32,768th byte falls in the middle of a tag,
everything works perfectly. It fails only if it falls between tags.
I'm using XML-Parser-2.3.4, Perl 5.8.8.