
This queue is for tickets about the XML-Parser CPAN distribution.

Report information
The Basics
Id: 28585
Status: resolved
Priority: 0/
Queue: XML-Parser

People
Owner: Nobody in particular
Requestors: Ondrej.Sluciak [...] sitronicsts.sk
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Subject: XML::Parser BUG: XML::Parser does not parse big XML files correctly
Date: Tue, 31 Jul 2007 15:02:58 +0200
To: bug-XML-Parser [...] rt.cpan.org
From: Ondrej Sluciak <Ondrej.Sluciak [...] sitronicsts.sk>
It looks like XML::Parser can't parse XML files bigger than 32 kB correctly. It seems to have some kind of buffer of 32,768 bytes which, when full, is erased and reused regardless of where in the XML file the parser currently is. So if the 32,768th byte (or any N x 32,768th byte) of the XML file falls in the middle of some string (i.e. a string between two tags), the string is split into two strings, which show up as two separate $x->original_string values.

Example:

    <?xml... ...
    ...<tag1>expialidotion</tag1>
    ...<tag2>something</tag2>
    ..

where "t" in "something" is the 32,768th byte of the processed XML file. Using this Char handler:

    sub char_hndl {
        my ($xp, $data, $oposum) = @_;
        if (   ($xp->current_byte > 32400 && $xp->current_byte < 32900)
            || ($xp->current_byte > 65400 && $xp->current_byte < 65900) ) {
            print "Str: " . $xp->original_string . "\n";
        }
    }

I got this output:

    ...
    Str: expialidotion
    Str: somet
    Str: ing
    ..

This is just not good, because I use "$xp->current_element" further as an input to a database, so instead of "something" I get just "ing" in the DB.

My current workaround: if $xp->current_byte is "near" one of the "wrong" bytes (N x 32 kB), I manually concatenate the two strings "somet" and "ing", save the result into a new string, and only then send it to the database. It works, but it is not a very nice solution.

On the other hand, if the (N x) 32,768th byte falls in the middle of a tag, everything works perfectly. It fails only if it falls between tags.

I'm using XML-Parser-2.3.4, Perl 5.8.8.
On Tue Jul 31 09:03:25 2007, Ondrej.Sluciak@sitronicsts.sk wrote:
> It looks like XML::Parser can't parse XML files bigger than 32kB
> correctly. It seems like it has some kind of buffer of size 32,768 B
> which, when full, is erased and reused without caring where in the XML
> file the parser currently is. So if 32,768th byte (every N x 32,768th
> Byte) of the XML file is just in the middle of some string (i.e. string
> between two tags), the string is split into two strings, which can be
> seen in two $x->original_string "variables"...
This is documented behaviour: text elements are NOT guaranteed to be returned in a single callback. This is what the docs have to say about it:

    Char (Expat, String)

    This event is generated when non-markup is recognized. The non-markup sequence of characters is in String. A single non-markup sequence of characters may generate multiple calls to this handler. Whatever the encoding of the string in the original document, this is given to the handler in UTF-8.

The proper way to handle character data is to buffer it, and then use it once you get to the end tag (see for example http://www.perlmonks.org/?node_id=31798 ).

__
mirod
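
To illustrate the buffering approach, here is a minimal sketch. It is not taken from the linked PerlMonks node. The helper name collect_text and the single shared buffer are assumptions made for this example, and the sketch only handles leaf elements that contain plain text; mixed content would need a stack of buffers.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper (not part of XML::Parser): parse an XML string and
# return a list of [element name, complete text] pairs for leaf elements.
sub collect_text {
    my ($xml) = @_;
    my @results;
    my $buffer = '';

    my $parser = XML::Parser->new(
        Handlers => {
            # A new element starts: discard any text collected so far.
            Start => sub { $buffer = ''; },

            # Char may be called several times for one run of text,
            # so append each piece instead of using it directly.
            Char => sub {
                my ($xp, $text) = @_;
                $buffer .= $text;
            },

            # At the end tag the buffered text is complete.
            End => sub {
                my ($xp, $element) = @_;
                push @results, [ $element, $buffer ] if $buffer =~ /\S/;
                $buffer = '';
            },
        },
    );

    $parser->parse($xml);
    return @results;
}

my $xml = '<root><tag1>expialidotion</tag1><tag2>something</tag2></root>';
for my $pair (collect_text($xml)) {
    my ($element, $text) = @$pair;
    print "$element: $text\n";
}
```

Because the text is only used in the End handler, it does not matter how many Char calls the parser makes for a given string, so no special handling near the N x 32,768 byte positions should be needed.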