Subject: | current_byte method returns overflowed data when parsing very large XML files |
When parsing a very large XML file (somewhere over 2 gigabytes), the current_byte method of
XML::Parser::Expat starts returning negative values. The values wrap just past 2147483647
(2**31 - 1), which suggests the byte offset is being held in a signed 32-bit integer. Attached
to this ticket is a simple program that parses an XML file and prints the current byte value
every second. The output when the bug occurs looks like this:
2134412390
2137250345
2140088951
2142891080
2145707930
-2146463171
-2143671866
-2140868846
-2138058386
-2135266961
-2132475521
This is a particular problem because the Wikipedia dump files are extremely large.
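Until the offset is reported correctly, a caller can reconstruct the true position on its own. The following is only a sketch of a possible workaround, not a fix for XML::Parser itself. It assumes the offset wraps as a signed 32-bit value, that it only ever increases during a parse, and that the handler is called often enough that at most one wrap happens between calls. The name make_unwrapper is hypothetical and not part of any module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Parser): returns a closure that
# turns a wrapped signed 32-bit offset back into a monotonically
# increasing value.
sub make_unwrapper {
    my $wraps = 0;    # number of 2**32 wraparounds seen so far
    my $prev;         # last raw value passed in

    return sub {
        my ($raw) = @_;

        # Assumption: the real offset never decreases, so a drop in the
        # raw value means the 32-bit counter wrapped around.
        $wraps++ if defined $prev && $raw < $prev;
        $prev = $raw;

        return $raw + $wraps * 4294967296;    # 4294967296 == 2**32
    };
}

# Feed in a few of the values from the output above.
my $unwrap = make_unwrapper();
print $unwrap->($_), "\n"
    for (2142891080, 2145707930, -2146463171, -2143671866);
```

With the sample values from this report, the first negative reading (-2146463171) comes out as 2148504125, which continues the roughly 2.8 million bytes per second seen before the wrap. In the attached script, the closure would wrap the call in the Char handler, for example $LAST_BYTE = $unwrap->($e->current_byte).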
Subject: | xml-parser-expat-overflow.pl |
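#!/usr/bin/env perl
use strict;
use warnings;
use XML::Parser;

# Most recent byte offset reported by the parser.
our $LAST_BYTE = 0;

# Print the latest offset once per second.
$SIG{ALRM} = \&print_byte;

my $xml = XML::Parser->new(
    Handlers => {
        Char => \&char,
    }
);

alarm(1);
$xml->parsefile(shift(@ARGV));

# Character data handler: record the parser's current byte offset.
sub char {
    my ($e) = @_;
    $LAST_BYTE = $e->current_byte;
}

# SIGALRM handler: print the last recorded offset and re-arm the alarm.
sub print_byte {
    print "$LAST_BYTE\n";
    alarm(1);
}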