Bug #46582 for IO-Compress: Inflation Error when uncompressing a bzip2 file of XML data

Mon Jun 01 09:19:56 2009 nattiya [...] gmail.com - Ticket created

Subject:	Inflation Error when uncompressing a bzip2 file of XML data
Date:	Mon, 1 Jun 2009 15:19:28 +0200
To:	bug-IO-Compress [...] rt.cpan.org, pmqs [...] cpan.org
From:	Nattiya Kanhabua <nattiya [...] gmail.com>

Hello,

I am trying to read the contents of a bzip2 file of XML data using IO::Uncompress::Bunzip2 (IO-Compress-2.019.tar.gz). The system is running Perl v5.10.0 built for x86_64-linux-gnu-thread-multi. My program is as follows:

-------------------------------------------------------------------------
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw( $Bunzip2Error );

my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
my $buf;
my $read;
my $gz = new IO::Uncompress::Bunzip2 $file
    or die "Cannot open $file: $Bunzip2Error\n" ;

print $read, " ", $buf, "\n"
    while ($read = read($gz, $buf, 32768)) > 0 ;

die "Error reading from $file: $Bunzip2Error\n"
    if $read < 0 ;

$gz->close() ;
-------------------------------------------------------------------------

However, running the program failed and I got error messages like the following:

-------------------------------------------------------------------------
<page>
<title>General Conference on Weights and Measures</title>
<id>7339</id>
<revision>
<id>779001</id>
<timestamp>2003-02-16T16:18:30Z</timestamp>
</revision>
Error reading from test.xml.bz2: Inflation Error: Data Error
$ 1;2c
-------------------------------------------------------------------------

Please note that the uncompressed XML data file is successfully parsed and read using XML::Parser. So I am not sure what I am missing in my code to uncompress and read the XML data. What does "Inflation Error" mean? Does this "1;2c" have a special meaning in Perl?

Thank you for your kind help.

Nattiya

Mon Jun 01 11:47:30 2009 pmqs [...] cpan.org - Correspondence added

A data error usually means the bzip2 file is corrupt. Can you test the integrity of the file with the command-line bzip2 command that should be available on your Linux box?

    bzip2 -t your-bzip2-file

If that shows the file is ok, can you get back to me?

Paul
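As a rough sketch of the integrity check suggested above: bzip2 -t prints nothing when the file is intact and reports the result through its exit status, so a silent run means the test passed. The helper name bzip2_file_ok is hypothetical, and the snippet assumes the bzip2 command is on the PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run `bzip2 -t` on a file and turn the exit status
# into a boolean. Returns 1 if bzip2 reports the file as intact, else 0.
sub bzip2_file_ok {
    my ($file) = @_;

    # List form of system() bypasses the shell, so unusual file names are safe.
    my $rc = system('bzip2', '-t', $file);
    die "Could not run bzip2: $!\n" if $rc == -1;

    # bzip2 -t is silent on success; only a zero status means "file is ok".
    return $rc == 0 ? 1 : 0;
}

# Only runs when the script is called directly with a file argument.
if (!caller && @ARGV) {
    my $file    = shift @ARGV;
    my $verdict = bzip2_file_ok($file) ? 'ok' : 'failed integrity test';
    print "$file: $verdict\n";
}
```

Running bzip2 -tv instead of bzip2 -t on the command line makes bzip2 print an explicit per-file result.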

Mon Jun 01 11:47:31 2009 The RT System itself - Status changed from 'new' to 'open'

Mon Jun 01 14:08:36 2009 nattiya [...] gmail.com - Correspondence added

Subject:	[rt.cpan.org #46582] Inflation Error when uncompressing a bzip2 file of XML data
Date:	Mon, 1 Jun 2009 20:08:16 +0200
To:	bug-IO-Compress [...] rt.cpan.org
From:	Nattiya Kanhabua <nattiya [...] gmail.com>

Hi Paul,

Thanks for your reply. Sorry if this email is a bit long, but I want to give you my findings.

I tested 'bzip2 -t test.xml.bz2' and got nothing returned (neither a success nor an error message). I assumed the integrity check was OK since I compressed the file with the 'bzip2' command myself. Moreover, when I uncompressed the XML file and parsed it with the XML::Parser module, I could process the file up to EOF. So I am not sure it is a data problem.

I think the read() sub in IO::Uncompress::Bunzip2 actually works, since it prints the uncompressed XML contents as I showed in my first email. However, read(), as in 'while (read() > 0);', stops after some iterations (always at the same position, e.g. line #40 000, in the bzip2 XML file, before EOF) and I get the message 'Error reading from test.xml.bz2: Inflation Error: Data Error'.

To make sure that it was not caused by the data contents, I created another XML file starting from the line where the program stopped, e.g. line #40 000 (deleting the part that had already been read successfully), and compressed it with bzip2. The read() sub behaved as in the first case: it stopped at some point in the file before EOF, and, interestingly, it processed exactly the same amount of XML input. So I think an error in the data contents can be ruled out.

I am very new to Perl, but I tried to debug and found where your code actually stops. The program exits from Base.pm at line 862, when the $status returned from uncompr() is as follows:

-------------------------------------------------------------------------
STATUS_ERROR = 1
self->{Uncomp}{Error} = Inflation Error: Data Error
self->{Uncomp}{ErrorNo} = Data Error
-------------------------------------------------------------------------

The $status returned from uncompr() comes from Adapter/Bzip2.pm, at the line 4: $status = $inf->bzinflate($from, $to);. I am wondering whether the size of a compressed file causes a problem for bzinflate($from, $to). I tested with a 2MB file only (25MB uncompressed), but my real input will be about 150GB...

Thank you very much for your kind help.

Nattiya

On Mon, Jun 1, 2009 at 5:47 PM, Paul Marquess via RT <bug-IO-Compress@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=46582 >
>
> A data error usually means the bzip2 file is corrupt. Can you test the
> integrity of the file with the commandline bzip2 command that should be
> available on your Linux box.
>
> bzip2 -t your-bzip2-file
>
> If that shows the file is ok, can you get back to me?
>
> Paul

Mon Jun 01 15:46:51 2009 pmqs [...] cpan.org - Correspondence added

Could you email me an example bzip2 file? Use the pmqs@cpan.org address.

Paul

Mon Jun 01 18:21:24 2009 pmqs [...] cpan.org - Correspondence added

Thanks for the file - I can confirm that the file is fine, but my module does think there is an error with it. Will investigate & get back to you. Paul

Mon Jun 01 18:49:33 2009 nattiya [...] gmail.com - Correspondence added

Subject:	[rt.cpan.org #46582] Inflation Error when uncompressing a bzip2 file of XML data
Date:	Tue, 2 Jun 2009 00:49:14 +0200
To:	bug-IO-Compress [...] rt.cpan.org
From:	Nattiya Kanhabua <nattiya [...] gmail.com>

Thanks for your time and kind response.

Nattiya

On Tue, Jun 2, 2009 at 12:21 AM, Paul Marquess via RT <bug-IO-Compress@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=46582 >
>
> Thanks for the file - I can confirm that the file is fine, but my module
> does think there is an error with it. Will investigate & get back to you.
>
> Paul

Tue Jun 02 09:06:23 2009 pmqs [...] cpan.org - Correspondence added

Nattiya,

I have a fix available for the problem you reported. I'll post an update to CPAN in the next few days.

If you want to try out the fix, here is what you need to change. Run perl -V to see where your Perl library files are installed - they are probably under /usr/lib/perl5. Now find the file IO/Uncompress/Base.pm, go to line 857 and change this line

    $self->pushBack($temp_buf) if $beforeC_len != length $temp_buf;

to this

    $self->pushBack($temp_buf) ;

If you have any further problems, please get back to me.

Paul

Tue Jun 02 13:48:56 2009 nattiya [...] gmail.com - Correspondence added

Subject:	[rt.cpan.org #46582] Inflation Error when uncompressing a bzip2 file of XML data
Date:	Tue, 2 Jun 2009 19:48:34 +0200
To:	bug-IO-Compress [...] rt.cpan.org
From:	Nattiya Kanhabua <nattiya [...] gmail.com>

Hi Paul,

It is working well so far. Now I am testing with the real data, 160GB compressed, which will take a while to finish. I will get back to you if I notice any other bugs.

Thank you very much for your kind help.

Best,
Nattiya

Tue Jun 02 18:17:21 2009 pmqs [...] cpan.org - Correspondence added

Hi Nattiya,

good to hear it is working now. Please get back to me if you encounter any problems.

cheers
Paul

Wed Jun 03 13:49:43 2009 pmqs [...] cpan.org - Correspondence added

Hi Nattiya, just uploaded a new version of IO-Compress to CPAN with the fix included. Paul

Wed Jun 03 16:26:39 2009 nattiya [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #46582] Inflation Error when uncompressing a bzip2 file of XML data
Date:	Wed, 3 Jun 2009 22:26:20 +0200
To:	bug-IO-Compress [...] rt.cpan.org
From:	Nattiya Kanhabua <nattiya [...] gmail.com>

Hi Paul,

Thank you for the updated version. I have questions related to functionality, not a bug. If you cannot answer here because it is out of your scope, then I understand :)

It is about scalability/efficiency, because of my huge amount of data. Basically, I want to use 2-3 computers working in parallel to uncompress the same file, and hopefully this will speed up the process. So I need to seek() to a specified position in the file -- without uncompressing along the way. The current seek() in your module reads the data sequentially and uncompresses it, even the part before the specified position. This might be useful in some cases, but in my case it just wastes CPU time uncompressing unwanted data (the data before the seek position).

I do not want to disturb you or harm your code, but I might need to change something to optimize it for my task. In order to understand your code better, I would like to ask the following:

1. What is sub smartSeek() in Base.pm? Is it for the purpose I mention above -- seek() to the position and skip uncompressing?

2. What is the maximum value of BlockSize (let's say I have more than 10GB of memory)? I would like to set it as high as possible to speed up the uncompress process.

As you are an expert on compression/uncompression, I hope that you can help me with answers so I can work on my own. Thank you for your kind response and time.

Best,
Nattiya

Thu Jun 04 03:32:53 2009 pmqs [...] cpan.org - Correspondence added

On Wed Jun 03 16:26:39 2009, nattiya@gmail.com wrote:

> Hi Paul,
>
> Thank you for the updated version. I have questions related to
> functionality, not bug. If you could not answer here because it is out
> of your scope, then I understand :)

No problem.

> It is about scalability/efficiency because of my huge amount of data.
> Basically, I want to use 2-3 computers working in parallel to
> uncompress the same file and hopefully this will faster the process.
> So, I need to use seek() into a file at a specified position --
> without uncompress along the way. The current seek() in your module
> sequentially reads data and uncompresses it even the part before the
> specified position. This might be useful in some cases, but for my
> case, it just waste CPU time uncompressing unwanted data (data before
> a seeking position).

For a lot of compressed data formats the only way to seek to a specific offset is to uncompress the file sequentially until the uncompressed offset is reached.

... but see the end of this message for a possible solution.
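To illustrate what seeking by sequential decompression amounts to, here is a minimal sketch that reaches an uncompressed offset by reading and discarding everything before it. The helper skip_to_offset and the 64KB chunk size are assumptions made for illustration; this is not how the module implements seek() internally.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Hypothetical helper: advance an open Bunzip2 stream to an uncompressed
# offset by decompressing and throwing away the data in front of it.
# Returns the number of bytes actually skipped (less than $offset at EOF).
sub skip_to_offset {
    my ($z, $offset) = @_;
    my $remaining = $offset;
    my $discard   = '';

    while ($remaining > 0) {
        # Read in bounded chunks so memory use stays small.
        my $want = $remaining < 65536 ? $remaining : 65536;
        my $got  = $z->read($discard, $want);
        die "Read error: $Bunzip2Error\n" if $got < 0;
        last if $got == 0;    # reached EOF before the requested offset
        $remaining -= $got;
    }
    return $offset - $remaining;
}

# Only runs when called directly as: script.pl file.bz2 offset
if (!caller && @ARGV == 2) {
    my ($file, $offset) = @ARGV;
    my $z = IO::Uncompress::Bunzip2->new($file)
        or die "Cannot open $file: $Bunzip2Error\n";
    my $skipped = skip_to_offset($z, $offset);
    print "Skipped $skipped uncompressed bytes\n";
    $z->close();
}
```

Every skipped byte still has to be decompressed, which is why this approach does not save CPU time.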

> I do not want to disturb you or harm your code, but I might need to
> change something to optimize it to my task, In order to better
> understanding your code, I would like to ask you as follows:
>
> 1. What is sub smartSeek() in Base.pm? Is it for this purpose I
> mention above -- seek() to the position and ignore uncompressing?

Don't bother with that - it can only ever work by uncompressing the file to carry out the seek operation.

> 2. What is the maximum value of BlockSize (let say I have more than
> 10GB of memory)? I would like to set it as much as possible to faster
> uncompress process.

Here is a quote from the bzip2 man page (www.bzip.org):

    Compression and decompression speed are virtually unaffected by block size.

> As you are an expert on Compression/Uncompression, I hope that you can
> help me answering so I can work on my own. Thank you for your kindly
> response/time.

You are using bzip2, which uses a block file structure. This means that you can chop the file into parts - the standard bzip2 command line program comes with a utility called bzip2recover, which will split the bzip2 file into chunks at the block boundaries.

The big problem with doing this is when the thing you want to find in the compressed file spans a block boundary, or, as in your case, when you are attempting to parse an XML document.

Paul
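The bzip2recover approach could be sketched roughly as follows. The helpers chunk_files and decompress_chunk are hypothetical, and the code assumes that bzip2recover writes its pieces next to the input file with names of the form rec00001<name>, rec00002<name>, and so on; check the output of your bzip2recover version before relying on that pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical helper: list the chunk files for $file in block order.
# Assumes the rec<digits><original name> naming scheme.
sub chunk_files {
    my ($file) = @_;
    my ($name, $dir) = fileparse($file);
    opendir my $dh, $dir or die "Cannot read $dir: $!\n";
    my @chunks = sort grep { /^rec\d+\Q$name\E$/ } readdir $dh;
    closedir $dh;
    return map { "$dir$_" } @chunks;
}

# Hypothetical helper: decompress one chunk into memory and return it.
sub decompress_chunk {
    my ($chunk) = @_;
    my $data;
    bunzip2 $chunk => \$data
        or die "Cannot decompress $chunk: $Bunzip2Error\n";
    return $data;
}

# Only runs when called directly as: script.pl file.bz2
if (!caller && @ARGV) {
    my $file = shift @ARGV;
    system('bzip2recover', $file) == 0
        or die "bzip2recover failed on $file\n";
    for my $chunk (chunk_files($file)) {
        my $data = decompress_chunk($chunk);
        printf "%s: %d uncompressed bytes\n", $chunk, length $data;
    }
}
```

Each chunk can be handed to a different machine, but a chunk may start or end in the middle of an XML element, so the pieces have to be stitched back together in order before any record that crosses a boundary can be parsed.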

Thu Jun 04 08:06:34 2009 nattiya [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #46582] Inflation Error when uncompressing a bzip2 file of XML data
Date:	Thu, 4 Jun 2009 14:06:15 +0200
To:	bug-IO-Compress [...] rt.cpan.org
From:	Nattiya Kanhabua <nattiya [...] gmail.com>

Hi Paul,

> You are using bzip2, which uses a block file structure. This means that
> you can chop the file into parts - the standard bzip2 command line
> program comes with a utility called bzip2recover - this will split the
> bzip2 file into chunks at the block boundaries.

I appreciate your very good suggestion and will try to work on it :)

Thank you!
Nattiya

Sun Jan 03 10:17:12 2010 pmqs [...] cpan.org - Status changed from 'open' to 'resolved'