Bug #105460 for Compress-Raw-Lzma: use of Compress::Raw::Lzma::RawDecoder fails with large amount of data

Wed Jun 24 04:50:35 2015 perl_monkey [...] Safe-mail.net - Ticket created

Subject:	use of Compress::Raw::Lzma::RawDecoder fails with large amount of data
Date:	Wed, 24 Jun 2015 04:50:21 -0400
To:	bug-Compress-Raw-Lzma [...] rt.cpan.org
From:	perl_monkey [...] Safe-mail.net

I recently discovered a little problem with a script I've developed which uses the Compress::Raw::Lzma::RawEncoder / RawDecoder. Basically, the problem is that when I compress a (very) large buffer it doesn't successfully decompress it. Instead I get the error "Data is corrupted". I'm going to attach a minimal perl script (simplified version of what I am trying to do) that shows how to reproduce this problem. I am not quite sure yet if I am doing something wrong or not, but it seems to work perfectly with smaller input buffers. Furthermore, I did try to find some hints if there are some limits or if that is a known problem (or other users experienced the same issue) but I didn't find much. Also playing around with MemLimit, Bufsize and LimitOutput didn't help at all. To explain the code (attached): when using a small (random) input length to the lzma compression it always work. But after we increase that size/length to about 64k, it most of the time won't work. If it is much larger than 64k it always fails. What I mean by "it doesn't work" is that the perl script isn't able to compress a (huge) random string and later on decompress it to the original string. Can you please help me to understand the problem ?

Message body is not shown because sender requested not to inline it.

Thu Jan 05 06:04:14 2017 perl_monkey [...] Safe-mail.net - Correspondence added

Subject:	[rt.cpan.org #105460] Successfully identified problem and patch proposal
Date:	Thu, 5 Jan 2017 06:03:54 -0500
To:	bug-Compress-Raw-Lzma [...] rt.cpan.org
From:	perl_monkey [...] Safe-mail.net

Hey, after a lot of troubleshooting and debugging it seems that I was able to identify the main cause of the problem reported by me in https://rt.cpan.org/Public/Bug/Display.html?id=105460 . I dissected my minimal perl script even further and tried to see when exactly the problems occur and if it is somehow reproduceable/understandable. I made these changes to the lzma_test.pl script, provided (by me) in the orginal bug report. Diff: - $status = $lz->code ($random_input, $compressed_data); - $status = $lz->flush ($compressed_data); + my $temporary_res = ""; + + $status = $lz->code ($random_input, $compressed_data); + $temporary_res .= $compressed_data; + + $status = $lz->flush ($compressed_data); + $temporary_res .= $compressed_data; + + $compressed_data = $temporary_res; ... and it surprisingly started to work even for larger inputs to the raw encoder. Then I of course wanted to understand why we need to do so (and why apparently only for large inputs). I had a further look at the documentation at http://search.cpan.org/~pmqs/Compress-Raw-Lzma/lib/Compress/Raw/Lzma.pm and of course noticed immediately the "AppendOutput" option (which seemed to be related). Furthermore, I saw that for Compress::Raw::Lzma::RawEncoder the default should be true/1 (quote: "Defaults to 1."). I then tried to do a different patch to the original lzma_test.pl script: - my ($lz, $status) = Compress::Raw::Lzma::RawEncoder->new; + my ($lz, $status) = Compress::Raw::Lzma::RawEncoder->new (AppendOutput => 1); ... and it surprisingly started to work (even without the "$temporary_res" changes above), even with very large inputs. So my conclusion now was that the defaults somehow are not correct (or at least do not reflect what the documentation says). They conflict with each other. While looking at http://search.cpan.org/~pmqs/Compress-Raw-Lzma/lib/Compress/Raw/Lzma.pm I came accross some further conflicts (i.e. where the documentation does not reflect what the code does). The funny thing is that I remember trying all possible options before submitting the bug, but it seems that I didn't test to specify only the "AppendOutput => 1" option. Instead, as far as I remember, I tried to combine some options (like AppendOutput + ForZip etc) and therefore the output still was not as expected/correct. Here are some other problems that I noticed while reading http://search.cpan.org/~pmqs/Compress-Raw-Lzma/lib/Compress/Raw/Lzma.pm : - There are no default values listed for Compress::Raw::Lzma::StreamDecoder and Compress::Raw::Lzma::RawDecoder (which in my opinion would be important too). - For Compress::Raw::Lzma::AutoDecoder and Compress::Raw::Lzma::AloneDecoder the documentation says MemLimit "Default is unlimited.", while the code specifies a limit of 128 *1024 *1024 I will attach to this email a patch proposal. It changes the defaults within the code because in my opinion changing the documentation would not really solve the problem. If instead you think solving the conflict by changing the documentation would be better, I highly suggest that we change the default example. Most importantly this section: my ($lz, $status) = new Compress::Raw::Lzma::RawEncoder [OPTS] or die "Cannot create lzma object: $status\n"; $status = $lz->code($input, $output); $status = $lz->flush($output); ... because we now know that this is not correct (especially with the RawEncoder), because without forcing to set AppendOutput to 1, it will most likely fail (at least with "larger" inputs). Furthermore, if we would change only the documentation we must make it very clear that AppendOutput defaults to 0 and we need to add some warnings about not forgetting to save the temporary output to a variable if the whole output is expected to be within the "output". Since it is very likely that the users will make the mistake to forget to set AppendOutput => 1 or to forget to save the temporary results and "concatenate"/use them, I would suggest that the code fix (see patch) is much more appropriate and safer (I have tested it of course and with the fixes the defaults are correct). Thanks

Message body is not shown because sender requested not to inline it.

Sun Jan 08 17:57:12 2017 pmqs [...] cpan.org - Correspondence added

Hey, thanks for the feedback. My intention was that AppendOutout would default to 0, plus that is what the code actually does, and has been doing for years. Changing the code behavior to match the documentation runs the risk of breaking existing code. The fix I will make is to change the documentation to match the code. Your original code (below) failed because the call to flush overwrote the data stored in the $compressed_data variable that was populated by the call to $lz->code. $status = $lz->code ($random_input, $compressed_data); $status = $lz->flush ($compressed_data); cheers Paul

Sun Jan 08 17:57:12 2017 The RT System itself - Status changed from 'new' to 'open'

Sun Jan 08 17:57:17 2017 pmqs [...] cpan.org - Status changed from 'open' to 'patched'

Tue Jan 10 09:28:40 2017 PERLMONKY [...] cpan.org - Correspondence added

Hey Paul, thanks for the reply... didn't honestly expect a reply that quickly! ;) I'm just kiding... but I must honestly say I was desperately searching for a solution for this problem (since 2015!) and it seemed like (and I think it really was the case that) nobody could help me (didn't receive a reply here, nobody could help me on various perl-related IRC channels, I even tried to email you directly... but without success). That is also what makes me wondering... did you see the issue/problem in the past but you too were unsure what the problem could be ? If that was the case, it would suggest that it is somehow non-intuitive that we need to temporarily store the intermediate results etc... and finally it would mean that my code (also through your eyes) looked perfectly fine and it didn't trigger an alert (about the AppenOutput problem). What I am trying to say is, that I still think that it is non-intuitive that AppendOutput defaults to 0 for all Encoders. On the other side, I see your point about the risk of breaking perl scripts that are in the wild (on the other hand... one could also discuss that maybe some of them are exactly having the same problem like I did and maybe just the authors didn't test them well enough. I instead, tested my code and that was the reason that I found this problem. But I guess not everybody would notice the problem immediately if the scripts were not tested carefully with various amount of data and especially very large random data!). So one could in theory also speculate that by fixing the code (instead of the documentation) in some cases it would even fix those perl scripts (instead of breaking them). I'm not sure about the exact number of scripts that are broken (because they forgot to set AppendOutput => 1), nor the exact number of scripts that do store the intermediate results (or who force to set AppendOutput => 1). I have the feeling that many users (of course me included) who use this module would indeed take it for granted that the documentation matches the code and that AppendOutput always was set to 1. Having that exact number would probably help to understand which fix would be better (code fix or documentation fix?). I can perfectly deal with your decision to "never break existing code" (again: not sure if by "fixing" the code to suit the "old" documentation would do more harm than good or vice-versa). What I really ask (beg? ;) ) you to do is to change the documentation in a way that it will deal with all these following points: 1. Fix the SYNOPSIS section and all examples within the documentation to make it clear that one needs to either store the output of both code () and flush () or specify AppendOutput => 1 2. Add a paragraph to the documentation explaining what code () and flush () do and why one needs to either store the result of both or specify AppendOutput => 1 for ALL encoders. Aka warn the user that for large inputs the overall output of LZMA is the concatenation of the results of code () and flush () and that the users of this modules should make sure to not forget that flush () might override the variable (if only a single variable was used) if AppendOutput was not set to 1. 3. Whenever the AppendOutput option is explained (i.e. where we find this sub string: "AppendOutput => 0|1") in the documentation, clearly state that one either needs to force AppendOutput to 1 or "externally" deal with the intermediate results (and concatenate them to have the whole compressed data)... this is the case for ALL encoders. 4. Add an easy noticeable note (highlighted at the top?) to the documentation (and maybe to other files too, like the "Changes" file) to make it clear that the documentation changed with version > 2.071 and what those changes imply (and maybe also explain that some older perl scripts which "forgot" to set AppendOutput to 1 may be broken). Essentially, add a warning messages that some scripts that were implemented with older versions (< 2.071) might be broken or could return unexpected results. For #1 I suggest to include this extended and fixed example: SYNOPSIS ... # AppendOutput defaults to 0 my ($lz, $status) = new Compress::Raw::Lzma::RawEncoder or die "Cannot create lzma object: $status\n"; $status = $lz->code($input, $output_code_part); # ... # use the intermediate encoder's output stored in $output_code_part # ... $status = $lz->flush($output_flush_part); # ... # use the second/final part of the encoder's output # Note: the concatenation of $output_code_part and $output_flush_part is equal to the full output of the LZMA encoder # ... # same example with concatenation of the intermediate outputs within this script my ($lz, $status) = new Compress::Raw::Lzma::RawEncoder or die "Cannot create lzma object: $status\n"; $status = $lz->code($input, $output_code_part); $status = $lz->flush($output_flush_part); my $output = $output_code_part . $output_flush_part; # same example with AppendOutput set to 1 (AppendOutput defaults to 0) my ($lz, $status) = new Compress::Raw::Lzma::RawEncoder (AppendOutput => 1) or die "Cannot create lzma object: $status\n"; $status = $lz->code($input, $output); $status = $lz->flush($output); ... Thank you again Paul

Sun Jan 15 14:13:30 2017 pmqs [...] cpan.org - Correspondence added

Thanks for the feedback. I will add more documentation to cover AppendOutput. cheers Paul

Fri Mar 10 19:03:25 2017 pmqs [...] cpan.org - Status changed from 'patched' to 'resolved'

Fri Mar 10 19:03:26 2017 pmqs [...] cpan.org - Fixed in 2.072 added