
This queue is for tickets about the IO-Compress CPAN distribution.

Report information
The Basics
Id: 121545
Status: open
Priority: 0/
Queue: IO-Compress

People
Owner: Nobody in particular
Requestors: jeroen.vanwolffelaar [...] booking.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: IO::Uncompress::Gunzip silently does not process utf8 strings
Date: Sat, 6 May 2017 18:39:26 +0200
To: bug-IO-Compress [...] rt.cpan.org
From: Jeroen van Wolffelaar <jeroen.vanwolffelaar [...] booking.com>
This is a bug report for perl from jeroen.vanwolffelaar@booking.com,
generated with the help of perlbug 1.39 running under perl 5.18.2.

-----------------------------------------------------------------
Calling

    my $text = "This is the raw text";
    gzip \$text, \my $gz or die;
    my $utf8gz = $gz;
    utf8::upgrade($utf8gz);
    gunzip \$utf8gz => $filename|\$uncompressed_data;

succeeds (returns 1), while stuffing the input data into file
$filename/variable $uncompressed_data, not decompressing anything.

It should either decompress the string (even though it's wasteful/silly
that it is utf8), or return an error.

Considering that File::Slurp::write_file just stuffs the (utf8) data
without complaining as bytes to a file, I'd expect gunzip to treat such
utf8 variable exactly the same way and with the same interpretation.

[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
    category=library
    severity=medium
    module=IO::Uncompress::Gunzip
---
Site configuration information for perl 5.18.2:

Configured by root at Mon Feb 20 16:49:45 CET 2017.

Summary of my perl5 (revision 5 version 18 subversion 2) configuration:

  Platform:
    osname=linux, osvers=, archname=x86_64-linux
    uname='linux gnulinux '
    config_args=''
  Compiler:
    cc='gcc', ccflags ='',
    optimize='-O2',
    cppflags=''
    ccversion='', gccversion='', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =''
    libpth=/usr/lib64
    libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -fstack-protector'

Locally applied patches:

---
@INC for perl 5.18.2:
    lib

---
Environment for perl 5.18.2:
    HOME=/home/jvanwolffela
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/jvanwolffela/perl5/bin
    PERL5LIB=lib
    PERL_BADLANG (unset)
    PERL_LOCAL_LIB_ROOT=/home/jvanwolffela/perl5:/home/jvanwolffela/perl5
    PERL_MB_OPT=--install_base "/home/jvanwolffela/perl5"
    PERL_MM_OPT=INSTALL_BASE=/home/jvanwolffela/perl5
    SHELL=/bin/bash

--Jeroen
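
The reproduction in the report above can be written as a self-contained script. This is only a sketch: it assumes IO::Compress is installed, and it deliberately prints which outcome occurred rather than assuming one, since the behaviour may differ between IO::Compress versions. The eval guard and the branch labels are illustrative additions, not part of the original report.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $text = "This is the raw text";
gzip \$text => \my $gz or die "gzip failed: $GzipError\n";

# Same characters as $gz, but now stored internally as UTF-8.
my $utf8gz = $gz;
utf8::upgrade($utf8gz);

# Guard with eval in case a given version chooses to die instead.
my $out;
my $ok = eval { gunzip \$utf8gz => \$out };

if (!$ok) {
    print "gunzip reported failure\n";
}
elsif (defined $out && $out eq $text) {
    print "gunzip decompressed the data\n";
}
elsif (defined $out && $out eq $utf8gz) {
    print "gunzip passed the input through verbatim\n";
}
else {
    print "gunzip succeeded with some other output\n";
}
```

The report describes the third branch: a true return value with the input copied to the output unchanged.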
Hey Jeroen,

thanks for the feedback.

I'm not clear what you expect to happen by running utf8::upgrade on binary
data. Flagging it as UTF8 doesn't make sense.

    my $text = "This is the raw text";
    gzip \$text, \my $gz or die;
    my $utf8gz = $gz;
    utf8::upgrade($utf8gz);

Please shout if I'm missing something about what you are trying to do.

Also, your point about File::Slurp::write_file just working without
complaining is not what I saw when I tried it

    $ perl -MFile::Slurp -e 'write_file("/tmp/abc", "\x{20ac}\n")'
    Wide character in syswrite at /usr/local/share/perl5/File/Slurp.pm line 506.

To actually get the file written, I had to explicitly tell File::Slurp
that it was writing utf8, like this

    $ perl -MFile::Slurp -e 'write_file("/tmp/abc", {binmode => ":utf8"}, "\x{20ac}\n")'

cheers
Paul
Subject: Re: [External] [rt.cpan.org #121545] IO::Uncompress::Gunzip silently does not process utf8 strings
Date: Wed, 10 May 2017 00:54:27 +0200
To: bug-IO-Compress [...] rt.cpan.org
From: Jeroen van Wolffelaar <jeroen.vanwolffelaar [...] booking.com>
Hi,

The utf8::upgrade statement is for demonstration purposes and to have a
small, self-contained reproduction scenario. Indeed it does not make any
sense at all. The real scenario was much more convoluted, involving http
libraries, unrelated to the scope of this (IMHO still) bug in IO::Compress:
the end result is as in my reproduction demo, that a string that is
supposed to have (and indeed has) gzip binary data, is marked
(inadvertently) utf8.

When this string was written to a file via File::Slurp, and later picked up
by gunzip's filename support, things worked swell. When I 'optimised' to
not write (needlessly) to disk, things went awry and caused me a bit of
debugging pain. Needless to say I expected this change to not have any
change in output.

My point of view is that either:

- gunzip does *not* find utf8 strings acceptable input. It should then not
  return anything to its second argument, and return '0' itself indicating
  failure, perhaps emitting a warning and/or setting $!.
- gunzip *does* find utf8 strings acceptable (although a look of
  disapproval is in order). In this case, it should actually decompress
  (this is the behaviour that File::Slurp::write_file() chooses: write
  bytes to file).

The current behaviour, writing the input verbatim to the output, *without*
decompressing, is IMHO plain wrong behaviour, and poor error handling.
/usr/bin/gzip doesn't do this, nor does zlib (C or perl version), etc.

As to your observation that you cannot write utf8 strings without
complaints to a file using F::S::write_file: you can, as long as all
characters are in the latin1 range (U+00 through U+FF). Which is the same
assumption that utf8::downgrade makes: that you just want plain old latin1,
however wrong that is, but at least it's probably consistent with how you
got your 'utf8' marked gzipped binary data in the first place.

In an ideal world, people would never automatically convert between 'array
of bytes' and 'string of unicode codepoints', but that's not what perl
currently does.

Thanks,
--Jeroen

On Tue, May 9, 2017 at 5:06 PM, Paul Marquess via RT <
bug-IO-Compress@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=121545 >
>
> Hey Jeroen,
>
> thanks for the feedback.
>
> I'm not clear what you expect to happen by running utf8::upgrade on binary
> data. Flagging it as UTF8 doesn't make sense.
>
>     my $text = "This is the raw text";
>     gzip \$text, \my $gz or die;
>     my $utf8gz = $gz;
>     utf8::upgrade($utf8gz);
>
> Please shout if I'm missing something about what you are trying to do.
>
> Also, your point about File::Slurp::write_file just working without
> complaining is not what I saw when I tried it
>
>     $ perl -MFile::Slurp -e 'write_file("/tmp/abc", "\x{20ac}\n")'
>     Wide character in syswrite at /usr/local/share/perl5/File/Slurp.pm line
> 506.
>
> To actually get the file written, I had to explicitly tell File::Slurp
> that it was writing utf8, like this
>
>     $ perl -MFile::Slurp -e 'write_file("/tmp/abc", {binmode => ":utf8"},
> "\x{20ac}\n")'
>
> cheers
> Paul
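
The point about utf8::downgrade suggests a caller-side workaround, sketched below. It is an assumption about how a caller might cope, not something IO::Compress does itself: the script downgrades a copy of the buffer to plain bytes before passing it to gunzip. The variable name $bytes is made up for illustration.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $text = "This is the raw text";
gzip \$text => \my $gz or die "gzip failed: $GzipError\n";

# Simulate a buffer that some other library has marked as utf8.
my $utf8gz = $gz;
utf8::upgrade($utf8gz);

# Work on a copy and turn it back into bytes. With a true second
# argument, utf8::downgrade returns false instead of dying when the
# string contains characters above U+00FF.
my $bytes = $utf8gz;
utf8::downgrade($bytes, 1)
    or die "buffer contains characters above U+00FF, not byte data\n";

gunzip \$bytes => \my $out
    or die "cannot gunzip: $GunzipError\n";

print $out eq $text ? "round trip ok\n" : "round trip mismatch\n";
```

The downgrade succeeds here because every character in an upgraded gzip buffer is in the U+00 through U+FF range, which is the same latin1 assumption described in the message above.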
Hey Jeroen,

aaah, I see what your problem is now! Your example of how
File::Slurp::write_file worked made me think your issue was with the
compressed payload data, rather than with the complete gzip container.

You can get gunzip to complain by setting the "Transparent" option to 0,
like this

    gunzip \$utf8gz => \my $buffer, Transparent => 0
        or die "cannot gunzip: $GunzipError\n";

Does that solve your issue?

Paul
Subject: Re: [External] [rt.cpan.org #121545] IO::Uncompress::Gunzip silently does not process utf8 strings
Date: Wed, 10 May 2017 11:16:21 +0200
To: bug-IO-Compress [...] rt.cpan.org
From: Jeroen van Wolffelaar <jeroen.vanwolffelaar [...] booking.com>
Ah... I didn't know about that option. It surprises me -- well, the option
has its uses I guess, but it being default '1' is, let's put it this way,
not a decision I would have taken :).

Unless you read through all the options in perldoc and notice
"Transparent", what it does, and its default (I didn't notice, while I did
read a fair share of IO::Uncompress::Gunzip, and the FAQ, while debugging
and before filing a bug report), you expect this program:

    my $text = "this is not compressed data\n";
    gunzip \$text => \my $result or die;
    print $result;

to die, but instead, it just prints "this is not compressed data";

At the same time, I understand that changing the default of Transparent
is... not an option, for backwards compatibility reasons. However, the
perldoc page could be very warning-ish and vocal about it -- that if you
have an error in your input (like, not actual gz data), you will just get
your input back. And the examples could list Transparent => 0 in them all
to show that this is a conservative default.

That is a spin off of this bug then, "Transparent => 1 is default and
people may not expect that" -- with, I guess, only documentation changes as
an action point (if you agree).

The actual core part of this bug would then be, rephrased: "Under
Transparent => 1, gunzip detects utf8-encoded gzip data as "not gz-data",
and hence passes it through verbatim". Given that you can easily end up
with utf8 data through interfacing libraries etc etc if you do things
wrong, and things like 'print', writing to file, and basically all normal
I/O operations "just" still work without complaining, I would still expect
also gunzip to "just" still work under this condition, and behave as if the
data is bytes turned to utf8 with utf8::upgrade.

--Jeroen

On Wed, May 10, 2017 at 10:34 AM, Paul Marquess via RT <
bug-IO-Compress@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=121545 >
>
> Hey Jeroen,
>
> aaah, I see what your problem is now! Your example of how
> File::Slurp::write_file worked made me think your issue was with the
> compressed payload data, rather than with the complete gzip container.
>
> You can get gunzip to complain by setting the "Transparent" option to 0,
> like this
>
>     gunzip \$utf8gz => \my $buffer, Transparent => 0
>         or die "cannot gunzip: $GunzipError\n";
>
> Does that solve your issue?
>
> Paul
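
The difference the Transparent option makes can be seen side by side on input that is plainly not gzip data. The sketch below follows the behaviour described in this thread and in the IO::Uncompress::Gunzip documentation. The variable names $lenient and $strict are illustrative, and the exact text of $GunzipError is not assumed.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $text = "this is not compressed data\n";

# Default (Transparent => 1): input that is not gzip data is passed through.
my $ok = gunzip \$text => \my $lenient;
print "default:  ", ($ok ? "succeeded" : "failed"), "\n";
print "          output equals input\n"
    if $ok && $lenient eq $text;

# Transparent => 0: the same input is rejected and $GunzipError is set.
my $strict_ok = gunzip \$text => \my $strict, Transparent => 0;
print "strict:   ", ($strict_ok ? "succeeded" : "failed: $GunzipError"), "\n";
```

Passing Transparent => 0 on every call is the conservative choice suggested above for code that must not silently accept uncompressed input.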
I suspect that if I were writing this again, I wouldn't make Transparent the default. I agree, though, that this needs to be documented.