Skip Menu |

This queue is for tickets about the File-Slurp CPAN distribution.

Report information
The Basics
Id: 84918
Status: open
Priority: 0/
Queue: File-Slurp

People
Owner: cwhitener [...] gmail.com
Requestors: corion [...] corion.net
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: (no value)
Fixed in: (no value)



Subject: read_file() ignores binmode option for short files
Date: Mon, 29 Apr 2013 21:50:18 +0200
To: bug-File-Slurp [...] rt.cpan.org
From: Max Maischein <corion [...] corion.net>
Hello, thanks for writing File::Slurp. I noticed a bug in File::Slurp which leads to bad data being read. The binmode option is ignored in the code path for short files. Especially when reading and writing text files on Windows using {binmode => ':raw'}, but also when processing UTF-8 files, this is quite bad. The quick workaround is to simply delete that wrong optimization at the start of read_file(). If you want to keep the code path for short files, you will have to come up with your own way of reimplementing IO layers, or at least detect :raw and likely :utf-8 layers and act on them appropriately. Especially the line to "fix" Windows input does not seem prudent: $buf =~ s/\015\012/\n/g if $is_win32 ; Thanks for looking at this, -max
Show quoted text
> I noticed a bug in File::Slurp which leads to bad data being read. The > binmode option is ignored in the code path for short files. Especially > when reading and writing text files on Windows using {binmode => > ':raw'}, but also when processing UTF-8 files, this is quite bad.
Reviewing this bug, the problem is not in the short path, but in the long path, which does not cope with read_file( 'file.txt', { binmode => ':crlf' }); or read_file( 'file.txt', { binmode => ':encoding(Latin-1)' }); on Windows, due to the hand-rolled "fixup" of newlines under the assumption that all binmode arguments need to trigger this. I've attached a test file for this. The tests fail under Windows currently.
Subject: newline_binmode.t
use strict; use warnings; use IO::Handle (); use File::Basename (); use File::Spec (); use lib File::Spec->catdir(File::Spec->rel2abs(File::Basename::dirname(__FILE__)), 'lib'); use FileSlurpTest qw(temp_file_path); use File::Slurp qw(read_file write_file); use Test::More; plan tests => 6; my $binmode; for (':encoding(Latin-1)', ':crlf', ':raw') { $binmode = $_; my $data = "\n\n\n"; my $file_name = temp_file_path(); stdio_write_file($file_name, $data); my $slurped_data = read_file($file_name, { binmode => $binmode }); my $stdio_slurped_data = stdio_read_file( $file_name ) ; print 'data ', unpack( 'H*', $data), "\n", 'slurp ', unpack('H*', $slurped_data), "\n", 'stdio slurp ', unpack('H*', $stdio_slurped_data), "\n"; is($data, $slurped_data, "slurp ($binmode)"); write_file($file_name, { binmode => $binmode }, $data ); $slurped_data = stdio_read_file($file_name); is($data, $slurped_data, "spew ($binmode)"); unlink $file_name; }; sub stdio_write_file { my ($file_name, $data) = @_; open (my $fh, '>', $file_name) || die "Couldn't create $file_name: $!"; binmode $fh, $binmode; $fh->print($data); } sub stdio_read_file { my ($file_name) = @_; open (my $fh, '<', $file_name ) || die "Couldn't open $file_name: $!"; binmode $fh, $binmode; local $/; my $data = <$fh>; return $data; }
Hi Corion, I believe this to be resolved in the current release that refactored read_file(). Thanks, Chase
Subject: Re: [rt.cpan.org #84918] read_file() ignores binmode option for short files
Date: Sun, 3 Feb 2019 07:40:33 +0100
To: bug-File-Slurp [...] rt.cpan.org
From: Max Maischein <corion [...] corion.net>
Hello Chase, thank you very much for working on File::Slurp! Unfortunately, the problem is not fixed (see test attached to this ticket and to https://github.com/perhunter/slurp/pull/19 . The new version does not cope properly with {binmode => ':encoding(Latin-1)'} for example, because it does _not_ apply the :crlf handling in that situation when normal reading would. -max Am 03.02.2019 um 01:08 schrieb Chase Whitener via RT: Show quoted text
> <URL: https://rt.cpan.org/Ticket/Display.html?id=84918 > > > Hi Corion, > > I believe this to be resolved in the current release that refactored read_file(). > > Thanks, > Chase >
This problem still exists. Please see the attached test. How can I help so that instead of closing tickets you run the tests I include? I have already created a pull request to add the test at https://github.com/perhunter/slurp/pull/19 , which remains unmerged. I'm at a loss how to support you better here.
Apologies for closing the ticket out prematurely. On Sat Mar 09 08:54:19 2019, CORION wrote: Show quoted text
> This problem still exists. Please see the attached test. > > How can I help so that instead of closing tickets you run the tests I > include? I have already created a pull request to add the test at > https://github.com/perhunter/slurp/pull/19 , which remains unmerged. > I'm at a loss how to support you better here.
9:11 PM <genio> Does anyone quite know how I should go about https://rt.cpan.org/Public/Bug/Display.html?id=84918 ? 9:11 PM <dipsy> [ Bug #84918 for File-Slurp: read_file() ignores binmode option for short files ] 9:11 PM <genio> No matter what I do to accommodate that problem, I break binmode.t on Windows 9:14 PM <Grinnz> so if i'm reading correctly the problem is that it opens handles with :raw by default 9:14 PM <Grinnz> thus default layers never get applied 9:15 PM <genio> yea 9:15 PM <shadowpaste> "genio" at 217.168.150.38 pasted "attempt_one.diff" (24 lines) at http://paste.scsys.co.uk/583733 9:15 PM <dipsy> [ magnet_web paste from "genio" at 217.168.150.38... ] 9:16 PM <Grinnz> do you know why :raw is even used? 9:16 PM <genio> That allows the current test suite to pass, but makes his attached test fail on the :encoding(Latin-1) test 9:16 PM <Grinnz> you know what better question, what does File::Slurper do 9:16 PM <genio> Grinnz: to not break back-compat with the old way of doing things and to deal with file handles and file paths similarly 9:17 PM <Grinnz> right, File::Slurper has a separate read_text that can care about default layers 9:17 PM <genio> I don't know that it's possible to fix without killing bugwards compatibility 9:17 PM <Grinnz> though actually even there it has a special case for the crlf layer on windows 9:18 PM <Grinnz> so, i guess you just gotta do that 9:18 PM <Grinnz> but yeah, it will result in different results on windows so it would be backwards incompatible 9:19 PM <Grinnz> oh, maybe not since it did that postprocessing to emulate crlf 9:19 PM <Grinnz> but only without layers passed 9:20 PM <Grinnz> i'd say it's impossible to know in File::Slurper whether the user desires text or binary processing of the file in general 9:20 PM <Grinnz> er, in File::Slurp 9:21 PM <Grinnz> i guess if they pass :utf8 or any :encoding layer you can assume text processing 9:22 PM <Grinnz> but according to the workaround that was there, basically any binmode passed would assume binary processing 9:23 PM <Grinnz> so tldr: yes it would be incompatible, tell them to use File::Slurper for a less buggy module 9:24 PM <genio> Either that or just go ahead and break bugwards compatibility on Windows because who the hell's been using windows!? :) 9:24 PM <genio> I kid. mostly 9:25 PM <Grinnz> you've got a few hours left in the states for an april fools release that just forwards to File::Slurper ;) 9:40 PM <genio> so, I guess I'm at a stopping point for getting any work done on File::Slurp. anything I do at this point breaks something 9:43 PM <Grinnz> i think "making it not completely break" is plenty 9:52 PM <haarg> binmode.t tests that providing a binmode at all causes it to avoid newline translation 9:53 PM <haarg> so yeah i don't think there's anything you can do without breaking one of the existing tests 9:53 PM <haarg> aside from documenting it 9:55 PM <Grinnz> i assume that "feature" stems from the misconception of binmode being used only to set the raw layer 9:58 PM <haarg> well, that test is testing :utf8 9:59 PM <Grinnz> that's a whole other ball of fun 10:04 PM <haarg> otherwise https://gist.github.com/haarg/60d51a5c675e2076b2b8aac6b3adb3bc would be a sensible fix 10:04 PM <Grinnz> an incompatible one, but yeah 10:05 PM <haarg> breaks binmode.t on windows, yeah 10:06 PM <Grinnz> it would also slow down a lot of cases that don't pass binmode but should 10:13 PM <haarg> perl and io on windows are already slow enough that i doubt anyone would notice 10:13 PM <Grinnz> i was talking about on unix 10:14 PM <haarg> :raw does nothing on unix 10:14 PM <Grinnz> isn't it what binmode with no argument applies? 10:15 PM <haarg> yes 10:15 PM <haarg> and it does nothing 10:17 PM <Grinnz> i guess i'm conflating the effect on non-slurpy readline with performance 10:18 PM <haarg> it can turn off other layers that were applied, but used on its own it does nothing 10:19 PM <haarg> there is the :unix 'layer' that can be used and provide some speed improvements 10:21 PM <Grinnz> maybe that's what i'm thinking of 10:21 PM <Grinnz> layers are a fucking mess tbh 10:21 PM <haarg> yeah 10:25 PM <haarg> by default there is a unix layer and a perlio layer. the perlio layer ignores the unix layer. applying the unix layer removes the perlio layer. the utf8 layer sets a flag on the perlio layer. i think the default windows behavior works similarly but doesn't have a name that can be applied. 10:29 PM <haarg> ah on windows it gets the crlf layer instead of the perlio layer 10:29 PM <haarg> but still has the unix layer 10:29 PM <Grinnz> naturally 10:29 PM <haarg> makes perfect sense 10:29 PM <Grinnz> File::Spec does it, why not 10:30 PM <haarg> and if you use :raw, it turns off the crlf flag on the crlf layer 10:30 PM <Grinnz> 🤨 10:30 PM <haarg> it's still there though, because it's the thing actually doing the io work ... 10:32 PM <haarg> genio: document it as needing to use :raw:encoding(Latin-1) if you need consistency 10:32 PM <Grinnz> i think for File::Slurp backcompat is all it has going for it 10:33 PM <haarg> ^ 10:34 PM <genio> leave it as is has been voted for. anyone want to write up a bit of documentation on how the current approach is broken for write_file() ? ... 10:36 PM <Grinnz> i thought we were talking about read_file ... 10:36 PM <genio> gah. yes. I errantly said write_file. meant read_file 10:40 PM <haarg> it's essentially: without binmode is default perl behavior (crlf translation on windows), with binmode is starting from :raw. if you want crlf translation and unicode handling, use :encoding(UTF-8):crlf 10:41 PM <Grinnz> read_file currently skips CRLF translation on Windows if any binmode is passed, even text encoding layers. For a more straightforward way to read and decode a file as text, try L<File::Slurper/read_text>. 10:41 PM <Grinnz> or yeah mention the manual use of :crlf 10:43 PM <Grinnz> applying :crlf manually won't be portable to unix though right 10:44 PM <haarg> crlf works fine on unix 10:44 PM <Grinnz> right but i mean, it will still function 10:44 PM <Grinnz> where on unix you don't want it to generally 10:45 PM <haarg> if it's a text file it won't really do any harm 10:45 PM <haarg> for read_file it's perfectly sensible to use on both windows and unix
Corion, I don't think there's much we can do here. I've asked around in the hopes that it was just me not being creative enough, but it doesn't appear that there's a solution that won't break some already defined behavior. I'd be happy to entertain ideas, though. Thanks, Chase
Subject: Re: [rt.cpan.org #84918] read_file() ignores binmode option for short files
Date: Fri, 5 Apr 2019 08:26:43 +0200
To: bug-File-Slurp [...] rt.cpan.org
From: Max Maischein <corion [...] corion.net>
Hello Chase, Show quoted text
> I don't think there's much we can do here. I've asked around in the hopes that it was just me not being creative enough, but it doesn't appear that there's a solution that won't break some already defined behavior. I'd be happy to entertain ideas, though.
Yes - personally, I think that read_file() should behave like do { local $/; <> } does, since originally File::Slurp had been intended as "simply" a faster implementation. But I think the design flaws in File::Slurp (and our adherence to bugwards compatibility) will prevent us from fixing this flaw on Windows. The bug has always been there on Windows, but on the upside, there is no proclamation in the module that it is the best module to read data from a file anymore. I haven't spent time with the reworked code, but maybe simply replacing all that code with: if( $binmode ) { binmode $fh => $binmode; }; { local $/; <$fh> } makes the test suite pass, but I consider the (original) test suite potentially flawed, most likely due to infamiliarity of the original author with Windows. Thanks for investigating this, -max
Well, I don't think that it was that. We have to remember that this module was written well before Perl layers and unicode and ... It was doing the right thing for a very long time. So, we're still at somewhat of an impasse. While a large part of me sees it as kind of OK to break current practice in favor of doing the right thing on Windows as well, the other part of me does not agree. Two options: 1) Keep the current functionality and document the bug on Windows. This documentation would need to explain the problem and the reasoning for not fixing it. 2) Break back-compat on Windows and let the Perl layers do the line ending conversions for us on the various user-supplied layers via binmode. This would break some assumptions about how the _module_ works on Windows, but would comply with most people's assumptions about what the code _should_ be doing. This would also need a heaping helping of documentation. I don't want to make a BDFL, fist-on-the-table declaration about what to do here. This may be vote time. Currently, there are two maintainers, myself and Uri. I am not sure how Uri feels about this topic as we haven't yet discussed it. My work thus far has been strongly focusing on _NOT_ breaking any backwards compatibility. I would not want to do anything at all here without Uri's and maybe some sort of other vote process in place. It's my opinion that Uri's vote overrides whatever vote I may cast and even that of any type of community vote. Uri, what are your opinions? There's a lot going on in this ticket and much to digest. Also, just because I can only think of the two options above doesn't mean that there isn't a third or nth. If you have other options, I'd love to hear them. Thanks, Chase
Subject: Re: [rt.cpan.org #84918] read_file() ignores binmode option for short files
Date: Sat, 27 Apr 2019 22:53:36 -0400
To: bug-File-Slurp [...] rt.cpan.org
From: Uri Guttman <uri [...] stemsystems.com>
On 4/20/19 10:48 AM, Chase Whitener via RT wrote: Show quoted text
> Queue: File-Slurp > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=84918 > > > Well, I don't think that it was that. We have to remember that this module was written well before Perl layers and unicode and ... It was doing the right thing for a very long time.
yep. i was aiming just to make it easy and fast to slurp in plain text files. in my long perl career, i have yet had to deal directly with unicode (much to my happiness! :). Show quoted text
> > So, we're still at somewhat of an impasse. While a large part of me sees it as kind of OK to break current practice in favor of doing the right thing on Windows as well, the other part of me does not agree. > > Two options: > > 1) Keep the current functionality and document the bug on Windows. This documentation would need to explain the problem and the reasoning for not fixing it. > > 2) Break back-compat on Windows and let the Perl layers do the line ending conversions for us on the various user-supplied layers via binmode. This would break some assumptions about how the _module_ works on Windows, but would comply with most people's assumptions about what the code _should_ be doing. This would also need a heaping helping of documentation. > > I don't want to make a BDFL, fist-on-the-table declaration about what to do here. This may be vote time. Currently, there are two maintainers, myself and Uri. I am not sure how Uri feels about this topic as we haven't yet discussed it. My work thus far has been strongly focusing on _NOT_ breaking any backwards compatibility. I would not want to do anything at all here without Uri's and maybe some sort of other vote process in place. It's my opinion that Uri's vote overrides whatever vote I may cast and even that of any type of community vote.
the problem is that i still don't get all the problems and issues here. i just coded up line endings and nothing else. i passed along binmode as it was for real binary files. i hadn't thought that unicode files were considered binary or whatever. i am showing my massive ignorance on this topic. i still don't see why we can't bypass all the layers and such (this is a general question, not specifically slurp related). my view on encoding/decoding (which i do have experience with) is to do them at the edge of the system and keep a standard internal format. when reading in a unicode file, i say treat it raw (no layers), then the user should deal with the encoding/decoding then. in fact the terms encode/decode make no sense for unicode. it is more a translate or something. the module Encode is very poorly named. i don't see why perl has to get involved in inferring what the coder really wants. it should be explicit what to do with the text from the file. Show quoted text
> Uri, what are your opinions? There's a lot going on in this ticket and much to digest. Also, just because I can only think of the two options above doesn't mean that there isn't a third or nth. If you have other options, I'd love to hear them. > >
as you say there is a lot going on. one part of me hates redmond and says break it! use linux if you want proper support and slurping. the other say leave the damned bug, document the hell out of it. windows users won't read the docs anyhow and they get screwed either way. if i am not making sense to you, i am in agreement with it not making sense to me. so my gut says, document the bug and show some reasonable workaround. perl already has issues with windows (ever try to do terminal i/o with select on windows?? hell on earth). hope this helped. i doubt it did. :/ thanx, uri
On Sat Apr 27 22:53:52 2019, uri@stemsystems.com wrote: Show quoted text
> On 4/20/19 10:48 AM, Chase Whitener via RT wrote:
> > Queue: File-Slurp > > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=84918 > > > > > Well, I don't think that it was that. We have to remember that this > > module was written well before Perl layers and unicode and ... It was > > doing the right thing for a very long time.
> > yep. i was aiming just to make it easy and fast to slurp in plain text > files. in my long perl career, i have yet had to deal directly with > unicode (much to my happiness! :).
> > > > So, we're still at somewhat of an impasse. While a large part of me > > sees it as kind of OK to break current practice in favor of doing the > > right thing on Windows as well, the other part of me does not agree. > > > > Two options: > > > > 1) Keep the current functionality and document the bug on Windows. > > This documentation would need to explain the problem and the > > reasoning for not fixing it. > > > > 2) Break back-compat on Windows and let the Perl layers do the line > > ending conversions for us on the various user-supplied layers via > > binmode. This would break some assumptions about how the _module_ > > works on Windows, but would comply with most people's assumptions > > about what the code _should_ be doing. This would also need a heaping > > helping of documentation. > > > > I don't want to make a BDFL, fist-on-the-table declaration about what > > to do here. This may be vote time. Currently, there are two > > maintainers, myself and Uri. I am not sure how Uri feels about this > > topic as we haven't yet discussed it. My work thus far has been > > strongly focusing on _NOT_ breaking any backwards compatibility. I > > would not want to do anything at all here without Uri's and maybe > > some sort of other vote process in place. It's my opinion that Uri's > > vote overrides whatever vote I may cast and even that of any type of > > community vote.
> the problem is that i still don't get all the problems and issues > here. > i just coded up line endings and nothing else. i passed along binmode > as > it was for real binary files. i hadn't thought that unicode files were > considered binary or whatever. i am showing my massive ignorance on > this > topic. > > i still don't see why we can't bypass all the layers and such (this is > a > general question, not specifically slurp related). my view on > encoding/decoding (which i do have experience with) is to do them at > the > edge of the system and keep a standard internal format. when reading > in > a unicode file, i say treat it raw (no layers), then the user should > deal with the encoding/decoding then. in fact the terms encode/decode > make no sense for unicode. it is more a translate or something. the > module Encode is very poorly named. i don't see why perl has to get > involved in inferring what the coder really wants. it should be > explicit > what to do with the text from the file.
I'm inevitably going to word things poorly in my attempt at explaining a bit, so apologies in advance (https://perldoc.pl/PerlIO for more official wording than anything I might produce here). Perl, since Perl 5.8, is unicode-aware. Perl tries to be helpful and solves some decoding and line ending problems with the use of layers. Part of that problem solving is to also do line-ending translation on OSes where that makes sense. Yes, we could leave this up to the user to solve each time individually, but that usually leads to bad copy-pasta code that doesn't do the right thing. If I'm opening and reading in a UTF-8 encoded text file, I'd typically `open my $fh, '<:encoding(UTF-8)', '/path/file.txt';` and then when I read anything in, it will already be decoded into the internal representation of characters rather than the raw UTF-8 bytes: `my $chars = <$fh>;`. Alternately, I could `open my $fh, '<:raw', '/path/file.txt';` and then anytime I readline something in, it will be in raw bytes. In order to understand those bytes as text characters, I'd need to `my $bytes = <$fh>; my $chars = Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);`. This is more to code and debug and is more work for me. The first way would automatically handle line ending translation for me where applicable, whereas the second would not; I'd also have to handle that. Prior to Perl 5.8, this was not an issue as Perl was not yet unicode-aware and we didn't tend to worry about what encoding a particular file was in. Nowadays, we are very concerned about what it is that we're receiving. For the most part it's UTF-8, but even that can't be assumed. Show quoted text
> > Uri, what are your opinions? There's a lot going on in this ticket > > and much to digest. Also, just because I can only think of the two > > options above doesn't mean that there isn't a third or nth. If you > > have other options, I'd love to hear them. > > > >
> as you say there is a lot going on. one part of me hates redmond and > says break it! use linux if you want proper support and slurping. the > other say leave the damned bug, document the hell out of it. windows > users won't read the docs anyhow and they get screwed either way. if i > am not making sense to you, i am in agreement with it not making sense > to me. > > so my gut says, document the bug and show some reasonable workaround. > perl already has issues with windows (ever try to do terminal i/o with > select on windows?? hell on earth).
I think I've come around to the point where I'd like to suggest we break it so that the module behaves as everyone expects Perl to behave. By default we act just as `open()` does, and if the user wants to add encodings/other layers, they can. And when they do, it behaves the same way across each OS rather than having the caveat of Windows. Show quoted text
> > hope this helped. i doubt it did. :/ > > thanx, > > uri
I'm fairly convinced that such a change wouldn't be seen as a negative, but rather as a fix to ensure each OS behaves as expected. I could work up the fix and run through yet another large smoke test session on Windows to see if we've broken anything on CPAN. Thoughts? Thanks, Chase