
This queue is for tickets about the Text-CSV_XS CPAN distribution.

Report information
The Basics
Id: 122108
Status: rejected
Priority: 0/
Queue: Text-CSV_XS

People
Owner: Nobody in particular
Requestors: CLemmen [...] excelsiorintegrated.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug report for Text::CSV_XS
Date: Thu, 15 Jun 2017 19:58:29 +0000
To: "bug-Text-CSV_XS [...] rt.cpan.org" <bug-Text-CSV_XS [...] rt.cpan.org>
From: Charles Stuart Lemmen <CLemmen [...] excelsiorintegrated.com>
Hi, we've had some trouble handling files with the Text::CSV_XS package, and I believe I've narrowed down the issue. First, some info:

Package: Text::CSV_XS
Version: 1.30
Perl: Strawberry Perl 5.24.0 (MSWin32-x64-multi-thread)
O/S: Windows 10 Home

We typically init our Text::CSV_XS object reference thusly:

  my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

We also don't specify $csv->eol, but let the package sort that out. When using this package in production, we typically process one or more .csv files from a folder and get each row by calling the getline method, then checking the row results.

What we've discovered is that, though the $csv object ref normally won't change its internal representation of $csv->eol from the default of '' (empty string), when a file is encountered whose line endings are a single carriage return (AKA "\r", \x0d), the call to getline sets $csv->eol to a single carriage return. This works fine for that file, and others like it, but if the next file to be processed has line endings of "\r\n", $csv->eol is NOT reset to '', and such files will then be read in completely by a single call to getline, which naturally breaks things.

It is clear from debugging that some magic is going on internally with such EOL == "\r" files: calling tell ($csv_fh) for each "record" after the first call to getline returns the size of the file, meaning the splitting of lines is being done behind the scenes (tell () gives the expected, different result for each line of a file with "\r\n" line endings).

The workaround for us is to go against the documentation's recommendation of "...so it is probably safer to not specify eol at all.", do careful EOL discovery, and supply that to a call to $csv->eol () for each file, and in certain cases simply do away with $csv->getline () and use "while (<$csv_fh>)" along with calls to $csv->parse () instead.
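For reference, the behaviour described above can be reproduced with a short standalone script along these lines (the file names and data are made up for illustration; assumes Text::CSV_XS is installed):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Create one file with CR-only line endings and one with CRLF endings
open my $out, ">:raw", "cr.csv" or die "cr.csv: $!";
print $out "a,b\r1,2\r";
close $out;
open $out, ">:raw", "crlf.csv" or die "crlf.csv: $!";
print $out "a,b\r\n1,2\r\n";
close $out;

my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });

# Reading the CR-only file makes getline remember eol = "\r" ...
open my $fh, "<", "cr.csv" or die "cr.csv: $!";
1 while $csv->getline ($fh);
close $fh;
printf "eol after CR file (hex): %s\n", unpack "H*", $csv->eol // "";

# ... which then leaks into the parse of the next, CRLF-terminated file
open $fh, "<", "crlf.csv" or die "crlf.csv: $!";
my $records = 0;
$records++ while $csv->getline ($fh);
close $fh;
print "records read from crlf.csv: $records\n";
```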
I'm inclined to feel that if this package sometimes sets $csv->eol, then it should also either unset it when necessary, or set it every time, to avoid this problem. Hope this makes sense and is helpful to you.

Stuart Lemmen
IT Development & Support
Excelsior Integrated LLC
413-394-4340
clemmen@excelsiorintegrated.com
www.excelsiorintegrated.com
> The workaround for us is to go against the documentation's recommendation of "...so it is
> probably safer to not specify eol at all.", do careful EOL discovery, and supply that to a call
> to $csv->eol () for each file, and in certain cases simply do away with $csv->getline () and use
> "while (<$csv_fh>)" along with calls to $csv->parse () instead.
The "best" workaround would be resetting your eol after every file:

  foreach my $file (@file_names) {
      open my $fh, "<", $file or die "$file: $!";
      while (my $row = $csv->getline ($fh)) {
          # ...
          }
      $csv->eol (undef);
      }

The reason it is the way it is is two-fold:

1. Historic reasons (arguable)
2. Performance

To explain the second: it is much faster to *know* the EOL the moment you hit it. The code is optimized for EOLs with a newline in them, because that is the way most OSs represent the concept of a line. If the parser however detects that the EOL is a CR instead, it will store that knowledge to be able to parse the rest of the stream faster and more reliably (e.g. as of that moment, the NL should not be considered a record separator anymore). An alternative could be an option to NOT store that fact, and lose speed and get a possible conflict when the data also has NL characters.

As the CSV parser object does NOT store the file handle it operates/operated on, it is (very) hard to tell if this is a different stream than the previous parse cycle, and thus whether that state should be reset. I'm open to better alternatives.

If documenting this, with the workaround option as I coded above, is enough of a resolution for you, I'll add that to the docs. Thanks for the report; this is a valid case indeed.
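The per-file "EOL discovery" the reporter mentions could be sketched like this (a hypothetical helper, not part of Text::CSV_XS; it just peeks at the first chunk of the file and guesses):

```perl
use strict;
use warnings;

# Guess a file's line ending by inspecting its first 8 KiB.
# Checks "\r\n" first, since a CRLF file also matches /\r/.
sub detect_eol {
    my ($path) = @_;
    open my $fh, "<:raw", $path or die "$path: $!";
    read $fh, my ($chunk), 8192;
    close $fh;
    return "\r\n" if $chunk =~ /\r\n/;
    return "\r"   if $chunk =~ /\r/;
    return "\n";
}
```

It could then be applied before each file, e.g. $csv->eol (detect_eol ($file)), instead of relying on getline's automatic detection.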
Subject: RE: [rt.cpan.org #122108] Bug report for Text::CSV_XS
Date: Thu, 6 Jul 2017 17:25:08 +0000
To: "bug-Text-CSV_XS [...] rt.cpan.org" <bug-Text-CSV_XS [...] rt.cpan.org>
From: Charles Stuart Lemmen <CLemmen [...] excelsiorintegrated.com>
After some thought, I think that what you suggest, making sure to always call $csv->eol (undef) after every file, is prudent and probably the best way to handle this issue. That, and your adding documentation explaining why this can be necessary.

I don't think this package should be altered to somehow try to keep track of the files it uses, since that's not really its purpose, and, since it is entrenched as the go-to CSV parser, a change that dramatic might have wide-ranging, negative repercussions. Better might be a sub-class (or wrapper package) whose job it is to handle the CSV files themselves, leaving the parsing/combining grunt work to Text::CSV_XS. That is just what I've done for my needs, and so far it seems to work and test well; I made sure to document part of the reason for its existence, which is to surmount this end-of-line issue. I noticed too that there are a number of other packages that utilize/wrap around Text::CSV_XS, but a quick scan shows none of them employing this fix, so they wouldn't have worked for my purposes anyway.

I'm glad to have a simple solution to the problem I was having, as well as to know more now about why it exists, as this package is a lifesaver/workhorse for me. Thanks.

-Stu
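A minimal sketch of such a wrapper (the package name My::CSV::Reader and its methods are made up for illustration; this is not the reporter's actual code, and Text::CSV_XS itself provides no such class):

```perl
use strict;
use warnings;

package My::CSV::Reader;   # hypothetical wrapper around Text::CSV_XS
use Text::CSV_XS;

sub new {
    my ($class, %opts) = @_;
    my $self = { csv => Text::CSV_XS->new ({ binary => 1, auto_diag => 1, %opts }) };
    return bless $self, $class;
}

# Read one file, invoking $cb for each row, then reset eol so that any
# line-ending state detected in this file cannot leak into the next one.
sub read_file {
    my ($self, $path, $cb) = @_;
    open my $fh, "<", $path or die "$path: $!";
    while (my $row = $self->{csv}->getline ($fh)) {
        $cb->($row);
    }
    close $fh;
    $self->{csv}->eol (undef);
    return;
}

package main;
```

Callers then loop over files through read_file and never touch eol themselves, which is the point of pushing the file handling into the wrapper.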