Bug #35222 for Text-CSV_XS: Allowing parsing instead of error 2023.

Sat Apr 19 11:26:28 2008 mlf-bitcard [...] shoebox.net - Ticket created

Subject:

Allowing parsing instead of error 2023.

Currently, Text::CSV_XS errors on malformed quoting. For example, the input: "foo "bar" baz",quux Will cause an error 2023 (which, by the way, the documentation says there is no repeatable test for, despite the test suite having a test for it). With a minor change (patch attached), this row can be parsed as: ['foo "bar" baz', 'quux'] Granted, the original input is mangled, but it is still parseable. The change also allows similarly mangled CSV to be parsed in a predictable manner. The change does cause a few test suite failures, mostly for tests that involve verifying that certain constructs fail to be parsed. I would have provided additional fixes for the test suite, but I wanted to get an idea of how acceptable the patch is. I can understand not wanting to accept mangled CSV, though it's hard to pin down what's really mangled with such an unspecified format. In that case this change could be wrapped in an option check, say if (csv->allow_mangled_quotes) { ... }. I can also understand not wanting to clutter up the code with lots of checks to handled CSV that probably shouldn't be handled in the first place. However, at this point in the code there would be an error anyways, and parsing would stop, so it doesn't greatly affect performance of the module as a whole. So, I'm perfectly willing to make additional changes. I'd just like some direction before doing so.

Subject:

2023.patch

--- CSV_XS.xs~orig 2008-04-15 22:51:24.000000000 -0800 +++ CSV_XS.xs 2008-04-19 07:12:02.000000000 -0800 @@ -876,7 +876,9 @@ } } #endif - ERROR_INSIDE_QUOTES (2023); + + CSV_PUT_SV(sv, c); + CSV_PUT_SV(sv, c2); } } else

Sat Apr 19 13:12:00 2008 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #35222] Allowing parsing instead of error 2023.
Date:	Sat, 19 Apr 2008 19:11:36 +0200
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Sat, 19 Apr 2008 11:26:29 -0400, "Michael Fowler via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> Currently, Text::CSV_XS errors on malformed quoting. For example, the > input: > > "foo "bar" baz",quux > > Will cause an error 2023 (which, by the way, the documentation says > there is no repeatable test for, despite the test suite having a test > for it).

Test for that was added in 0.42. I forgot to update the docs. Now done. Show quoted text

> With a minor change (patch attached), this row can be parsed as: > > ['foo "bar" baz', 'quux']

I'll have a look if that can fall under the "allow_loose_quots". Show quoted text

> Granted, the original input is mangled, but it is still parseable. The > change also allows similarly mangled CSV to be parsed in a predictable > manner. > > The change does cause a few test suite failures, mostly for tests that > involve verifying that certain constructs fail to be parsed. I would > have provided additional fixes for the test suite, but I wanted to get > an idea of how acceptable the patch is.

Not in the way it is. Just accepting when there is an error is never acceptable. Show quoted text

> I can understand not wanting to accept mangled CSV, though it's hard to > pin down what's really mangled with such an unspecified format. In that > case this change could be wrapped in an option check, say if > (csv->allow_mangled_quotes) { ... }.

This, as said, could fall under allow_loose_quotes. Show quoted text

> I can also understand not wanting to clutter up the code with lots of > checks to handled CSV that probably shouldn't be handled in the first > place. However, at this point in the code there would be an error > anyways, and parsing would stop, so it doesn't greatly affect > performance of the module as a whole.

That is also one of my main points of interest Show quoted text

> So, I'm perfectly willing to make additional changes. I'd just like > some direction before doing so.

The problem with 2023 is that you don't know if it is a REAL error, like in 2023,",2008-04-05,"Foo, Bar",\n or something that could have been meant as correct, like "foo's quote is ", as said",1,0\n (try to tell me how that should be parsed without looking at the code) or as in your case "foo "bar" baz",quux\n -- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using & porting perl 5.6.2, 5.8.x, 5.10.x on HP-UX 10.20, 11.00, 11.11, & 11.23, SuSE 10.1 & 10.2, AIX 5.2, and Cygwin. http://qa.perl.org http://mirrors.develooper.com/hpux/ http://www.test-smoke.org http://www.goldmark.org/jeff/stupid-disclaimers/

Sat Apr 19 13:12:02 2008 The RT System itself - Status changed from 'new' to 'open'

Sun Apr 20 17:08:08 2008 mlf-bitcard [...] shoebox.net - Correspondence added

From:

mlf-bitcard [...] shoebox.net

On Sat Apr 19 13:12:00 2008, h.m.brand@xs4all.nl wrote: Show quoted text

> On Sat, 19 Apr 2008 11:26:29 -0400, "Michael Fowler via RT" > <bug-Text-CSV_XS@rt.cpan.org> wrote:

> > but I wanted to get an idea of how acceptable the patch is.

> > Not in the way it is. Just accepting when there is an error is never > acceptable.

There should be some room for accepting invalid input and getting a result that can be corrected. As it is now parsing simply stops, and if the user wants to recover a row they essentially have to manually parse $csv->error_input and any followup data in the row. The original input that spawned this was something along the lines of: "foo "bar baz" quux",something,something,"Additional data\r\nwith newlines\r\nand possibly other characters that require binary => 1" When Text::CSV_XS stops parsing the row it does so on the first line; the rest remain in the input. With the modification the row is actually parsed as it was intended, though this will probably not always be the case. I can think of no simple, automated way of correcting the input that doesn't involve completely reinventing a CSV parser. Piggy-backing this error-correction on top of allow_loose_quotes is certainly an option, but it seems to me this input is more severely mangled than the input that allow_loose_quotes parses, and thus deserves a separate option. Show quoted text

> > So, I'm perfectly willing to make additional changes. I'd just like > > some direction before doing so.

> > The problem with 2023 is that you don't know if it is a REAL error, like > in

I would venture to say that anything the 2023 error catches is a real error; it can also be a more ambiguous error than what I've presented so far. For example, with a parser set to binary => 1: something,"foo "bar\r\nbaz" quux",something\r\n This could either be taken as two rows, each with two columns, or one row with four columns. The intent is ambiguous, and normally this sort of thing should result in an error, with the parser simply stopping. However, it can be useful to attempt to make sense of the input regardless, and return back some result, that the user can recover. In my opinion the simplest and most robust change would be to add a new option, check that option where error 2023 is used; if that option is set, attempt to correct, if not, error as normal. The majority of CSV parses would not be affected greatly, either in terms of performance, or at all in terms of how the CSV is parsed. Something a little more elaborate would involve doing the above, but marking the row as "dirty" or ambiguous in some way.

Mon Apr 21 02:27:34 2008 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #35222] Allowing parsing instead of error 2023.
Date:	Mon, 21 Apr 2008 08:27:14 +0200
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

As this optionally new allow_something flag would only trigger on 2023, it will not likely influence performance much. There will be a minimal hit though, because it has to cache its value. As the situation here, is clearly some serious error, the name of this new parameter should be very clear in that this is accepting bad input, and possible results may be (very) broken. Also note that, in the case you present, the error being hit depends on the definition of escape_char. If I set that to undef, you will get # CSV_XS ERROR: 2011 - ECR - Characters after end of quoted field even when allow_loose_quotes is on. I seriously doubt if fixing adding a new option will make you happy. If you want to continue after errors, maybe you should use eval. -- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using & porting perl 5.6.2, 5.8.x, 5.10.x on HP-UX 10.20, 11.00, 11.11, & 11.23, SuSE 10.1 & 10.2, AIX 5.2, and Cygwin. http://qa.perl.org http://mirrors.develooper.com/hpux/ http://www.test-smoke.org http://www.goldmark.org/jeff/stupid-disclaimers/

Mon Apr 21 18:39:27 2008 HMBRAND [...] cpan.org - Correspondence added

Should work in 0.43 with this: my $csv = Text::CSV_XS->new ({ allow_loose_quotes => 1, escape_char => undef });

Mon Apr 21 18:39:29 2008 HMBRAND [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Apr 22 00:02:27 2008 mlf-bitcard [...] shoebox.net - Correspondence added

From:

mlf-bitcard [...] shoebox.net

On Mon Apr 21 02:27:34 2008, h.m.brand@xs4all.nl wrote: Show quoted text

> Also note that, in the case you present, the error being hit depends on > the definition of escape_char. If I set that to undef, you will get > > # CSV_XS ERROR: 2011 - ECR - Characters after end of quoted field > > even when allow_loose_quotes is on. I seriously doubt if fixing adding > a new option will make you happy. If you want to continue after errors, > maybe you should use eval. >

When I talk about error recovery I meant being able to continue parsing the badly formed CSV data. There is no fatal exception involved. Your escape_char fix (mentioned in another email) is certainly an option, but I'm afraid it doesn't really solve this situation. The problem is the CSV is otherwise well-formed, except for a few rows with rogue quotes. The other rows have escaped quotes in them, and it would be preferable for that escaping to be handled properly. I have attached a patch that adds an 'allow_bad_quoting' option, along with attendant tests and documentation (it's a -p1 patch, I believe). It's what I was thinking of from the beginning, though I'm thinking there is a more general solution to the problem of error recovery so that the entire input can be parsed even with various malformed elements. I'll have to put more thought into it. To be quite honest, the patch doesn't solve any problems I'm currently having. I do, however, very much like Text::CSV_XS; it is very fast and robust. I think my patch makes it a little more robust and flexible in terms of parsing bad CSV, and would like to see it applied in some form. Please, let me know if there's anything more I can do.

diff -urN Text-CSV_XS-0.42~orig/CSV_XS.pm Text-CSV_XS-0.42/CSV_XS.pm --- Text-CSV_XS-0.42~orig/CSV_XS.pm 2008-04-16 05:11:51.000000000 -0800 +++ Text-CSV_XS-0.42/CSV_XS.pm 2008-04-21 19:25:55.000000000 -0800 @@ -63,6 +63,7 @@ allow_loose_quotes => 0, allow_loose_escapes => 0, allow_whitespace => 0, + allow_bad_quoting => 0, blank_is_undef => 0, verbatim => 0, types => undef, @@ -110,6 +111,7 @@ allow_loose_escapes => 7, allow_double_quoted => 8, allow_whitespace => 9, + allow_bad_quoting => 24, blank_is_undef => 10, eol => 11, # 11 .. 18 @@ -217,6 +219,13 @@ $self->{allow_whitespace}; } # allow_whitespace +sub allow_bad_quoting +{ + my $self = shift; + @_ and $self->_set_attr ("allow_bad_quoting", shift); + $self->{allow_bad_quoting}; + } # allow_bad_quoting + sub blank_is_undef { my $self = shift; @@ -724,6 +733,25 @@ allow this format, we cannot help there are some vendors that make their applications spit out lines styled like this. +=item allow_bad_quoting + +By default, C<quote_char> characters in a field are strictly parsed. Any +non-escaped quote character causes the parser to simply fail. Turning this +option on will allow the parser to attempt to correct by bumping past the +offending quote character. For example, normally: + + 1,"foo "bar" baz",42 + +Would result in a parse error (2023). With allow_bad_quoting this will parse +into three fields: + + ("1", 'foo "bar" baz', "42") + +Keep in mind, this behaviour is fragile, and is to only be used as a last +resort to recover from badly formed CSV. If you use this option be very +careful to verify the results you get are what you intended; if at all +possible, correct the original CSV data and avoid this option. + =item escape_char The character used for escaping certain characters inside quoted fields. @@ -837,6 +865,7 @@ allow_loose_quotes => 0, allow_loose_escapes => 0, allow_whitespace => 0, + allow_bad_quoting => 0, blank_is_undef => 0, verbatim => 0, }); diff -urN Text-CSV_XS-0.42~orig/CSV_XS.xs Text-CSV_XS-0.42/CSV_XS.xs --- Text-CSV_XS-0.42~orig/CSV_XS.xs 2008-04-15 22:51:24.000000000 -0800 +++ Text-CSV_XS-0.42/CSV_XS.xs 2008-04-21 19:35:15.000000000 -0800 @@ -34,6 +34,7 @@ #define CACHE_ID_allow_loose_escapes 7 #define CACHE_ID_allow_double_quoted 8 #define CACHE_ID_allow_whitespace 9 +#define CACHE_ID_allow_bad_quoting 24 #define CACHE_ID_blank_is_undef 10 #define CACHE_ID_eol 11 #define CACHE_ID_eol_len 19 @@ -83,6 +84,7 @@ byte allow_loose_escapes; byte allow_double_quoted; byte allow_whitespace; + byte allow_bad_quoting; byte blank_is_undef; byte verbatim; @@ -209,6 +211,7 @@ csv->allow_loose_escapes = csv->cache[CACHE_ID_allow_loose_escapes]; csv->allow_double_quoted = csv->cache[CACHE_ID_allow_double_quoted]; csv->allow_whitespace = csv->cache[CACHE_ID_allow_whitespace ]; + csv->allow_bad_quoting = csv->cache[CACHE_ID_allow_bad_quoting ]; csv->blank_is_undef = csv->cache[CACHE_ID_blank_is_undef ]; csv->verbatim = csv->cache[CACHE_ID_verbatim ]; #endif @@ -286,6 +289,7 @@ csv->allow_loose_escapes = bool_opt ("allow_loose_escapes"); csv->allow_double_quoted = bool_opt ("allow_double_quoted"); csv->allow_whitespace = bool_opt ("allow_whitespace"); + csv->allow_bad_quoting = bool_opt ("allow_bad_quoting"); csv->blank_is_undef = bool_opt ("blank_is_undef"); csv->verbatim = bool_opt ("verbatim"); #endif @@ -303,6 +307,7 @@ csv->cache[CACHE_ID_allow_loose_escapes] = csv->allow_loose_escapes; csv->cache[CACHE_ID_allow_double_quoted] = csv->allow_double_quoted; csv->cache[CACHE_ID_allow_whitespace] = csv->allow_whitespace; + csv->cache[CACHE_ID_allow_bad_quoting] = csv->allow_bad_quoting; csv->cache[CACHE_ID_blank_is_undef] = csv->blank_is_undef; csv->cache[CACHE_ID_verbatim] = csv->verbatim; #endif @@ -875,9 +880,18 @@ /* uncovered */ goto restart; } } -#endif - ERROR_INSIDE_QUOTES (2023); + + if (csv->allow_bad_quoting) { + CSV_PUT_SV(sv, c); + CSV_PUT_SV(sv, c2); + } + else { + ERROR_INSIDE_QUOTES (2023); + } } +#else + ERROR_INSIDE_QUOTES (2023); +#endif } else /* !waitingForField, !InsideQuotes */ Binary files Text-CSV_XS-0.42~orig/.CSV_XS.xs.swp and Text-CSV_XS-0.42/.CSV_XS.xs.swp differ diff -urN Text-CSV_XS-0.42~orig/t/12_acc.t Text-CSV_XS-0.42/t/12_acc.t --- Text-CSV_XS-0.42~orig/t/12_acc.t 2008-02-25 23:24:40.000000000 -0900 +++ Text-CSV_XS-0.42/t/12_acc.t 2008-04-21 19:51:40.000000000 -0800 @@ -3,7 +3,7 @@ use strict; $^W = 1; # use warnings core since 5.6 -use Test::More tests => 44; +use Test::More tests => 46; BEGIN { use_ok "Text::CSV_XS"; @@ -23,6 +23,7 @@ is ($csv->allow_loose_quotes, 0, "allow_loose_quotes"); is ($csv->allow_loose_escapes, 0, "allow_loose_escapes"); is ($csv->allow_whitespace, 0, "allow_whitespace"); +is ($csv->allow_bad_quoting, 0, "allow_bad_quoting"); is ($csv->blank_is_undef, 0, "blank_is_undef"); is ($csv->verbatim, 0, "verbatim"); @@ -40,6 +41,7 @@ is ($csv->allow_loose_quotes (1), 1, "allow_loose_quotes (1)"); is ($csv->allow_loose_escapes (1), 1, "allow_loose_escapes (1)"); is ($csv->allow_whitespace (1), 1, "allow_whitespace (1)"); +is ($csv->allow_bad_quoting (1), 1, "allow_bad_quoting (1)"); is ($csv->blank_is_undef (1), 1, "blank_is_undef (1)"); is ($csv->verbatim (1), 1, "verbatim (1)"); is ($csv->escape_char ("\\"), "\\", "escape_char (\\)"); diff -urN Text-CSV_XS-0.42~orig/t/65_allow.t Text-CSV_XS-0.42/t/65_allow.t --- Text-CSV_XS-0.42~orig/t/65_allow.t 2008-04-05 14:02:10.000000000 -0800 +++ Text-CSV_XS-0.42/t/65_allow.t 2008-04-21 19:49:25.000000000 -0800 @@ -4,7 +4,7 @@ $^W = 1; #use Test::More "no_plan"; - use Test::More tests => 803; + use Test::More tests => 828; BEGIN { use_ok "Text::CSV_XS", (); @@ -153,6 +153,30 @@ } } +ok (1, "Allow bad quoting"); +# Allow unescaped quotes inside a quoted field +{ my @bad = ( + # valid, line + [ 1, 1, qq{foo,bar,"quux",quux}, ], + [ 2, 0, qq{rj,bs,"r"jb"s",rjbs}, ], + [ 3, 0, qq{"some "spaced" quote data",2,3,4}, ], + [ 4, 0, qq{"a lone " quote",foo,bar} ], + [ 5, 1, qq{and an,entirely,quoted,"field"}, ], + [ 6, 1, qq{and then,"one with ""quoted"" quotes",okay,?}, ], + ); + + for (@bad) { + my ($tst, $valid, $bad) = @$_; + $csv = Text::CSV_XS->new (); + ok ($csv, "$tst - new (alq => 0)"); + is ($csv->parse ($bad), $valid, "$tst - parse () fail"); + + $csv->allow_bad_quoting (1); + ok ($csv->parse ($bad), "$tst - parse () pass"); + ok (my @f = $csv->fields, "$tst - fields"); + } + } + ok (1, "blank_is_undef"); foreach my $conf ( [ 0, 0, 0, 1, "", " ", '""', 2, "", "", "" ],

Tue Apr 22 00:02:35 2008 The RT System itself - Status changed from 'resolved' to 'open'

Wed Apr 23 04:36:13 2008 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #35222] Allowing parsing instead of error 2023.
Date:	Wed, 23 Apr 2008 10:35:42 +0200
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Tue, 22 Apr 2008 00:02:36 -0400, "Michael Fowler via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> When I talk about error recovery I meant being able to continue parsing > the badly formed CSV data. There is no fatal exception involved.

Text::CSV_XS does not die or croak when parsing. it returns a false state from parse () or getline (). Text::CSV_XS also does not close the handle of it's input, so there is nothing stopping you from doing the right thing (TM) when running into problems. What was scanned is available in $csv->error_input (), and you could easily add the rest of the line (if available) to that and use whatever method you like to `fix' or analyze that particular problem. the values returned by $csv->error_diag () might be of some help. Show quoted text

> Your escape_char fix (mentioned in another email) is certainly an > option, but I'm afraid it doesn't really solve this situation.

Why not? If that doesn't address the original problem you stated, your line is probably beyond automatic parsing, and needs some human eyes to detect the problem. Note again that in many cases, with the default setting for quote_char and escape_char, there is no way to see if the " is meant as a quote or as an escape. That is why my suggested solution addresses all the automatable problems only when these two differ. Show quoted text

> The problem is the CSV is otherwise well-formed, except for a few rows > with rogue quotes. The other rows have escaped quotes in them, and it > would be preferable for that escaping to be handled properly. > > I have attached a patch that adds an 'allow_bad_quoting' option, along > with attendant tests and documentation (it's a -p1 patch, I believe).

However I value your work, I won't go that way. Once the state is beyond what I support now, I am in the opinion that it should be handled by the parent. If you expect bad lines, use some protection. However I promote the getline () calls over some personalized 'while (<>) {' + 'parse ()', going that way will provide you with all you need to `catch' bad lines and deal with those appropriately. Show quoted text

> It's what I was thinking of from the beginning, though I'm thinking > there is a more general solution to the problem of error recovery so > that the entire input can be parsed even with various malformed > elements. I'll have to put more thought into it.

again, the bogus parsed line is available in $csv->error_input (). Though that part is not extensively tested (I have no idea what happens if those lines are very very long), it is a very good starting point. Show quoted text

> To be quite honest, the patch doesn't solve any problems I'm currently > having. I do, however, very much like Text::CSV_XS; it is very fast and > robust.

I like to keep it that way, and allowing anything that is not (easy) controllable will put great pressure on the code. Show quoted text

> I think my patch makes it a little more robust and flexible in > terms of parsing bad CSV,

Flexible, yes, robust no. Show quoted text

> and would like to see it applied in some form. > Please, let me know if there's anything more I can do.

Sorry to disappoint you, but you have not convinced me. The more I think about it, the less I like the idea. And if you notice the delay in this reply, I think about it a lot lately. -- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using & porting perl 5.6.2, 5.8.x, 5.10.x on HP-UX 10.20, 11.00, 11.11, & 11.23, SuSE 10.1 & 10.2, AIX 5.2, and Cygwin. http://qa.perl.org http://mirrors.develooper.com/hpux/ http://www.test-smoke.org http://www.goldmark.org/jeff/stupid-disclaimers/

Wed Apr 23 06:10:48 2008 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #35222] Allowing parsing instead of error 2023.
Date:	Wed, 23 Apr 2008 12:10:21 +0200
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Wed, 23 Apr 2008 04:36:14 -0400, "h.m.brand@xs4all.nl via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> > It's what I was thinking of from the beginning, though I'm thinking > > there is a more general solution to the problem of error recovery so > > that the entire input can be parsed even with various malformed > > elements. I'll have to put more thought into it.

> > again, the bogus parsed line is available in $csv->error_input (). > Though that part is not extensively tested (I have no idea what happens > if those lines are very very long), it is a very good starting point.

I just did some tests with long lines (over 128k each) and the complete line is in the return value of $csv->error_input (). In the upcoming 0.44, I fixed the position in the 3rd return value of $csv->error_diag (). See examples/csv-check to see how to exactly reproduce the point of failure. I've used that example script to check *all* the CSV files on my PC, and indeed found some erroneous files. Oh joy! As above tests prove that no information is lost at all, I'm confident in that this new option is not needed. -- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using & porting perl 5.6.2, 5.8.x, 5.10.x on HP-UX 10.20, 11.00, 11.11, & 11.23, SuSE 10.1 & 10.2, AIX 5.2, and Cygwin. http://qa.perl.org http://mirrors.develooper.com/hpux/ http://www.test-smoke.org http://www.goldmark.org/jeff/stupid-disclaimers/

Wed Apr 23 06:27:45 2008 mlf-bitcard [...] shoebox.net - Correspondence added

From:

mlf-bitcard [...] shoebox.net

Yes, I had considered using $csv->error_input to recover from the bad input, but there was no reliable way of doing this. The actual row in the data file is difficult to reproduce reliably here, but here is a more complete example: "Etablissements Delhaize Freres et Cie "Le Lion" S.A.","DEG","$827.4","paragraph\r\nparagraph\r\nparagraph\r\n" The paragraphs were much longer, and could contain escaped quotes, delimiters, etc. Text::CSV_XS stops parsing on the first line, giving an error_input value of: '"Etablissements Delhaize Freres et Cie "Le Lion" S.A.","DEG","$827.4","paragraph\r\n' At this point, I cannot programmatically determine where the parser is in the text, how many further lines to read, how many lines have been read, etc. Text::CSV_XS is much more suited to determining when this record ends; doing so myself would require essentially re-implementing a CSV parser. The other solution I had considered was to somehow "push back" the error_input value, after correcting it, so that the parser would try to parse it from the beginning. This is certainly an option, but I initially discarded it because it would involve getline() checking first a buffer to see if there was a push-back value, then attempting to read the file. This sounded like a bad solution because it would affect all parses, on every getline() call. Show quoted text

> > Your escape_char fix (mentioned in another email) is certainly an > > option, but I'm afraid it doesn't really solve this situation.

> > Why not? If that doesn't address the original problem you stated, your > line is probably beyond automatic parsing, and needs some human eyes to > detect the problem.

It doesn't solve the situation because there are escaped quotes in the input. The patch I gave you causes the file to be parsed as the original provider intended, so it is indeed possible to automatically parse it. It might help to have the original input file. You can find it at http://p3m.org/code/Text-CSV_XS/symbols.csv . The source is the NASDAQ website, the original URL for this data is http://www.nasdaq.com//asp/symbols.asp?exchange=N&start=0 . Show quoted text

> I like to keep it that way, and allowing anything that is not (easy) > controllable will put great pressure on the code.

Show quoted text

> Flexible, yes, robust no.

The code is more robust for being able to handle erroneous input, and I don't really understand what kind of pressure this puts on it. If you don't like the changes, fine. I'm not particularly interested in pursuing it any longer. I think it would be beneficial to have a fast CSV parser that is capable of dealing with poorly formed CSV in a recoverable manner. I think this is a step towards that goal. If you do not agree, or you do not wish to make it such a parser, that is your prerogative; it is your code. I will not bother you any longer with these ideas. Good luck.

Wed Apr 23 07:04:02 2008 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #35222] Allowing parsing instead of error 2023.
Date:	Wed, 23 Apr 2008 13:03:35 +0200
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Wed, 23 Apr 2008 06:27:47 -0400, "Michael Fowler via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> Yes, I had considered using $csv->error_input to recover from the bad > input, but there was no reliable way of doing this. The actual row in > the data file is difficult to reproduce reliably here, but here is a > more complete example: > [snip] > At this point, I cannot programmatically determine where the parser is > in the text, how many further lines to read, how many lines have been > read, etc.

Neither can the parser of Text::CSV_XS, as it cannot tell if any EOL is part of a quoted field or the separator to a new line. Show quoted text

> Text::CSV_XS is much more suited to determining when this record ends; > doing so myself would require essentially re-implementing a CSV parser.

Or just continue after the error, hoping Text::CSV_XS' parser will do the right thing (TM) Show quoted text

> The other solution I had considered was to somehow "push back" the > error_input value, after correcting it, so that the parser would try to > parse it from the beginning. This is certainly an option, but I > initially discarded it because it would involve getline() checking first > a buffer to see if there was a push-back value, then attempting to read > the file. This sounded like a bad solution because it would affect all > parses, on every getline () call.

A modified parser could be something like this: --8<--- use strict; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1 }); my $file = "symbols.csv"; open my $fh, "<", $file or die "$file: $!\n"; while (1) { my $row = $csv->getline ($fh); unless ($row) { $csv->eof and last; my @diag = $csv->error_diag; print STDERR "$file line $./$diag[2] - $diag[0] - $diag[1]\n"; my $ep = $diag[2] - 1; # diag[2] is 1-based my $err = $csv->error_input . " "; substr $err, $ep + 1, 0, "*"; substr $err, $ep, 0, "*"; ($err = substr $err, $ep - 5, 12) =~ s/ +$//; print " |$err|\n"; next; } # Data was fine, do something with $row } -->8--- If you pass me your real e-mail, I can sent you the formatted code, as RT probably ruined the layout here. If you want to parse the erroneous lines with other options, you just create a second csv object with more lenient options, and pass it $csv->error_input Show quoted text

> It might help to have the original input file. You can find it at > http://p3m.org/code/Text-CSV_XS/symbols.csv . The source is the NASDAQ > website, the original URL for this data is > http://www.nasdaq.com//asp/symbols.asp?exchange=N&start=0 .

Text-CSV_XS 155 > perl -Iblib/{lib.arch} examples/csv-check symbols.csv Cecked with Text::CSV_XS 0.43 symbols.csv line 10918/40 - 2023 - EIQ - QUO character not allowed | Cie *"*Le L| Exit 231 With the snippet from above: Text-CSV_XS 158 > perl -Iblib/{lib.arch} symbols.pl symbols.csv line 10918/40 - 2023 - EIQ - QUO character not allowed | Cie *"*Le L| symbols.csv line 10932/43 - 2034 - EIF - Loose unescaped quote |re...*"*"htt| symbols.csv line 22046/27 - 2023 - EIQ - QUO character not allowed |pany *"*Vimp| symbols.csv line 22058/85 - 2034 - EIF - Loose unescaped quote |re...*"*"htt| Show quoted text

> > I like to keep it that way, and allowing anything that is not (easy) > > controllable will put great pressure on the code.

>

> > Flexible, yes, robust no.

> > The code is more robust for being able to handle erroneous input, and I > don't really understand what kind of pressure this puts on it.

I hope to meet you in a YAPC::Europe event or something similar, so I can guide you through the code and show you the pitfalls I hit myself when trying to make the code maintainable. And even now, I guess many other maintainers might argue that it is maintainable by now. Show quoted text

> If you don't like the changes, fine. I'm not particularly interested in > pursuing it any longer. I think it would be beneficial to have a fast > CSV parser that is capable of dealing with poorly formed CSV in a > recoverable manner.

So do I, but I think I already showed that it is already that way. Show quoted text

> I think this is a step towards that goal. If you do not agree, or you do > not wish to make it such a parser,

It already *is* such a parser, You just need to see how you can use it's powers without modifying the internals Show quoted text

> that is your prerogative; it is your code.

By now it is, but when I took over, there was a lot to fix. Show quoted text

> I will not bother you any longer with these ideas. Good luck.

Your tone strikes me as if I insulted you. That is not the case. I value your thoughts and your input, but not the proposed fix. Please tell me where or how my above approach fails to attack your problem. -- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using & porting perl 5.6.2, 5.8.x, 5.10.x on HP-UX 10.20, 11.00, 11.11, & 11.23, SuSE 10.1 & 10.2, AIX 5.2, and Cygwin. http://qa.perl.org http://mirrors.develooper.com/hpux/ http://www.test-smoke.org http://www.goldmark.org/jeff/stupid-disclaimers/

Wed Apr 23 07:40:01 2008 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #35222] Allowing parsing instead of error 2023.
Date:	Wed, 23 Apr 2008 13:39:34 +0200
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Wed, 23 Apr 2008 06:27:47 -0400, "Michael Fowler via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> It might help to have the original input file. You can find it at > http://p3m.org/code/Text-CSV_XS/symbols.csv . The source is the NASDAQ > website, the original URL for this data is > http://www.nasdaq.com//asp/symbols.asp?exchange=N&start=0 .

Get the latest snapshot here: http://repo.or.cz/w/Text-CSV_XS.git?a=snapshot;sf=tgz Get the parse script here: http://www.xs4all.nl/~hmbrand/symbols.pl Run it: Text-CSV_XS 181 > perl -Iblib/{lib,arch} tmp/symbols.pl tmp/symbols.csv Reading tmp/symbols.csv with Text::CSV_XS 0.44 tmp/symbols.csv line 10918/40 - 2023 - EIQ - QUO character not allowed | Cie *"*Le L| Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Accepted in allow mode ... tmp/symbols.csv line 22046/27 - 2023 - EIQ - QUO character not allowed |pany *"*Vimp| Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Extending line with next chunk Accepted in allow mode ... --8<--- symbols.pl #!/pro/bin/perl use strict; use warnings; use Text::CSV_XS; my $csv = Text::CSV_XS->new ({ binary => 1 }); my $csa = Text::CSV_XS->new ({ binary => 1, allow_loose_quotes => 1, escape_char => undef, }); my $file = @ARGV ? shift : "symbols.csv"; open my $fh, "<", $file or die "$file: $!\n"; my %err_eol = map { $_ => 1 } 2010, 2027, 2031, 2032; print STDERR "Reading $file with Text::CSV_XS $Text::CSV_XS::VERSION\n"; while (1) { my $row = $csv->getline ($fh); unless ($row) { $csv->eof and last; my @diag = $csv->error_diag; print STDERR "$file line $./$diag[2] - $diag[0] - $diag[1]\n"; my $ep = $diag[2] - 1; # diag[2] is 1-based my $ein = $csv->error_input; my $err = $ein . " "; substr $err, $ep + 1, 0, "*"; substr $err, $ep, 0, "*"; ($err = substr $err, $ep - 5, 12) =~ s/ +$//; print STDERR " |$err|\n"; REPARSE: { if ($row = $csa->parse ($ein)) { print STDERR "Accepted in allow mode ...\n"; } else { my @diag = $csa->error_diag; if (exists $err_eol{$diag[0]}) { print STDERR " Extending line with next chunk\n"; $ein .= scalar <$fh>; goto REPARSE; } print STDERR " Also could not parse it in allow mode\n"; print STDERR " $./$diag[2] - $diag[0] - $diag[1]\n"; print STDERR " Line skipped\n"; next; } } } # Data was fine, do something with $row } -->8--- -- H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/) using & porting perl 5.6.2, 5.8.x, 5.10.x on HP-UX 10.20, 11.00, 11.11, & 11.23, SuSE 10.1 & 10.2, AIX 5.2, and Cygwin. http://qa.perl.org http://mirrors.develooper.com/hpux/ http://www.test-smoke.org http://www.goldmark.org/jeff/stupid-disclaimers/

Tue May 06 18:15:16 2008 HMBRAND [...] cpan.org - Correspondence added

Closed. No feedback on given solution. I may assume that it worked.

Tue May 06 18:15:19 2008 HMBRAND [...] cpan.org - Status changed from 'open' to 'resolved'

Tue May 06 18:15:22 2008 HMBRAND [...] cpan.org - Given to HMBRAND

Tue Sep 02 04:30:19 2008 HMBRAND [...] cpan.org - Fixed in 0.45 added