This queue is for tickets about the Text-CSV_XS CPAN distribution.

Report information
The Basics
Id: 42642
Status: resolved
Priority: 0/
Queue: Text-CSV_XS

People
Owner: Nobody in particular
Requestors: MSISK [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: 0.59



Subject: failure on unusual quote/sep values
See the attached example file. For a quote character of 0xfe and a separator character of 0x14 (with utf8 flag on for these strings), parsing fails with:

ECR - Characters after end of quoted field

Without the utf8 flag set on the quote/sep strings, parsing fails to separate the input into fields but produces no error.

This works with Text::CSV_PP.

Thanks,
Matt

P.S. yes, I've been encountering CSV files formatted this way out in the real world.
Subject: test.csv
þDOGþþCATþþWOMBATþþBANDERSNATCHþ
þ0þþ1þþ2þþ3þ
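The 0x14 separators are control characters and do not show up in the attachment as rendered here. A minimal sketch for recreating an equivalent file with core Perl follows. The field values are taken from the parser output quoted later in this ticket; the file name `42642.csv` matches the scripts in the reply; the "\n" line ending and the single-byte (non-UTF-8) encoding of 0xfe are assumptions.

```perl
use strict;
use warnings;

# Quote and separator as single bytes (no utf8 flag involved)
my ($quo, $sep) = ("\xfe", "\x14");

# Field values as they appear in the parsed output in this ticket
my @rows = ([qw( DOG CAT WOMBAT BANDERSNATCH )], [ 0 .. 3 ]);

# Quote every field, join fields with the separator, end each row with "\n"
# (the line ending of the real attachment is an assumption)
my $data = "";
for my $row (@rows) {
    $data .= join ($sep, map { $quo . $_ . $quo } @$row) . "\n";
}

# :raw so the bytes are written exactly as built above
open my $fh, ">:raw", "42642.csv" or die "42642.csv: $!\n";
print $fh $data;
close $fh or die "42642.csv: $!\n";

printf "%d bytes written\n", length $data;
```

Inspecting the result with a hex dump tool should show `fe` around each field and `14` between fields.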
Subject: Re: [rt.cpan.org #42642] failure on unusual quote/sep values
Date: Thu, 22 Jan 2009 16:58:34 +0100
To: bug-Text-CSV_XS [...] rt.cpan.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Wed, 21 Jan 2009 17:56:03 -0500, "MSISK via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote:
> See the attached example file. For a quote character of 0xfe and a
> separator character of 0x14 (with utf8 flag on for these strings),
> parsing fails with:
>
> ECR - Characters after end of quoted field
>
> Without the utf8 flag set on the quote/sep strings, parsing fails to
> separate the input into fields but produces no error.
>
> This works with Text::CSV_PP.
Text::CSV_XS does NOT support quote and separator characters in Unicode, as per the documented specifications.

$ cat 42642.pl
use strict;
use warnings;

use Data::Peek;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    binary     => 1,
    quote_char => "\xfe",
    sep_char   => "\x14",
    });

open my $fh, "<", "42642.csv" or die "42642.csv: $!\n";
while (my $row = $csv->getline ($fh)) {
    print "Row $.\n";
    print " ", DPeek ($_), "\n" for @$row;
    }
$csv->eof or $csv->error_diag;
$ perl 42642.pl
Row 1
 PV("DOG"\0)
 PV("CAT"\0)
 PV("WOMBAT"\0)
 PV("BANDERSNATCH"\0)
Row 2
 PV("0"\0)
 PV("1"\0)
 PV("2"\0)
 PV("3"\0)

From the docs:
--8<---
Though this is the most clear and restrictive definition, Text::CSV_XS is way more liberal than this, and allows extension:

· Line termination by a single carriage return is accepted by default

· The separation-, escape-, and escape- characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters outside this range may or may not work as expected. Multibyte characters, like U+060c (ARABIC COMMA), U+FF0C (FULLWIDTH COMMA), U+241B (SYMBOL FOR ESCAPE), U+2424 (SYMBOL FOR NEWLINE), U+FF02 (FULLWIDTH QUOTATION MARK), and U+201C (LEFT DOUBLE QUOTATION MARK) (to give some examples of what might look promising) are therefor not allowed.
-->8---

The solution to your problem is to decode your sep_char and quote_char.

To demonstrate that this works (maybe not through the most elegant solution):

$ cat 42642.pl
use strict;
use warnings;

use Data::Peek;
use Text::CSV_XS;

my $quo = substr ("\xfe\x{20ac}", 0, 1);
my $sep = substr ("\x14\x{20ac}", 0, 1);

print "quote: ", DPeek ($quo), "\n";
print "sep: ", DPeek ($sep), "\n";

utf8::decode ($quo);
utf8::decode ($sep);

my $csv = Text::CSV_XS->new ({
    binary     => 1,
    quote_char => $quo,
    sep_char   => $sep,
    });

print "quote: ", DPeek ($csv->quote_char), "\n";
print "sep: ", DPeek ($csv->sep_char), "\n";

open my $fh, "<", "42642.csv" or die "42642.csv: $!\n";
while (my $row = $csv->getline ($fh)) {
    print "Row $.\n";
    print " ", DPeek ($_), "\n" for @$row;
    }
$csv->eof or $csv->error_diag;
$ perl 42642.pl
quote: PV("\303\276"\0) [UTF8 "\x{fe}"]
sep: PV("\24"\0) [UTF8 "\x{14}"]
quote: PV("\376"\0)
sep: PV("\24"\0)
Row 1
 PV("DOG"\0)
 PV("CAT"\0)
 PV("WOMBAT"\0)
 PV("BANDERSNATCH"\0)
Row 2
 PV("0"\0)
 PV("1"\0)
 PV("2"\0)
 PV("3"\0)
$

-- 
H.Merijn Brand  http://tux.nl  Perl Monger  http://amsterdam.pm.org/
using & porting perl 5.6.2, 5.8.x, 5.10.x, 5.11.x on HP-UX 10.20, 11.00,
11.11, 11.23, and 11.31, SuSE 10.1, 10.3, and 11.0, AIX 5.2, and Cygwin.
http://mirrors.develooper.com/hpux/  http://www.test-smoke.org/
http://qa.perl.org  http://www.goldmark.org/jeff/stupid-disclaimers/
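The two dumps of the quote character in the reply above differ only in the internal representation of the string: with the utf8 flag on, U+00FE is stored as the two bytes 0xC3 0xBE, which cannot match the single byte 0xFE in the file. The sketch below reproduces that difference with core Perl only. `single_byte_attr` is a hypothetical helper, not part of Text::CSV_XS, and it uses `utf8::downgrade` rather than the `utf8::decode` call shown in the reply, so treat it as one possible approach under those assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV_XS): return a copy of $chr that
# is stored as exactly one byte, or die if that is not possible.
sub single_byte_attr {
    my ($chr) = @_;
    # With a true second argument utf8::downgrade returns false instead of
    # dying when the string holds a character above 0xFF
    utf8::downgrade ($chr, 1) or die "character above 0xFF cannot be a single byte\n";
    length ($chr) == 1        or die "attribute must be exactly one character\n";
    return $chr;
}

# Force the utf8 flag on, as described in the report
my $quo = "\xfe";
utf8::upgrade ($quo);

# Expose the internal UTF-8 bytes of the flagged string
my $internal = $quo;
utf8::encode ($internal);
printf "flag on : utf8=%d bytes=%s\n", utf8::is_utf8 ($quo) ? 1 : 0, unpack "H*", $internal;

my $fixed = single_byte_attr ($quo);
printf "flag off: utf8=%d bytes=%s\n", utf8::is_utf8 ($fixed) ? 1 : 0, unpack "H*", $fixed;
```

This should print `bytes=c3be` for the flagged string and `bytes=fe` after the helper has run. The returned value could then be passed as `quote_char` or `sep_char` to `Text::CSV_XS->new`.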
The last paragraph in this section should address this, new in 0.59:

--8<---
· The separation-, escape-, and escape- characters can be any ASCII character in the range from 0x20 (space) to 0x7E (tilde). Characters outside this range may or may not work as expected. Multibyte characters, like U+060c (ARABIC COMMA), U+FF0C (FULLWIDTH COMMA), U+241B (SYMBOL FOR ESCAPE), U+2424 (SYMBOL FOR NEWLINE), U+FF02 (FULLWIDTH QUOTATION MARK), and U+201C (LEFT DOUBLE QUOTATION MARK) (to give some examples of what might look promising) are therefor not allowed.

If you use perl-5.8.2 or higher, these three attributes are utf8-decoded, to increase the likelyhood of success. This way U+00FE will be allowed as a quote character.
-->8---
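Code that depends on the behaviour described in that paragraph may want to check for it before relying on a flagged quote or separator string. A minimal sketch, assuming the two conditions named above (perl 5.8.2 or higher, Text::CSV_XS 0.59 or higher) are the only ones that matter; `attrs_are_utf8_decoded` is a hypothetical helper name, not something the module provides.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the given class reports version 0.59 or
# higher and the running perl is 5.8.2 or higher.
sub attrs_are_utf8_decoded {
    my ($class) = @_;
    return 0 if $] < 5.008002;
    # UNIVERSAL::VERSION dies when the class is older than requested
    # (or has no version at all), so wrap the check in eval
    return eval { $class->VERSION (0.59); 1 } ? 1 : 0;
}

# Load the module if it is installed; the check simply fails otherwise
eval { require Text::CSV_XS; 1 };

print attrs_are_utf8_decoded ("Text::CSV_XS")
    ? "quote/sep/escape attributes should be utf8-decoded\n"
    : "pass single-byte quote/sep/escape strings explicitly\n";
```

On older installations the fallback is the manual approach from the reply above: make sure the attribute strings are single-byte before handing them to the constructor.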