Skip Menu |

This queue is for tickets about the Text-CSV_XS CPAN distribution.

Report information
The Basics
Id: 80680
Status: resolved
Priority: 0/
Queue: Text-CSV_XS

People
Owner: Nobody in particular
Requestors: oleg [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 0.50
Fixed in: 0.92



Subject: Text::CSV_XS produces garbage on some data
Hello Recently I found, that some columns in my csv contains unreadable data. It looks like "Екатеринбург". I attached minimal example to reproduce this problem. I tested it on debian sid with perl 5.14.2 and Text::CSV_XS 0.91. The strange thing is that if you'll set `always_quote' to false, then all works fine. At the same time this test works correctly with Text::CSV_PP independently of `always_quote' parameter.
Subject: te.pl
use strict; use Text::CSV_XS; my $csv = Text::CSV_XS->new({eol => "\n", always_quote => 0}); my $row = [ "\x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}", "\x{410}\x{417}\x{421} \x{2116}303", " \x{421}\x{432}\x{435}\x{440}\x{434}\x{43b}\x{43e}\x{432}\x{441}\x{43a}\x{430}\x{44f} \x{43e}\x{431}\x{43b}\x{430}\x{441}\x{442}\x{44c}, \x{433}. \x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}, \x{414}\x{443}\x{431}\x{43b}\x{435}\x{440} \x{421}\x{438}\x{431}\x{438}\x{440}\x{441}\x{43a}\x{43e}\x{433}\x{43e} \x{442}\x{440}\x{430}\x{43a}\x{442}\x{430}, 5 \x{43a}\x{43c} \x{43b}\x{435}\x{432}\x{430}\x{44f} \x{441}\x{442}\x{43e}\x{440}\x{43e}\x{43d}\x{430} ", '', 'G-95', '95', '92', '80', '', "\x{414}\x{422}", '', '', "\x{41c}\x{430}\x{433}\x{430}\x{437}\x{438}\x{43d}", "\x{41a}\x{430}\x{444}\x{435}", "\x{422}\x{443}\x{430}\x{43b}\x{435}\x{442}", '', "\x{41a}\x{440}\x{443}\x{433}\x{43b}\x{43e}\x{441}\x{443}\x{442}\x{43e}\x{447}\x{43d}\x{430}\x{44f} \x{440}\x{430}\x{431}\x{43e}\x{442}\x{430}", '', "\x{41f}\x{440}\x{438}\x{43d}\x{438}\x{43c}\x{430}\x{44e}\x{442} \x{43a}\x{430}\x{440}\x{442}\x{44b} \x{ab}\x{41d}\x{430}\x{43c} \x{43f}\x{43e} \x{43f}\x{443}\x{442}\x{438}\x{bb}", '', '', "\x{422}\x{435}\x{440}\x{43c}\x{438}\x{43d}\x{430}\x{43b} \x{43e}\x{43f}\x{43b}\x{430}\x{442}\x{44b}", '', "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{43e}\x{439} \x{413}\x{41f}\x{41d}", "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{430}\x{43c}\x{438} \"MasterCard\"", "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{430}\x{43c}\x{438} \"Visa\"", "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{430}\x{43c}\x{438} \"Unioncard\"", '', "\x{420}\x{443}\x{447}\x{43d}\x{430}\x{44f} \x{43c}\x{43e}\x{439}\x{43a}\x{430}", '', "\x{41f}\x{43e}\x{434}\x{43a}\x{430}\x{447}\x{43a}\x{430} \x{448}\x{438}\x{43d}", '', "\x{417}\x{430}\x{43c}\x{435}\x{43d}\x{430} \x{43c}\x{430}\x{441}\x{43b}\x{430}", '', '', '', '', '207', "\x{435}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}", 0, 1, "\x{410}\x{417}\x{421} \x{413}\x{430}\x{437}\x{43f}\x{440}\x{43e}\x{43c}\x{43d}\x{435}\x{444}\x{442}\x{44c}-\x{423}\x{440}\x{430}\x{43b}, \x{41a}\x{438}\x{440}\x{43e}\x{432}\x{441}\x{43a}\x{438}\x{439} \x{440}\x{430}\x{439}\x{43e}\x{43d}, \x{2116}303", "\x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}", "\x{421}\x{438}\x{431}\x{438}\x{440}\x{441}\x{43a}\x{438}\x{439} \x{442}\x{440}\x{430}\x{43a}\x{442} \x{434}\x{443}\x{431}\x{43b}\x{435}\x{440} 5 \x{43a}\x{43c}, 1", '1267165676583540', "\x{410}\x{417}\x{421} \x{413}\x{430}\x{437}\x{43f}\x{440}\x{43e}\x{43c}\x{43d}\x{435}\x{444}\x{442}\x{44c}-\x{423}\x{440}\x{430}\x{43b}, \x{41a}\x{438}\x{440}\x{43e}\x{432}\x{441}\x{43a}\x{438}\x{439} \x{440}\x{430}\x{439}\x{43e}\x{43d}, \x{2116}303", "\x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}" ]; binmode STDOUT, ':utf8'; $csv->print(\*STDOUT, $row);
Sorry, in test example `always_quote' already set to 0. Set it to 1 to reproduce the problem.
Subject: Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date: Wed, 7 Nov 2012 19:48:21 +0100
To: bug-Text-CSV_XS [...] rt.cpan.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Wed, 7 Nov 2012 02:15:09 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text
> Recently I found, that some columns in my csv contains unreadable data. > It looks like "Екатеринбург". I attached minimal example to > reproduce this problem. I tested it on debian sid with perl 5.14.2 and > Text::CSV_XS 0.91. The strange thing is that if you'll set > `always_quote' to false, then all works fine. At the same time this test > works correctly with Text::CSV_PP independently of `always_quote' parameter.
I found the problem, but I do not have a solution yet. The cause is that in the lowest level, Text::CSV_XS uses bytes, and buffering. When the buffer is "full", it writes. In this case the buffer is not valid UTF8, as the bytes are split halfway a character, so the printed "stuff" is not encoded. When I increase the buffer size from 1024 to 4096, *this* specific problem disappears, but it is just waiting for the next to happen. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Срд Ноя 07 13:48:44 2012, h.m.brand@xs4all.nl писал: Show quoted text
> On Wed, 7 Nov 2012 02:15:09 -0500, "Oleg G via RT" > <bug-Text-CSV_XS@rt.cpan.org> wrote: >
> > Recently I found, that some columns in my csv contains unreadable
> data.
> > It looks like "Екатеринбург". I attached minimal example
> to
> > reproduce this problem. I tested it on debian sid with perl 5.14.2
> and
> > Text::CSV_XS 0.91. The strange thing is that if you'll set > > `always_quote' to false, then all works fine. At the same time this
> test
> > works correctly with Text::CSV_PP independently of `always_quote'
> parameter. > > I found the problem, but I do not have a solution yet. > > The cause is that in the lowest level, Text::CSV_XS uses bytes, and > buffering. When the buffer is "full", it writes. In this case the > buffer is not valid UTF8, as the bytes are split halfway a character, > so the printed "stuff" is not encoded. When I increase the buffer size > from 1024 to 4096, *this* specific problem disappears, but it is just > waiting for the next to happen. >
I think we should check end of the buffer to find such breaked unicode characters. And if found leave this bytes (beginning of next character) in the buffer. But I really don't know how this can be done. May be UTF8SKIP macros may help. Other possible solution is to not put half of the character to the buffer.
RT-Send-CC: h.m.brand [...] xs4all.nl
I attached my version of the patch. Algorithm is simple: do not put part of the character to the buffer if we are in utf8 mode
Subject: breaked-utf8.patch
--- CSV_XS.xs-orig 2012-08-17 12:43:13.000000000 +0700 +++ CSV_XS.xs 2012-11-08 21:49:14.657475530 +0700 @@ -28,6 +28,7 @@ #define MAINT_DEBUG 0 #define BUFFER_SIZE 1024 +#define UTF8_CHAR_SIZE 4 #define CSV_XS_TYPE_PV 0 #define CSV_XS_TYPE_IV 1 @@ -135,6 +136,10 @@ STRLEN size; STRLEN used; char buffer[BUFFER_SIZE]; + + char utf8_char[UTF8_CHAR_SIZE]; + STRLEN utf8_bytes_need; + STRLEN utf8_bytes_used; } csv_t; #define bool_opt_def(o,d) \ @@ -563,6 +568,8 @@ csv->utf8 = 0; csv->size = 0; csv->used = 0; + csv->utf8_bytes_need = 0; + csv->utf8_bytes_used = 0; csv->first_safe_char = csv->quote_space ? 0x21 : 0x20; @@ -620,12 +627,43 @@ } /* Print */ #define CSV_PUT(csv,dst,c) { \ - if ((csv)->used == sizeof ((csv)->buffer) - 1) { \ - unless (Print ((csv), (dst))) \ - return FALSE; \ - } \ - (csv)->buffer[(csv)->used++] = (c); \ - } + char utf8_broken_char = 0; \ + \ + if (csv->utf8) { \ + if (csv->utf8_bytes_need == 0) { \ + int char_len = UTF8SKIP(&c); \ + if (char_len > 1) { \ + if (char_len > UTF8_CHAR_SIZE) \ + croak("unexpected wide character size: %d > %d", char_len, UTF8_CHAR_SIZE); \ + csv->utf8_char[csv->utf8_bytes_used++] = c; \ + csv->utf8_bytes_need = char_len - 1; \ + utf8_broken_char = 1; \ + } \ + } \ + else { \ + csv->utf8_char[csv->utf8_bytes_used++] = c; \ + csv->utf8_bytes_need--; \ + utf8_broken_char = 1; \ + } \ + } \ + \ + if ((csv)->used == sizeof ((csv)->buffer) - 1 || (csv->used + csv->utf8_bytes_used > sizeof(csv->buffer) - 1)) { \ + unless (Print ((csv), (dst))) \ + return FALSE; \ + } \ + \ + if (csv->utf8_bytes_used != 0 && csv->utf8_bytes_need == 0) { \ + int i; \ + for (i = 0; i < csv->utf8_bytes_used; i++) { \ + csv->buffer[csv->used++] = csv->utf8_char[i]; \ + } \ + csv->utf8_bytes_used = 0; \ + } \ + else if (!utf8_broken_char) { \ + (csv)->buffer[(csv)->used++] = (c); \ + } \ + \ + } /* Should be extended for EBCDIC ? */ #define is_csv_binary(ch) ((ch < CH_SPACE || ch >= CH_DEL) && ch != CH_TAB)
Any news on this bug? Or comments about a patch?
Subject: Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date: Mon, 12 Nov 2012 18:04:39 +0100
To: bug-Text-CSV_XS [...] rt.cpan.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Thu, 8 Nov 2012 10:10:36 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text
> I attached my version of the patch. > Algorithm is simple: > do not put part of the character to the buffer if we are in utf8 mode
I took a different path. I think my approach is both safer and faster. Could you give it a sanity-check beyond how I tested it myself with this script: --8<--- #!/pro/.bin/perl use 5.16.0; use warnings; use Test::More; use Encode qw( encode decode ); use Text::CSV_XS; my $aq = shift // 0; my $csv = Text::CSV_XS->new ({ binary => 1, always_quote => $aq, # eol => "\n", auto_diag => 1, }); my $txt = "\x{415}\x{43a}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}\x{2116}"; foreach my $n (1 .. 4200) { foreach my $e (0 .. 3) { my $data = ("a"x$e).($txt x $n) ; my $enc = encode ("UTF-8", $data); my $exp = qq{1,"$enc"}; my $out = ""; open my $fh, ">:encoding(utf-8)", \$out; $csv->print ($fh, [ 1, $data ]); close $fh; my $l = length ($out); if ($out eq $exp) { ok (1, "$n/$e ($l)"); } else { is ($out, $exp, "Data $n/$e ($l)"); done_testing (); exit 1; } } } done_testing (); -->8--- And here is the patch: --8<--- CSV_XS.diff diff --git a/CSV_XS.xs b/CSV_XS.xs index a67a812..ce59274 100644 --- a/CSV_XS.xs +++ b/CSV_XS.xs @@ -587,6 +587,7 @@ static void cx_SetupCsv (pTHX_ csv_t *csv, HV *self, SV *pself) static int cx_Print (pTHX_ csv_t *csv, SV *dst) { int result; + int keep = 0; if (csv->useIO) { SV *tmp = newSVpv (csv->buffer, csv->used); @@ -597,8 +598,21 @@ static int cx_Print (pTHX_ csv_t *csv, SV *dst) PUSHs ((dst)); PUSHs (tmp); PUTBACK; - if (csv->utf8 && is_utf8_sv (tmp)) + if (csv->utf8) { + STRLEN len; + char *ptr; + int j, l; + + ptr = SvPV (tmp, len); + while (len > 0 && !is_utf8_sv (tmp) && keep < 16) { + ptr[--len] = (char)0; + SvCUR_set (tmp, len); + keep++; + } + for (j = 0; j < keep; j++) + csv->buffer[j] = csv->buffer[csv->used - keep + j]; SvUTF8_on (tmp); + } result = call_sv (m_print, G_SCALAR | G_METHOD); SPAGAIN; if (result) { @@ -615,7 +629,7 @@ static int cx_Print (pTHX_ csv_t *csv, SV *dst) } if (csv->utf8 && SvROK (dst) && is_utf8_sv (SvRV (dst))) SvUTF8_on (SvRV (dst)); - csv->used = 0; + csv->used = keep; return result; } /* Print */ -->8--- -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Show quoted text
> I took a different path. I think my approach is both safer and faster. > Could you give it a sanity-check beyond how I tested it myself with > this script:
Looks good. The only thing I can't understand is "16". Benchmark shows that your solution is faster: Rate Oleg Brend Oleg 56366/s -- -18% Brand 68923/s 22% --
Subject: Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date: Tue, 13 Nov 2012 08:43:37 +0100
To: bug-Text-CSV_XS [...] rt.cpan.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Tue, 13 Nov 2012 02:11:35 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text
> Queue: Text-CSV_XS > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=80680 > >
> > I took a different path. I think my approach is both safer and faster. > > Could you give it a sanity-check beyond how I tested it myself with > > this script:
> Looks good. The only thing I can't understand is "16".
You mean in 'use 5.16.0;'? Feel free to lower that bar, it is getting a habit of writing that instead of using 'use strict;' and get the latest features for free. Show quoted text
> Benchmark shows that your solution is faster: > Rate Oleg Brend > Oleg 56366/s -- -18% > Brand 68923/s 22% --
Good, I'll release soon. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Show quoted text
> You mean in 'use 5.16.0;'?
No, I mean "keep < 16"
Subject: Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date: Tue, 13 Nov 2012 09:46:50 +0100
To: bug-Text-CSV_XS [...] rt.cpan.org
From: "H.Merijn Brand" <h.m.brand [...] xs4all.nl>
On Tue, 13 Nov 2012 03:06:48 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text
> Queue: Text-CSV_XS > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=80680 > >
> > You mean in 'use 5.16.0;'?
> No, I mean "keep < 16"
That is to prevent tracking back into the (large) buffer beyond what could be a broken character. It is a safeguard for invalid UTF-8 that is still marked UTF-8 but shouldn't. If I wouldn't do this, there is an optional buffer overrun. /me is preparing a release … -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/
Show quoted text
> > No, I mean "keep < 16"
> > That is to prevent tracking back into the (large) buffer beyond what > could be a broken character. It is a safeguard for invalid UTF-8 that > is still marked UTF-8 but shouldn't. If I wouldn't do this, there is > an optional buffer overrun. > > /me is preparing a release … >
That's ok, but as I know maximum length of the valid utf-8 character is 4 bytes.