Bug #80680 for Text-CSV_XS: Text::CSV_XS produces garbage on some data

Wed Nov 07 02:15:08 2012 oleg [...] cpan.org - Ticket created

Subject:

Text::CSV_XS produces garbage on some data

Hello Recently I found, that some columns in my csv contains unreadable data. It looks like "ÐÐºÐ°ÑÐµÑÐ¸Ð½Ð±ÑÑÐ³". I attached minimal example to reproduce this problem. I tested it on debian sid with perl 5.14.2 and Text::CSV_XS 0.91. The strange thing is that if you'll set `always_quote' to false, then all works fine. At the same time this test works correctly with Text::CSV_PP independently of `always_quote' parameter.

Subject:

te.pl

use strict; use Text::CSV_XS; my $csv = Text::CSV_XS->new({eol => "\n", always_quote => 0}); my $row = [ "\x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}", "\x{410}\x{417}\x{421} \x{2116}303", " \x{421}\x{432}\x{435}\x{440}\x{434}\x{43b}\x{43e}\x{432}\x{441}\x{43a}\x{430}\x{44f} \x{43e}\x{431}\x{43b}\x{430}\x{441}\x{442}\x{44c}, \x{433}. \x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}, \x{414}\x{443}\x{431}\x{43b}\x{435}\x{440} \x{421}\x{438}\x{431}\x{438}\x{440}\x{441}\x{43a}\x{43e}\x{433}\x{43e} \x{442}\x{440}\x{430}\x{43a}\x{442}\x{430}, 5 \x{43a}\x{43c} \x{43b}\x{435}\x{432}\x{430}\x{44f} \x{441}\x{442}\x{43e}\x{440}\x{43e}\x{43d}\x{430} ", '', 'G-95', '95', '92', '80', '', "\x{414}\x{422}", '', '', "\x{41c}\x{430}\x{433}\x{430}\x{437}\x{438}\x{43d}", "\x{41a}\x{430}\x{444}\x{435}", "\x{422}\x{443}\x{430}\x{43b}\x{435}\x{442}", '', "\x{41a}\x{440}\x{443}\x{433}\x{43b}\x{43e}\x{441}\x{443}\x{442}\x{43e}\x{447}\x{43d}\x{430}\x{44f} \x{440}\x{430}\x{431}\x{43e}\x{442}\x{430}", '', "\x{41f}\x{440}\x{438}\x{43d}\x{438}\x{43c}\x{430}\x{44e}\x{442} \x{43a}\x{430}\x{440}\x{442}\x{44b} \x{ab}\x{41d}\x{430}\x{43c} \x{43f}\x{43e} \x{43f}\x{443}\x{442}\x{438}\x{bb}", '', '', "\x{422}\x{435}\x{440}\x{43c}\x{438}\x{43d}\x{430}\x{43b} \x{43e}\x{43f}\x{43b}\x{430}\x{442}\x{44b}", '', "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{43e}\x{439} \x{413}\x{41f}\x{41d}", "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{430}\x{43c}\x{438} \"MasterCard\"", "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{430}\x{43c}\x{438} \"Visa\"", "\x{41e}\x{43f}\x{43b}\x{430}\x{442}\x{430} \x{43a}\x{430}\x{440}\x{442}\x{430}\x{43c}\x{438} \"Unioncard\"", '', "\x{420}\x{443}\x{447}\x{43d}\x{430}\x{44f} \x{43c}\x{43e}\x{439}\x{43a}\x{430}", '', "\x{41f}\x{43e}\x{434}\x{43a}\x{430}\x{447}\x{43a}\x{430} \x{448}\x{438}\x{43d}", '', "\x{417}\x{430}\x{43c}\x{435}\x{43d}\x{430} \x{43c}\x{430}\x{441}\x{43b}\x{430}", '', '', '', '', '207', "\x{435}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}", 0, 1, "\x{410}\x{417}\x{421} \x{413}\x{430}\x{437}\x{43f}\x{440}\x{43e}\x{43c}\x{43d}\x{435}\x{444}\x{442}\x{44c}-\x{423}\x{440}\x{430}\x{43b}, \x{41a}\x{438}\x{440}\x{43e}\x{432}\x{441}\x{43a}\x{438}\x{439} \x{440}\x{430}\x{439}\x{43e}\x{43d}, \x{2116}303", "\x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}", "\x{421}\x{438}\x{431}\x{438}\x{440}\x{441}\x{43a}\x{438}\x{439} \x{442}\x{440}\x{430}\x{43a}\x{442} \x{434}\x{443}\x{431}\x{43b}\x{435}\x{440} 5 \x{43a}\x{43c}, 1", '1267165676583540', "\x{410}\x{417}\x{421} \x{413}\x{430}\x{437}\x{43f}\x{440}\x{43e}\x{43c}\x{43d}\x{435}\x{444}\x{442}\x{44c}-\x{423}\x{440}\x{430}\x{43b}, \x{41a}\x{438}\x{440}\x{43e}\x{432}\x{441}\x{43a}\x{438}\x{439} \x{440}\x{430}\x{439}\x{43e}\x{43d}, \x{2116}303", "\x{415}\x{43a}\x{430}\x{442}\x{435}\x{440}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}" ]; binmode STDOUT, ':utf8'; $csv->print(\*STDOUT, $row);

Wed Nov 07 02:20:27 2012 oleg [...] cpan.org - Correspondence added

Sorry, in test example `always_quote' already set to 0. Set it to 1 to reproduce the problem.

Wed Nov 07 02:20:27 2012 oleg [...] cpan.org - Status changed from 'new' to 'open'

Wed Nov 07 13:48:44 2012 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date:	Wed, 7 Nov 2012 19:48:21 +0100
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Wed, 7 Nov 2012 02:15:09 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> Recently I found, that some columns in my csv contains unreadable data. > It looks like "ÐÐºÐ°ÑÐµÑÐ¸Ð½Ð±ÑÑÐ³". I attached minimal example to > reproduce this problem. I tested it on debian sid with perl 5.14.2 and > Text::CSV_XS 0.91. The strange thing is that if you'll set > `always_quote' to false, then all works fine. At the same time this test > works correctly with Text::CSV_PP independently of `always_quote' parameter.

I found the problem, but I do not have a solution yet. The cause is that in the lowest level, Text::CSV_XS uses bytes, and buffering. When the buffer is "full", it writes. In this case the buffer is not valid UTF8, as the bytes are split halfway a character, so the printed "stuff" is not encoded. When I increase the buffer size from 1024 to 4096, *this* specific problem disappears, but it is just waiting for the next to happen. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Thu Nov 08 01:35:06 2012 oleg [...] cpan.org - Correspondence added

Срд Ноя 07 13:48:44 2012, h.m.brand@xs4all.nl писал: Show quoted text

> On Wed, 7 Nov 2012 02:15:09 -0500, "Oleg G via RT" > <bug-Text-CSV_XS@rt.cpan.org> wrote: >

> > Recently I found, that some columns in my csv contains unreadable

> data.

> > It looks like "ÐÐºÐ°ÑÐµÑÐ¸Ð½Ð±ÑÑÐ³". I attached minimal example

> to

> > reproduce this problem. I tested it on debian sid with perl 5.14.2

> and

> > Text::CSV_XS 0.91. The strange thing is that if you'll set > > `always_quote' to false, then all works fine. At the same time this

> test

> > works correctly with Text::CSV_PP independently of `always_quote'

> parameter. > > I found the problem, but I do not have a solution yet. > > The cause is that in the lowest level, Text::CSV_XS uses bytes, and > buffering. When the buffer is "full", it writes. In this case the > buffer is not valid UTF8, as the bytes are split halfway a character, > so the printed "stuff" is not encoded. When I increase the buffer size > from 1024 to 4096, *this* specific problem disappears, but it is just > waiting for the next to happen. >

I think we should check end of the buffer to find such breaked unicode characters. And if found leave this bytes (beginning of next character) in the buffer. But I really don't know how this can be done. May be UTF8SKIP macros may help. Other possible solution is to not put half of the character to the buffer.

Thu Nov 08 10:10:35 2012 oleg [...] cpan.org - Correspondence added

RT-Send-CC:

h.m.brand [...] xs4all.nl

I attached my version of the patch. Algorithm is simple: do not put part of the character to the buffer if we are in utf8 mode

Subject:

breaked-utf8.patch

--- CSV_XS.xs-orig 2012-08-17 12:43:13.000000000 +0700 +++ CSV_XS.xs 2012-11-08 21:49:14.657475530 +0700 @@ -28,6 +28,7 @@ #define MAINT_DEBUG 0 #define BUFFER_SIZE 1024 +#define UTF8_CHAR_SIZE 4 #define CSV_XS_TYPE_PV 0 #define CSV_XS_TYPE_IV 1 @@ -135,6 +136,10 @@ STRLEN size; STRLEN used; char buffer[BUFFER_SIZE]; + + char utf8_char[UTF8_CHAR_SIZE]; + STRLEN utf8_bytes_need; + STRLEN utf8_bytes_used; } csv_t; #define bool_opt_def(o,d) \ @@ -563,6 +568,8 @@ csv->utf8 = 0; csv->size = 0; csv->used = 0; + csv->utf8_bytes_need = 0; + csv->utf8_bytes_used = 0; csv->first_safe_char = csv->quote_space ? 0x21 : 0x20; @@ -620,12 +627,43 @@ } /* Print */ #define CSV_PUT(csv,dst,c) { \ - if ((csv)->used == sizeof ((csv)->buffer) - 1) { \ - unless (Print ((csv), (dst))) \ - return FALSE; \ - } \ - (csv)->buffer[(csv)->used++] = (c); \ - } + char utf8_broken_char = 0; \ + \ + if (csv->utf8) { \ + if (csv->utf8_bytes_need == 0) { \ + int char_len = UTF8SKIP(&c); \ + if (char_len > 1) { \ + if (char_len > UTF8_CHAR_SIZE) \ + croak("unexpected wide character size: %d > %d", char_len, UTF8_CHAR_SIZE); \ + csv->utf8_char[csv->utf8_bytes_used++] = c; \ + csv->utf8_bytes_need = char_len - 1; \ + utf8_broken_char = 1; \ + } \ + } \ + else { \ + csv->utf8_char[csv->utf8_bytes_used++] = c; \ + csv->utf8_bytes_need--; \ + utf8_broken_char = 1; \ + } \ + } \ + \ + if ((csv)->used == sizeof ((csv)->buffer) - 1 || (csv->used + csv->utf8_bytes_used > sizeof(csv->buffer) - 1)) { \ + unless (Print ((csv), (dst))) \ + return FALSE; \ + } \ + \ + if (csv->utf8_bytes_used != 0 && csv->utf8_bytes_need == 0) { \ + int i; \ + for (i = 0; i < csv->utf8_bytes_used; i++) { \ + csv->buffer[csv->used++] = csv->utf8_char[i]; \ + } \ + csv->utf8_bytes_used = 0; \ + } \ + else if (!utf8_broken_char) { \ + (csv)->buffer[(csv)->used++] = (c); \ + } \ + \ + } /* Should be extended for EBCDIC ? */ #define is_csv_binary(ch) ((ch < CH_SPACE || ch >= CH_DEL) && ch != CH_TAB)

Mon Nov 12 11:34:25 2012 oleg [...] cpan.org - Correspondence added

Any news on this bug? Or comments about a patch?

Mon Nov 12 12:04:52 2012 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date:	Mon, 12 Nov 2012 18:04:39 +0100
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Thu, 8 Nov 2012 10:10:36 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> I attached my version of the patch. > Algorithm is simple: > do not put part of the character to the buffer if we are in utf8 mode

I took a different path. I think my approach is both safer and faster. Could you give it a sanity-check beyond how I tested it myself with this script: --8<--- #!/pro/.bin/perl use 5.16.0; use warnings; use Test::More; use Encode qw( encode decode ); use Text::CSV_XS; my $aq = shift // 0; my $csv = Text::CSV_XS->new ({ binary => 1, always_quote => $aq, # eol => "\n", auto_diag => 1, }); my $txt = "\x{415}\x{43a}\x{438}\x{43d}\x{431}\x{443}\x{440}\x{433}\x{2116}"; foreach my $n (1 .. 4200) { foreach my $e (0 .. 3) { my $data = ("a"x$e).($txt x $n) ; my $enc = encode ("UTF-8", $data); my $exp = qq{1,"$enc"}; my $out = ""; open my $fh, ">:encoding(utf-8)", \$out; $csv->print ($fh, [ 1, $data ]); close $fh; my $l = length ($out); if ($out eq $exp) { ok (1, "$n/$e ($l)"); } else { is ($out, $exp, "Data $n/$e ($l)"); done_testing (); exit 1; } } } done_testing (); -->8--- And here is the patch: --8<--- CSV_XS.diff diff --git a/CSV_XS.xs b/CSV_XS.xs index a67a812..ce59274 100644 --- a/CSV_XS.xs +++ b/CSV_XS.xs @@ -587,6 +587,7 @@ static void cx_SetupCsv (pTHX_ csv_t *csv, HV *self, SV *pself) static int cx_Print (pTHX_ csv_t *csv, SV *dst) { int result; + int keep = 0; if (csv->useIO) { SV *tmp = newSVpv (csv->buffer, csv->used); @@ -597,8 +598,21 @@ static int cx_Print (pTHX_ csv_t *csv, SV *dst) PUSHs ((dst)); PUSHs (tmp); PUTBACK; - if (csv->utf8 && is_utf8_sv (tmp)) + if (csv->utf8) { + STRLEN len; + char *ptr; + int j, l; + + ptr = SvPV (tmp, len); + while (len > 0 && !is_utf8_sv (tmp) && keep < 16) { + ptr[--len] = (char)0; + SvCUR_set (tmp, len); + keep++; + } + for (j = 0; j < keep; j++) + csv->buffer[j] = csv->buffer[csv->used - keep + j]; SvUTF8_on (tmp); + } result = call_sv (m_print, G_SCALAR | G_METHOD); SPAGAIN; if (result) { @@ -615,7 +629,7 @@ static int cx_Print (pTHX_ csv_t *csv, SV *dst) } if (csv->utf8 && SvROK (dst) && is_utf8_sv (SvRV (dst))) SvUTF8_on (SvRV (dst)); - csv->used = 0; + csv->used = keep; return result; } /* Print */ -->8--- -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Tue Nov 13 02:11:34 2012 oleg [...] cpan.org - Correspondence added

Show quoted text

> I took a different path. I think my approach is both safer and faster. > Could you give it a sanity-check beyond how I tested it myself with > this script:

Looks good. The only thing I can't understand is "16". Benchmark shows that your solution is faster: Rate Oleg Brend Oleg 56366/s -- -18% Brand 68923/s 22% --

Tue Nov 13 02:43:51 2012 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date:	Tue, 13 Nov 2012 08:43:37 +0100
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Tue, 13 Nov 2012 02:11:35 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> Queue: Text-CSV_XS > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=80680 > >

> > I took a different path. I think my approach is both safer and faster. > > Could you give it a sanity-check beyond how I tested it myself with > > this script:

> Looks good. The only thing I can't understand is "16".

You mean in 'use 5.16.0;'? Feel free to lower that bar, it is getting a habit of writing that instead of using 'use strict;' and get the latest features for free. Show quoted text

> Benchmark shows that your solution is faster: > Rate Oleg Brend > Oleg 56366/s -- -18% > Brand 68923/s 22% --

Good, I'll release soon. -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Tue Nov 13 03:06:48 2012 oleg [...] cpan.org - Correspondence added

Show quoted text

> You mean in 'use 5.16.0;'?

No, I mean "keep < 16"

Tue Nov 13 03:53:49 2012 h.m.brand [...] xs4all.nl - Correspondence added

Subject:	Re: [rt.cpan.org #80680] Text::CSV_XS produces garbage on some data
Date:	Tue, 13 Nov 2012 09:46:50 +0100
To:	bug-Text-CSV_XS [...] rt.cpan.org
From:	"H.Merijn Brand" <h.m.brand [...] xs4all.nl>

On Tue, 13 Nov 2012 03:06:48 -0500, "Oleg G via RT" <bug-Text-CSV_XS@rt.cpan.org> wrote: Show quoted text

> Queue: Text-CSV_XS > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=80680 > >

> > You mean in 'use 5.16.0;'?

> No, I mean "keep < 16"

That is to prevent tracking back into the (large) buffer beyond what could be a broken character. It is a safeguard for invalid UTF-8 that is still marked UTF-8 but shouldn't. If I wouldn't do this, there is an optional buffer overrun. /me is preparing a release … -- H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.17 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Tue Nov 13 03:56:35 2012 oleg [...] cpan.org - Correspondence added

Show quoted text

> > No, I mean "keep < 16"

> > That is to prevent tracking back into the (large) buffer beyond what > could be a broken character. It is a safeguard for invalid UTF-8 that > is still marked UTF-8 but shouldn't. If I wouldn't do this, there is > an optional buffer overrun. > > /me is preparing a release … >

That's ok, but as I know maximum length of the valid utf-8 character is 4 bytes.

Tue Nov 13 04:43:20 2012 HMBRAND [...] cpan.org - Broken in 0.50 added

Tue Nov 13 04:43:20 2012 HMBRAND [...] cpan.org - Broken in 0.91 deleted

Tue Nov 13 04:43:20 2012 HMBRAND [...] cpan.org - Status changed from 'open' to 'patched'

Mon Nov 19 07:33:28 2012 HMBRAND [...] cpan.org - Fixed in 0.92 added

Mon Nov 19 07:33:28 2012 HMBRAND [...] cpan.org - Status changed from 'patched' to 'resolved'