Bug #73623 for Encode: [perl #107326] perl's unicode conversion fails when iconv succeeds

Fri Dec 30 14:00:32 2011 perlbug-followup [...] perl.org - Ticket created

CC:	perl5-porters [...] perl.org, bug-Encode [...] rt.cpan.org
Subject:	[perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Fri, 30 Dec 2011 11:00:23 -0800
To:	"OtherRecipients of perl Ticket #107326":;
From:	"Father Chrysostomos via RT" <perlbug-followup [...] perl.org>

On Fri Dec 30 10:41:46 2011, LAWalsh wrote:

> This is a bug report for perl from perl-diddler@tlinx.org,
> generated with the help of perlbug 1.39 running under perl 5.12.3.
>
> -----------------------------------------------------------------
> [Please describe your issue here]
>
> Was looking at ways to do upper/lower case compare, and bumped into
> piconv as being a 'drop in replacement for "iconv"'. So I decided to try
> it thinking it would be a 'hoot' if it was faster.
>
> Rather than faster, it choked at the beginning of my 98M test file
> (i.e. I truncated the file to the first few lines, 672 bytes), which
> reproduces the problem just fine .. Très sad...

You’re right:

$ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in
UTF-16:Unrecognised BOM d at /usr/local/lib/perl5/5.15.6/darwin-thread-multi-2level/Encode.pm line 196, <$ifh> line 2.

The file begins with <FF><FE>.

If I use utf-16le explicitly, it does the first line correctly, but quickly switches to Chinese, which means it’s off by one byte. If I use utf-16be explicitly, the first line is in Chinese.

This is part of the Encode distribution, for which CPAN is upstream, so I’m forwarding this to the CPAN ticket.

--
Father Chrysostomos

---
via perlbug: queue: perl5 status: new
https://rt.perl.org:443/rt3/Ticket/Display.html?id=107326

Attachment: test.in (application/octet-stream, 672 bytes)

Fri Dec 30 17:36:19 2011 IKEGAMI [...] cpan.org - Correspondence added

On Fri Dec 30 14:00:32 2011, perlbug-followup@perl.org wrote:

> On Fri Dec 30 10:41:46 2011, LAWalsh wrote:
> > This is a bug report for perl from perl-diddler@tlinx.org,
> > generated with the help of perlbug 1.39 running under perl 5.12.3.
> >
> > -----------------------------------------------------------------
> > [Please describe your issue here]
> >
> > Was looking at ways to do upper/lower case compare, and bumped into
> > piconv as being a 'drop in replacement for "iconv"'. So I decided to try
> > it thinking it would be a 'hoot' if it was faster.
> >
> > Rather than faster, it choked at the beginning of my 98M test file
> > (i.e. I truncated the file to the first few lines, 672 bytes), which
> > reproduces the problem just fine .. Très sad...
>
> You’re right:
>
> $ piconv5.15.6 -f utf16 -t utf-8 /Users/sprout/Downloads/test.in
> UTF-16:Unrecognised BOM d at
> /usr/local/lib/perl5/5.15.6/darwin-thread-multi-2level/Encode.pm line
> 196, <$ifh> line 2.
>
> The file begins with <FF><FE>.
>
> If I use utf-16le explicitly, it does the first line correctly, but
> quickly switches to Chinese, which means it’s off by one byte.

It sounds like it's reading line-by-line, where a line is a sequence of bytes ended by 0A. Of course, that's wrong for UTF-16le (and UTF-16be, for that matter).
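To illustrate the point about line-by-line reading, here is a minimal Perl sketch. It is not piconv's actual code; the sample text and the helper name hex_bytes are made up for illustration. It encodes two short lines as UTF-16LE, then splits the raw bytes after every 0A byte, the way a byte-oriented line read would.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord($_) } split //, shift }

# Two short lines of sample text, encoded as UTF-16LE (no BOM).
my $bytes = encode('UTF-16LE', "ab\ncd\n");

# Split the raw bytes after every 0A byte, as a byte-oriented
# line read on an undecoded handle would.
my @chunks = split /(?<=\x0A)/, $bytes;

# Every "line" after the first starts one byte late, so its
# 16-bit units pair up wrongly.
print hex_bytes($_), "\n" for @chunks;

# Decoding the second chunk on its own yields unrelated characters.
my $garbled = decode('UTF-16LE', $chunks[1]);
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $garbled), "\n";
```

The first chunk ends in the middle of the line feed (61 00 62 00 0A), so the second chunk begins with the stray 00 and decodes to U+6300 U+6400 U+0A00. The first two are CJK ideographs, which is consistent with the "switches to Chinese" symptom reported above.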

Fri Dec 30 17:36:20 2011 The RT System itself - Status changed from 'new' to 'open'

Fri Dec 30 17:49:01 2011 IKEGAMI [...] cpan.org - Correspondence added

Fix:

- my $need2slurp = $use_bom{ find_encoding($to)->name };
+ my $need2slurp = $use_bom{ find_encoding($from)->name };
+ if ($Opt{debug}){
+     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
+ }
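The idea behind the change can be tried outside piconv with a small standalone sketch. The helper name read_mode and the contents of %needs_whole_input are assumptions made for illustration; the real %use_bom table in piconv may list different encodings. What the sketch shows is that the read mode has to be chosen from the source encoding, since that is the encoding of the bytes being read.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Assumed for illustration: encodings whose raw input cannot be split
# safely on 0A bytes. The real table in piconv may differ.
my %needs_whole_input = map { $_ => 1 }
    qw(UTF-16 UTF-16BE UTF-16LE UTF-32 UTF-32BE UTF-32LE);

# Hypothetical helper: pick the read mode from the *source* encoding.
sub read_mode {
    my ($from) = @_;
    my $enc = find_encoding($from)
        or die "Unknown encoding: $from\n";
    # find_encoding() resolves aliases, so look up the canonical name.
    return $needs_whole_input{ $enc->name } ? 'Slurp' : 'Line';
}

printf "%-9s -> %s\n", $_, read_mode($_) for qw(UTF-16 utf-16le UTF-8);
```

With a UTF-16 source and a UTF-8 target, keying the lookup on the target encoding would select line mode, which is the failure seen in this ticket.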

Fri Dec 30 18:15:24 2011 pause [...] tlinx.org - Correspondence added

On Fri Dec 30 17:49:01 2011, ikegami wrote:

> Fix:
>
> - my $need2slurp = $use_bom{ find_encoding($to)->name };
> + my $need2slurp = $use_bom{ find_encoding($from)->name };
> + if ($Opt{debug}){
> +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
> + }

----
Not to be pushy or anything, but where does one apply that fix? I couldn't find any need2slurp in my /usr/lib/perl5/{5.1{0.0,2.{1,3}}.0,{site,vendor}_perl} library dirs, so I don't know that the above lines were responsible for this particular breakage... but then I may not be searching in the right spots...

As for the lines in the file I submitted-- they looked like they all had CRLF as line separators...

Fri Dec 30 18:32:00 2011 ikegami [...] adaelis.com - Correspondence added

CC:	perlbug-followup [...] perl.org
Subject:	Re: [rt.cpan.org #73623] [perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Fri, 30 Dec 2011 18:31:51 -0500
To:	bug-Encode [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Fri, Dec 30, 2011 at 6:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 >
>
> On Fri Dec 30 17:49:01 2011, ikegami wrote:
> > Fix:
> >
> > - my $need2slurp = $use_bom{ find_encoding($to)->name };
> > + my $need2slurp = $use_bom{ find_encoding($from)->name };
> > + if ($Opt{debug}){
> > +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
> > + }
>
> ----
> Not to be pushy or anything, but where does one apply that fix? I
> couldn't find any need2slurp in my
> /usr/lib/perl5/{5.1{0.0,2.{1,3}}.0,{site,vendor}_perl} library dirs, so
> I don't know that the above lines were responsible for this particular
> breakage... but then I may not be searching in the right spots...
>
> As for the lines in the file I submitted-- they looked like they all had
> CRLF as line separators...

piconv

Fri Dec 30 18:39:23 2011 ikegami [...] adaelis.com - Correspondence added

CC:	perlbug-followup [...] perl.org
Subject:	Re: [rt.cpan.org #73623] [perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Fri, 30 Dec 2011 18:39:14 -0500
To:	bug-Encode [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Fri, Dec 30, 2011 at 6:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

> As for the lines in the file I submitted-- they looked like they all had
> CRLF as line separators...

Probably. And not really relevant. piconv was treating your file as a series of lines ending with 0A *before decoding*. LF is not 0A in UTF-16le, and an 0A is not necessarily part of a LF in UTF-16le.
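Both halves of that statement can be checked with a short Perl sketch. U+010A is just a sample code point chosen for illustration, and hex_bytes is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord($_) } split //, shift }

# A line feed is two bytes in UTF-16LE, not the single byte 0A.
my $lf = encode('UTF-16LE', "\n");

# U+010A is not a line feed, yet its encoding contains an 0A byte.
my $c_dot = encode('UTF-16LE', "\x{010A}");

print "U+000A: ", hex_bytes($lf),    "\n";
print "U+010A: ", hex_bytes($c_dot), "\n";
```

The line feed comes out as 0A 00 and U+010A as 0A 01, so a reader that stops at every 0A byte both cuts real line feeds in half and breaks at characters that are not line feeds at all.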

Fri Dec 30 18:44:35 2011 pause [...] tlinx.org - Correspondence added

On Fri Dec 30 17:49:01 2011, ikegami wrote:

> Fix:
>
> - my $need2slurp = $use_bom{ find_encoding($to)->name };
> + my $need2slurp = $use_bom{ find_encoding($from)->name };
> + if ($Opt{debug}){
> +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
> + }

=====
Partly works:

> piconv -f UTF-16 -t UTF-8 <test.in >test.out
> iconv -f UTF-16 -t UTF-8 <test.in >testi.out
> cmp testi.out test.out && echo ok
ok
> piconv -f UTF-8 -t UTF-16 <test.out >test2.out
> cmp testi.in test2.out
test.in test2.out differ: byte 1, line 1

test.out was same size

Fri Dec 30 18:49:46 2011 pause [...] tlinx.org - Correspondence added

On Fri Dec 30 18:44:35 2011, LAWALSH wrote:

> On Fri Dec 30 17:49:01 2011, ikegami wrote:
> > Fix:
> >
> > - my $need2slurp = $use_bom{ find_encoding($to)->name };
> > + my $need2slurp = $use_bom{ find_encoding($from)->name };
> > + if ($Opt{debug}){
> > +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
> > + }
>
> =====
> Partly works:
>
> > piconv -f UTF-16 -t UTF-8 <test.in >test.out
> > iconv -f UTF-16 -t UTF-8 <test.in >testi.out
> > cmp testi.out test.out && echo ok
> ok
> > piconv -f UTF-8 -t UTF-16 <test.out >test2.out
> > cmp testi.in test2.out
        ^^^^ typo.. was 'test'...

Anyway, the piconv doesn't do round trip, the way iconv does. Sounds like it might be assuming UTF-16 means BE and not LE? Just a WAG..

Fri Dec 30 18:57:52 2011 pause [...] tlinx.org - Correspondence added

On Fri Dec 30 18:49:46 2011, LAWALSH wrote:

> On Fri Dec 30 18:44:35 2011, LAWALSH wrote:
>> # piconv -f UTF-8 -t UTF-16 <test.out >test2.out
>> # cmp test.in test2.out
>> test.in test2.out differ: byte 1, line 1
>> test.out was same size
>
> Sounds like it might be assuming UTF-16 means BE and not LE?

----
Yup:

cmp -l -b test.in test2.out
  1 377 M-^?  376 M-~
  2 376 M-~   377 M-^?
  3 127 W       0 ^@
  4   0 ^@    127 W
  5 151 i       0 ^@
...
671  12 ^J      0 ^@
672   0 ^@    134 \
cmp: EOF on test.in

Fri Dec 30 19:01:30 2011 ikegami [...] adaelis.com - Correspondence added

CC:	perlbug-followup [...] perl.org
Subject:	Re: [rt.cpan.org #73623] [perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Fri, 30 Dec 2011 19:01:22 -0500
To:	bug-Encode [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 >
>
> On Fri Dec 30 17:49:01 2011, ikegami wrote:
> > Fix:
> >
> > - my $need2slurp = $use_bom{ find_encoding($to)->name };
> > + my $need2slurp = $use_bom{ find_encoding($from)->name };
> > + if ($Opt{debug}){
> > +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
> > + }
>
> =====
> Partly works:
>
> > piconv -f UTF-16 -t UTF-8 <test.in >test.out
> > iconv -f UTF-16 -t UTF-8 <test.in >testi.out
> > cmp testi.out test.out && echo ok
> ok
> > piconv -f UTF-8 -t UTF-16 <test.out >test2.out
> > cmp testi.in test2.out
> test.in test2.out differ: byte 1, line 1

C<< decode('UTF-16', ...) >> both requires a BOM and removes it (intentionally). If you want to keep the BOM, use UTF-16le (the actual encoding) instead of UTF-16.

This is unrelated to this ticket.

- Eric

Fri Dec 30 19:04:31 2011 ikegami [...] adaelis.com - Correspondence added

CC:	perlbug-followup [...] perl.org
Subject:	Re: [rt.cpan.org #73623] [perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Fri, 30 Dec 2011 19:04:22 -0500
To:	bug-Encode [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Fri, Dec 30, 2011 at 7:01 PM, Eric Brine <ikegami@adaelis.com> wrote:

> On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:
>
>> <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 >
>>
>> On Fri Dec 30 17:49:01 2011, ikegami wrote:
>> > Fix:
>> >
>> > - my $need2slurp = $use_bom{ find_encoding($to)->name };
>> > + my $need2slurp = $use_bom{ find_encoding($from)->name };
>> > + if ($Opt{debug}){
>> > +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
>> > + }
>>
>> =====
>> Partly works:
>>
>> > piconv -f UTF-16 -t UTF-8 <test.in >test.out
>> > iconv -f UTF-16 -t UTF-8 <test.in >testi.out
>> > cmp testi.out test.out && echo ok
>> ok
>> > piconv -f UTF-8 -t UTF-16 <test.out >test2.out
>> > cmp testi.in test2.out
>> test.in test2.out differ: byte 1, line 1

Correction/elaboration:

C<< decode('UTF-16', ...) >> both requires a BOM and removes it
> (intentionally).

...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be instead of UTF-16le.

You need C<< -to UTF-16le >> to use UTF-16le (instead of UTF-16be), but that won't add the BOM; you need to avoid removing it in the first place by using C<< -from UTF-16le >>.

- Eric
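The BOM behaviour described here can be reproduced with Encode directly. This is a minimal sketch on a made-up two-character sample rather than the attached test.in, and hex_bytes is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord($_) } split //, shift }

# Sample input: a little-endian BOM (FF FE) followed by "Hi" in UTF-16LE.
my $original = "\xFF\xFE" . encode('UTF-16LE', 'Hi');

# 'UTF-16' consumes the BOM on decode and writes a big-endian BOM on encode.
my $text      = decode('UTF-16', $original);
my $via_utf16 = encode('UTF-16', $text);

# 'UTF-16LE' leaves the BOM in the text as U+FEFF, so it survives the trip.
my $kept   = decode('UTF-16LE', $original);
my $via_le = encode('UTF-16LE', $kept);

print "original : ", hex_bytes($original),  "\n";
print "UTF-16   : ", hex_bytes($via_utf16), "\n";
print "UTF-16LE : ", hex_bytes($via_le),    "\n";
```

The UTF-16 round trip comes back as FE FF 00 48 00 69, byte-swapped relative to the input, which is the same pattern the cmp output earlier in the ticket shows. The UTF-16LE round trip reproduces the original bytes exactly.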

Fri Dec 30 21:15:04 2011 pause [...] tlinx.org - Correspondence added

On Fri Dec 30 19:04:31 2011, ikegami@adaelis.com wrote:

> On Fri, Dec 30, 2011 at 7:01 PM, Eric Brine <ikegami@adaelis.com> wrote:
>
> > On Fri, Dec 30, 2011 at 6:44 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:
> >
> >> <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 >
> >>
> >> On Fri Dec 30 17:49:01 2011, ikegami wrote:
> >> > Fix:
> >> >
> >> > - my $need2slurp = $use_bom{ find_encoding($to)->name };
> >> > + my $need2slurp = $use_bom{ find_encoding($from)->name };
> >> > + if ($Opt{debug}){
> >> > +     printf "Read mode: %s\n", $need2slurp ? 'Slurp' : 'Line';
> >> > + }
> >>
> >> =====
> >> Partly works:
> >>
> >> > piconv -f UTF-16 -t UTF-8 <test.in >test.out
> >> > iconv -f UTF-16 -t UTF-8 <test.in >testi.out
> >> > cmp testi.out test.out && echo ok
> >> ok
> >> > piconv -f UTF-8 -t UTF-16 <test.out >test2.out
> >> > cmp testi.in test2.out
> >> test.in test2.out differ: byte 1, line 1
> >
> > Sounds like it might be assuming UTF-16 means BE and not LE?

----
Yup:

cmp -l -b test.in test2.out
  1 377 M-^?  376 M-~
  2 376 M-~   377 M-^?

> Correction/elaboration:
>
> C<< decode('UTF-16', ...) >> both requires a BOM and removes it
> > (intentionally).

---
How is that a correction??

> ...and C<< encode('UTF-16', ...) >> adds it back, but uses UTF-16be instead
> of UTF-16le.

-----
Ah, then there's two rubs:

1) ...why would encode convert to BE on a LE machine? Seems like exactly the wrong decision to make.

2) since piconv states that it is "designed to be a drop in replacement for iconv" and "iconv seems to assume LE", (maybe it only does so on LE machines?)... then I would assert there is still a problem.

Fri Dec 30 23:26:12 2011 ikegami [...] adaelis.com - Correspondence added

CC:	perlbug-followup [...] perl.org
Subject:	Re: [rt.cpan.org #73623] [perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Fri, 30 Dec 2011 23:26:02 -0500
To:	bug-Encode [...] rt.cpan.org
From:	Eric Brine <ikegami [...] adaelis.com>

On Fri, Dec 30, 2011 at 9:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:

> <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 >
>
> How is that a correction??

I was correcting what *I* said.

> 1) ...why would encode convert to BE on a LE machine?

What does Encode have to do with your machine?

> 2) since piconv states that it is "designed to be a drop in replacement for
> iconv" and "iconv seems to assume LE", (maybe it only does so on LE
> machines?)... then I would assert there is still a problem.

Yes. Go ahead and file a bug if you want.

Sat Dec 31 02:39:35 2011 pause [...] tlinx.org - Correspondence added

On Fri Dec 30 23:26:12 2011, ikegami@adaelis.com wrote:

> On Fri, Dec 30, 2011 at 9:15 PM, Linda A Walsh via RT <bug-Encode@rt.cpan.org> wrote:
>
> > <URL: https://rt.cpan.org/Ticket/Display.html?id=73623 >
> >
> > How is that a correction??
>
> I was correcting what *I* said.
>
> > 1) ...why would encode convert to BE on a LE machine?
>
> What does Encode have to do with your machine?

----
That's where the test was run. Data is usually in the machine's native format unless you are specifically trying to export it somewhere (like over the Net, then 'network byte order' is used).

> > 2) since piconv states that it is "designed to be a drop in replacement for
> > iconv" and "iconv seems to assume LE", (maybe it only does so on LE
> > machines?)... then I would assert there is still a problem.
>
> Yes. Go ahead and file a bug if you want.

---
The original test case showed using iconv in 2 directions... for some reason the perlbug SW chopped that off: anything after the uuencoded file I included was chopped off. That part had a whole explanation and demonstration of the bug using the above data file (above in the original bug report that seems to have been corrupted by perl's bug system).

The bug was that piconv didn't work as a drop in for iconv. With iconv, I took a simple doc and converted it to utf-8 and then back to utf-16, and the original and the twice-converted file compared identical. I tried to do the same with piconv, but piconv failed at the first step.

Why the original bug report was truncated at the data point seems to be another bug in the perlbug reporting system. Perhaps it would be better to report that one, as this one is still not fixed: the title, perl's conversion fails when iconv succeeds, is still true. That's why I said 'closer', but not quite there.

Tue Jan 03 09:00:50 2012 zefram [...] fysh.org - Correspondence added

CC:	perlbug-followup [...] perl.org
Subject:	Re: [rt.cpan.org #73623] [perl #107326] perl's unicode conversion fails when iconv succeeds
Date:	Tue, 3 Jan 2012 14:00:37 +0000
To:	Linda A Walsh via RT <bug-Encode [...] rt.cpan.org>
From:	Zefram <zefram [...] fysh.org>

Linda A Walsh via RT wrote:

>Data is usually in the machine's native format unless you are
>specifically trying to export it somewhere

That was the case in the 1980s. Times have changed; machines are more interconnected than they used to be.

-zefram