Bug #107043 for Encode: If no BOM is found, the routine dies.

Fri Sep 11 16:41:56 2015 damian.lukowski [...] credativ.de - Ticket created

Subject:	If no BOM is found, the routine dies.
Date:	Fri, 11 Sep 2015 22:41:42 +0200
To:	bug-Encode [...] rt.cpan.org
From:	Damian Lukowski <damian.lukowski [...] credativ.de>

Hello, the Encode::Unicode documentation states the following:

> When BE or LE is omitted during decode(), it checks if BOM is at the beginning of the string; if one is found, the endianness is set to what the BOM says. If no BOM is found, the routine dies.

What is the justification for dying? The Unicode Standard Version 8.0 and RFC2781 define what to do with UTF-16 with no BOM. Unicode Standard excerpt:

> The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.

RFC2781:

> If the first two octets of the text is not 0xFE followed by 0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be interpreted as being big-endian.

Regards Damian
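
The RFC 2781 rule quoted in the report above can be sketched in a few lines of C. This is only an illustration of the rule, not the code in Unicode.xs; the names `utf16_order` and `utf16_detect_order` are hypothetical, and the sketch assumes no higher-level protocol has already fixed the byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Byte order chosen for a UTF-16 byte stream (hypothetical type). */
typedef enum { UTF16_BE, UTF16_LE } utf16_order;

/*
 * Pick the byte order for `buf` (`len` bytes) following the RFC 2781
 * rule: FE FF means big-endian, FF FE means little-endian, anything
 * else is interpreted as big-endian. `*skip` receives the number of
 * leading bytes that form the BOM (2 or 0) and are not part of the text.
 */
static utf16_order utf16_detect_order(const unsigned char *buf, size_t len,
                                      size_t *skip)
{
    assert(buf != NULL && skip != NULL);
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) { *skip = 2; return UTF16_BE; }
        if (buf[0] == 0xFF && buf[1] == 0xFE) { *skip = 2; return UTF16_LE; }
    }
    /* No BOM: fall back to big-endian and consume nothing. */
    return UTF16_BE;
}
```

Returning the skip count separately keeps the "was there a BOM?" question apart from the "which byte order?" question, so a caller that wants the old die-on-missing-BOM behaviour could still check `skip == 0` itself.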

Sat Sep 12 05:14:14 2015 dom [...] cpan.org - Correspondence added

Attached is a version of the patch the OP submitted on http://bugs.debian.org/798727, rebased against master, in case it's useful. I make no comment about its correctness. Cheers, Dominic.

Subject:

0001-When-no-BOM-is-found-use-big-endian-fallback.patch

From 74a7e40bcc5982189edc184e1cb39e3551aa7a91 Mon Sep 17 00:00:00 2001
From: Damian Lukowski <damian.lukowski@credativ.de>
Date: Sat, 12 Sep 2015 01:40:29 +0200
Subject: [PATCH] When no BOM is found, use big-endian fallback

RFC2781 and the Unicode Standard version 8.0:

The UTF-16 encoding scheme may or may not begin with BOM. However,
when there is no BOM, and in the absence of a higher-level protocol,
the byte order of the UTF-16 encoding scheme is big-endian.

If the first two octets of the text is not 0xFE followed by 0xFF, and
is not 0xFF followed by 0xFE, then the text SHOULD be interpreted as
big-endian.

Bug: https://rt.cpan.org/Ticket/Display.html?id=107043
[patch rebased and commit message supplied by Dominic Hargreaves]
---
 Unicode/Unicode.xs | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/Unicode/Unicode.xs b/Unicode/Unicode.xs
index 5f3bceb..6004f1e 100644
--- a/Unicode/Unicode.xs
+++ b/Unicode/Unicode.xs
@@ -166,9 +166,18 @@ CODE:
             endian = 'V';
         }
         else {
-            croak("%"SVf":Unrecognised BOM %"UVxf,
-                  *hv_fetch((HV *)SvRV(obj),"Name",4,0),
-                  bom);
+            /* No BOM found, use big-endian fallback as specified in
+             * RFC2781 and the Unicode Standard version 8.0:
+             *
+             *  The UTF-16 encoding scheme may or may not begin with
+             *  a BOM. However, when there is no BOM, and in the
+             *  absence of a higher-level protocol, the byte order
+             *  of the UTF-16 encoding scheme is big-endian.
+             *
+             *  If the first two octets of the text is not 0xFE
+             *  followed by 0xFF, and is not 0xFF followed by 0xFE,
+             *  then the text SHOULD be interpreted as big-endian.
+             */
         }
     }
 #if 1
-- 
2.1.4

Sat Sep 12 05:14:14 2015 The RT System itself - Status changed from 'new' to 'open'

Sat Sep 12 06:43:42 2015 damian.lukowski [...] credativ.de - Correspondence added

Subject:	Re: [rt.cpan.org #107043] If no BOM is found, the routine dies.
Date:	Sat, 12 Sep 2015 12:43:14 +0200
To:	bug-Encode [...] rt.cpan.org
From:	Damian Lukowski <damian.lukowski [...] credativ.de>

Unfortunately my first patch does not account for the first two octets when they are not a BOM. In that case one needs to reset the read pointer to the beginning.

root@d5305a0f945d:~# cat check-unicode.pl

use Encode qw/encode decode/;
my $str = 'ABCD';
printf "%s vs %s\n", $str, decode('utf-16be', encode('utf-16be', $str));
printf "%s vs %s\n", $str, decode('utf-16', encode('utf-16', $str));
printf "%s vs %s\n", $str, decode('utf-16', encode('utf-16be', $str));

root@d5305a0f945d:~# perl check-unicode.pl # debian version
ABCD vs ABCD
ABCD vs ABCD
UTF-16:Unrecognised BOM 41 at /usr/lib/x86_64-linux-gnu/perl/5.20/Encode.pm line 175.

root@d5305a0f945d:~# perl check-unicode.pl # first version of patch
ABCD vs ABCD
ABCD vs ABCD
ABCD vs BCD

root@d5305a0f945d:~# perl check-unicode.pl # second version of patch
ABCD vs ABCD
ABCD vs ABCD
ABCD vs ABCD

diff --git a/Unicode/Unicode.xs b/Unicode/Unicode.xs
index 5f3bceb..e309307 100644
--- a/Unicode/Unicode.xs
+++ b/Unicode/Unicode.xs
@@ -166,9 +166,19 @@ CODE:
             endian = 'V';
         }
         else {
-            croak("%"SVf":Unrecognised BOM %"UVxf,
-                  *hv_fetch((HV *)SvRV(obj),"Name",4,0),
-                  bom);
+            /* No BOM found, use big-endian fallback as specified in
+             * RFC2781 and the Unicode Standard version 8.0:
+             *
+             *  The UTF-16 encoding scheme may or may not begin with
+             *  a BOM. However, when there is no BOM, and in the
+             *  absence of a higher-level protocol, the byte order
+             *  of the UTF-16 encoding scheme is big-endian.
+             *
+             *  If the first two octets of the text is not 0xFE
+             *  followed by 0xFF, and is not 0xFF followed by 0xFE,
+             *  then the text SHOULD be interpreted as big-endian.
+             */
+            s -= size;
         }
     }
 #if 1
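
The difference between the two patch versions can be reproduced outside of Perl. The following is a simplified sketch, not the Unicode.xs code: the function name `decode_utf16_units` and the `rewind_on_no_bom` flag are hypothetical, and surrogate pairs are ignored for brevity. It reads the first code unit as a candidate BOM and, when that unit turns out to be ordinary text, either leaves the cursor where it is (as the first patch did) or moves it back by one unit (the `s -= size;` line of the second patch).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode UTF-16 bytes into 16-bit code units. The first unit is always
 * read as a candidate BOM. When it is not a BOM, the cursor is rewound
 * only if `rewind_on_no_bom` is nonzero: 0 models the first patch,
 * 1 models the second. Returns the number of units written to `out`.
 */
static size_t decode_utf16_units(const unsigned char *buf, size_t len,
                                 int rewind_on_no_bom,
                                 uint16_t *out, size_t out_cap)
{
    const size_t size = 2;            /* bytes per UTF-16 code unit */
    const unsigned char *s = buf;
    const unsigned char *end = buf + len;
    int little_endian = 0;            /* big-endian unless a BOM says otherwise */
    size_t n = 0;

    assert(buf != NULL && out != NULL);

    if ((size_t)(end - s) >= size) {
        /* Read the candidate BOM as a big-endian unit and advance. */
        uint16_t bom = (uint16_t)((s[0] << 8) | s[1]);
        s += size;
        if (bom == 0xFFFE) {
            little_endian = 1;        /* bytes FF FE */
        } else if (bom != 0xFEFF) {
            /* Not a BOM: these two bytes are text, so put them back. */
            if (rewind_on_no_bom)
                s -= size;
        }
    }

    while ((size_t)(end - s) >= size && n < out_cap) {
        out[n++] = little_endian ? (uint16_t)(s[0] | (s[1] << 8))
                                 : (uint16_t)((s[0] << 8) | s[1]);
        s += size;
    }
    return n;
}
```

Feeding this the big-endian bytes of 'ABCD' without a BOM yields three units (B, C, D) when the flag is 0 and four units (A, B, C, D) when it is 1, which matches the "ABCD vs BCD" and "ABCD vs ABCD" lines in the transcript above.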

Tue Sep 15 09:51:00 2015 DANKOGAI [...] cpan.org - Correspondence added

Thank you. Your patch is in as:

https://github.com/dankogai/p5-encode/commit/27682d02f7ac0669043faeb419dd5a104eecfb73

Dan the Maintainer Thereof

Tue Sep 15 09:52:12 2015 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Sun Nov 22 14:15:09 2015 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Tue Sep 15 09:51:00 2015, DANKOGAI wrote:

> Thank you. Your patch is in as:
>
> https://github.com/dankogai/p5-encode/commit/27682d02f7ac0669043faeb419dd5a104eecfb73

This unfortunately broke my code. I was depending on the well-documented feature that it dies when there is no BOM. Then my code would fall back to its own algorithm to determine the byte order. Now it always falls back to BE, which is wrong half the time. The main problem is that you have now changed what was a well-documented feature for at least fifteen years. How many other people are also depending on it?

Mon Nov 23 05:23:49 2015 damian.lukowski [...] credativ.de - Correspondence added

Subject:	Re: [rt.cpan.org #107043] If no BOM is found, the routine dies.
Date:	Mon, 23 Nov 2015 11:23:37 +0100
To:	bug-Encode [...] rt.cpan.org
From:	Damian Lukowski <damian.lukowski [...] credativ.de>

Hi,

amavis failed to process mail content in UTF-16 without a BOM when using the former version of Encode::Unicode, which caused the calling MTA to defer such mails indefinitely. One cannot blame amavis for this, as those mails don't violate any specification. Amavis uses Encode::Unicode indirectly via MIME::Parser, so it is not even fixable there.

One could have argued for dealing with the problem within MIME::Tools or MIME::WordDecoder if the fallback behaviour were defined only in the MIME context. However, it is also defined in the Unicode standard itself, so fixing it in Encode::Unicode is the only sensible option in my opinion.

Regards
Damian

Mon Nov 23 05:31:26 2015 damian.lukowski [...] credativ.de - Correspondence added

Subject:	Re: [rt.cpan.org #107043] If no BOM is found, the routine dies.
Date:	Mon, 23 Nov 2015 11:31:17 +0100
To:	bug-Encode [...] rt.cpan.org
From:	Damian Lukowski <damian.lukowski [...] credativ.de>

Under which definition is this wrong?

On 22.11.2015 at 20:15, Father Chrysostomos via RT wrote:

> Now it always falls back to BE, which is wrong half the time.

Regards
Damian

Mon Nov 23 18:49:07 2015 $_ = 'spro^^*%*^6ut# [...] &$%*c>#!^!#&!pan.org'; y/a-z.@//cd; print - Correspondence added

On Mon Nov 23 05:31:26 2015, damian.lukowski@credativ.de wrote:

> Under which definition is this wrong?
>
> On 22.11.2015 at 20:15, Father Chrysostomos via RT wrote:
>
> > Now it always falls back to BE, which is wrong half the time.

In CSS, a file beginning with "\@\0c\0" is to be treated as UTF16-LE, even in the absence of a BOM. But the encoding specified later on the line (@charset "utf16") must be validated to match the actual encoding of the file.

This is where I was using the encoding specified in the file to decode the first line, for the sake of validation, and to avoid listing all eight or so spellings of utf16-le (ucs2, ucs-2-le, etc.) in my code. I was relying on the fact that Encode died (as documented), and then tacking "-le" on to the end and trying a second time.

If the previous behaviour caused the real-world problem you described, that’s fine. I’ll just change my code to work around the new behaviour.
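
One way to work around the new behaviour is to sniff the byte order from the leading "@c" before handing the data to the decoder, instead of waiting for the decoder to die. The sketch below is an assumption about how such a check could look, not the code referred to in this message: the names `sniff_result` and `sniff_css_utf16` are hypothetical, and the big-endian pattern "\0@\0c" is inferred by symmetry with the little-endian pattern mentioned above rather than taken from the ticket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Result of looking at the first bytes of a style sheet (hypothetical type). */
typedef enum { SNIFF_UNKNOWN, SNIFF_UTF16_LE, SNIFF_UTF16_BE } sniff_result;

/*
 * Guess the UTF-16 byte order of a BOM-less style sheet from the start
 * of an "@charset" rule. "@\0c\0" is treated as little-endian; the
 * mirrored "\0@\0c" is assumed to indicate big-endian.
 */
static sniff_result sniff_css_utf16(const unsigned char *buf, size_t len)
{
    static const unsigned char le[4] = { '@', 0x00, 'c', 0x00 };
    static const unsigned char be[4] = { 0x00, '@', 0x00, 'c' };

    assert(buf != NULL);
    if (len < 4)
        return SNIFF_UNKNOWN;
    if (memcmp(buf, le, 4) == 0)
        return SNIFF_UTF16_LE;
    if (memcmp(buf, be, 4) == 0)
        return SNIFF_UTF16_BE;
    return SNIFF_UNKNOWN;
}
```

With the result in hand, the caller can request an explicit little-endian or big-endian decoding up front and only fall back to the plain "utf-16" name when the sniff is inconclusive.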