Bug #64371 for Encode: Update for 5.14 API changes

Fri Dec 31 16:29:28 2010 khw [...] cpan.org - Ticket created

Subject:

Update for 5.14 API changes

Hi Dan, The attached patch changes two .xs files to compensate for the planned changes in the core API for 5.14. After much discussion on p5p, the consensus is that to fix various misbehaviors with utf8, the best solution is to change the API somewhat. This solution requires the fewest source code changes. In fact, the attached changes are the only ones required in all of CPAN. Whether changes to Encode should go first through you, or first through blead has not been discussed on p5p. A third option is first-come, first-served, so that either your or p5p can create patches. If you have an opinion on this, please let me or the p5p email list know. The patches try to keep the external behavior as close as possible to the current one; I've tried to give a full explanation in the commit message. The severity is 'critical' because we can't make the changes to blead until these are applied. It would be good to get these into the next 5.13.x release of blead, which means it would need to be applied within the next couple weeks. If you would rather I do that, I can, even if you would rather keep control in general. Thanks for maintaining Encode!

Subject:

0002-Update-for-core-5.14-API-changes.patch

From a8678e5edb9e8ddcdaaa291aa906c8308b6a9ec2 Mon Sep 17 00:00:00 2001 From: Karl Williamson <public@khwilliamson.com> Date: Fri, 31 Dec 2010 11:12:51 -0700 Subject: [PATCH] Update for core 5.14 API changes This patch updates two .xs files to accommodate the changes that are planned in the Perl 5.14 core, while maintaining backward source compatibility so that Encode can be used with earlier Perls. The core is changing so that all code points that fit in a UV are legal and do not raise warnings or errors when converted to/from utf8, except during input and output. This means that a 0 passed in the flags fields to the routines that do this conversion means to accept any UVs instead of forbidding all potentially problematic ones. That means that any code that doesn't want to accept certain categories must change to use new flags that specify that behavior. Since Encode is used for I/O, it needs to change so that it can continue to check for these potentially problematic code points. The categories that are changing are the surrogates, the non-character code points (such as 0xFFFF, 0xFFFE, ...), the code points above the legal Unicode maximum of 0x10FFFF (called supers), and the code points way above that maximum, above what was the original utf8 limit of 0x7FFFFFFF. The two affected functions are utf8n_to_uvuni() and uvuni_to_utf8_flags(). Both .xs files are changed to have #ifdefs and #defines so they can work on Perls which don't have the latest flags #defined. Besides that, here are the changes and their rationales. First, for Encode.xs, it only calls one of the affected functions, utf8n_to_utf8(). One of the calls uses the flag UTF8_ALLOW_ANY. That flag will continue to have the same meaning as currently, so no change is needed here. The other two calls use the Encode #defines UTF8_ALLOW_STRICT and UTF8_ALLOW_NONSTRICT. The effect of the first define is to exclude the 4 categories. The effect of the other is to allow them. The only needed changes to the file are to adjust the definitions of these to get as close to the existing behavior as desirable. The definition of nonstrict doesn't have to change, so the only change needed is to UTF8_ALLOW_STRICT to disallow the 4 categories.. The pre-5.14 behavior of utf8n_to_utf8() with respect to the 4 categories is to allow the supers without complaint, but to disallow the other three unless a flag was set otherwise. However, it only knew about 1 of the 66 non-characters, 0xFFFF, so it mistakenly allowed the other 65 through without complaint. Encode realizes that the supers are allowed through, and so has special code to disallow them when strictness is required, but the 65 noncharacters are allowed through. The 5.14 behavior is to know about all these. Thus the special code to disallow supers is not needed when run under 5.14. And, for the first time, under strict utf8ness, all the non-characters will correctly be excluded, not just the one. Note, though, that under non-strict rules, accepting the non-character U+FFEF, is considered a security hole, as it can apparently be used to make a program think the byte ordering of the machine is the opposite of what it really is. I don't know it there have been successful attacks using this mechanism. This patch closes this potential vulnerability under strict utf8; and does not address it under non-strict. Second, for Unicode.xs, it calls both the affected functions. In the single call to utf8n_to_uvuni(), it currently uses 0 for the flags, which causes the pre-5.14 function to not allow surrogates nor the very large supers, nor the single non-character 0xFFFF; and to warn if any of these conditions are found, provided UTF8 warnings are not lexically disabled. This patch keeps this behavior except it now allows 0xFFFF, which brings that code point into sync with the other 65 points; but all 66 will warn. If it's desired to exclude all 66, the flag UTF8_DISALLOW_NONCHAR can be passed. Note that supers are continued to be accepted without complaint. The single call to uvuni_to_utf8_flags() currently uses 0 for the flags, which causes the pre-5.14 function to accept all 4 categories, but to warn, if enabled, on all 4. The pre-5.14 function knows about all 66 noncharacters. The patch just changes the flags to get the identical behavior in 5.14. --- cpan/Encode/Encode.xs | 7 ++++++- cpan/Encode/Unicode/Unicode.xs | 19 +++++++++++++++++-- 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/cpan/Encode/Encode.xs b/cpan/Encode/Encode.xs index 1a672d6..ffaddb7 100644 --- a/cpan/Encode/Encode.xs +++ b/cpan/Encode/Encode.xs @@ -29,7 +29,12 @@ UNIMPLEMENTED(_encoded_utf8_to_bytes, I32) UNIMPLEMENTED(_encoded_bytes_to_utf8, I32) -#define UTF8_ALLOW_STRICT 0 +#ifdef UTF8_DISALLOW_ILLEGAL_INTERCHANGE +# define UTF8_ALLOW_STRICT UTF8_DISALLOW_ILLEGAL_INTERCHANGE +#else +# define UTF8_ALLOW_STRICT 0 +#endif + #define UTF8_ALLOW_NONSTRICT (UTF8_ALLOW_ANY & \ ~(UTF8_ALLOW_CONTINUATION | \ UTF8_ALLOW_NON_CONTINUATION | \ diff --git a/cpan/Encode/Unicode/Unicode.xs b/cpan/Encode/Unicode/Unicode.xs index 9741626..61d25a2 100644 --- a/cpan/Encode/Unicode/Unicode.xs +++ b/cpan/Encode/Unicode/Unicode.xs @@ -18,6 +18,16 @@ #define isLoSurrogate(x) (0xDC00 <= (x) && (x) <= 0xDFFF ) #define invalid_ucs2(x) ( issurrogate(x) || 0xFFFF < (x) ) +/* For pre-5.14 source compatibility */ +#ifndef UNICODE_WARN_ILLEGAL_INTERCHANGE +# define UNICODE_WARN_ILLEGAL_INTERCHANGE 0 +# define UTF8_DISALLOW_SURROGATE 0 +# define UTF8_WARN_SURROGATE 0 +# define UTF8_DISALLOW_FE_FF 0 +# define UTF8_WARN_FE_FF 0 +# define UTF8_WARN_NONCHAR 0 +#endif + #define PERLIO_BUFSIZ 1024 /* XXX value comes from PerlIOEncode_get_base */ /* Avoid wasting too much space in the result buffer */ @@ -255,7 +265,8 @@ CODE: resultbuflen = SvLEN(result); } - d = uvuni_to_utf8_flags(resultbuf+SvCUR(result), ord, 0); + d = uvuni_to_utf8_flags(resultbuf+SvCUR(result), ord, + UNICODE_WARN_ILLEGAL_INTERCHANGE); SvCUR_set(result, d - (U8 *)SvPVX(result)); } @@ -323,7 +334,11 @@ CODE: } while (s < e && s+UTF8SKIP(s) <= e) { STRLEN len; - UV ord = utf8n_to_uvuni(s, e-s, &len, 0); + UV ord = utf8n_to_uvuni(s, e-s, &len, (UTF8_DISALLOW_SURROGATE + |UTF8_WARN_SURROGATE + |UTF8_DISALLOW_FE_FF + |UTF8_WARN_FE_FF + |UTF8_WARN_NONCHAR)); s += len; if (size != 4 && invalid_ucs2(ord)) { if (!issurrogate(ord)) { -- 1.5.6.3

Fri Dec 31 17:56:12 2010 DANKOGAI [...] cpan.org - Correspondence added

khw, A happy new year. Thank you so much. Your patch is important enough to $Encode::VERSION++. Repositories updated. PAUSE submitted. Dan the Maintainer Thereof On Fri Dec 31 16:29:28 2010, khw wrote: Show quoted text

> Hi Dan, > > The attached patch changes two .xs files to compensate for the planned > changes in the core API for 5.14. After much discussion on p5p, the > consensus is that to fix various misbehaviors with utf8, the best > solution is to change the API somewhat. This solution requires the > fewest source code changes. In fact, the attached changes are the only > ones required in all of CPAN. > > Whether changes to Encode should go first through you, or first through > blead has not been discussed on p5p. A third option is first-come, > first-served, so that either your or p5p can create patches. If you > have an opinion on this, please let me or the p5p email list know. > > The patches try to keep the external behavior as close as possible to > the current one; I've tried to give a full explanation in the commit > message. > > The severity is 'critical' because we can't make the changes to blead > until these are applied. It would be good to get these into the next > 5.13.x release of blead, which means it would need to be applied within > the next couple weeks. If you would rather I do that, I can, even if > you would rather keep control in general. > > Thanks for maintaining Encode!

Fri Dec 31 17:56:12 2010 The RT System itself - Status changed from 'new' to 'open'

Fri Dec 31 17:56:13 2010 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'

Sat Jan 01 12:10:47 2011 khw [...] cpan.org - Correspondence added

Subject:	Re: [rt.cpan.org #64371] Update for 5.14 API changes
Date:	Sat, 01 Jan 2011 10:10:28 -0700
To:	bug-Encode [...] rt.cpan.org
From:	Karl WIlliamson <corporate [...] khwilliamson.com>

Dan Kogai via RT wrote: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=64371 > > > khw, > > A happy new year. Thank you so much. Your patch is important enough to > $Encode::VERSION++. Repositories updated. PAUSE submitted.

It's new year here now. Thanks for the lightning quick turn around. I'm going to submit two more patches that are unrelated to this issue and have much lower priority, so can wait to be considered at your leisure.

Sat Jan 01 12:10:48 2011 The RT System itself - Status changed from 'resolved' to 'open'

Sat May 21 18:15:11 2011 DANKOGAI [...] cpan.org - Correspondence added

Ouch. The reply unintentionally reopened the ticket. As you know it's in 2.42 so closing now. Dan the Maintainer Thereof On Sat Jan 01 12:10:47 2011, khw wrote: Show quoted text

> Dan Kogai via RT wrote:

> > <URL: https://rt.cpan.org/Ticket/Display.html?id=64371 > > > > > khw, > > > > A happy new year. Thank you so much. Your patch is important enough to > > $Encode::VERSION++. Repositories updated. PAUSE submitted.

> > It's new year here now. Thanks for the lightning quick turn around. > I'm going to submit two more patches that are unrelated to this issue > and have much lower priority, so can wait to be considered at your leisure.

Sat May 21 18:15:12 2011 DANKOGAI [...] cpan.org - Status changed from 'open' to 'resolved'