Subject: | Update for 5.14 API changes |
Hi Dan,
The attached patch changes two .xs files to compensate for the planned
changes in the core API for 5.14. After much discussion on p5p, the
consensus is that to fix various misbehaviors with utf8, the best
solution is to change the API somewhat. This solution requires the
fewest source code changes. In fact, the attached changes are the only
ones required in all of CPAN.
Whether changes to Encode should go first through you, or first through
blead has not been discussed on p5p. A third option is first-come,
first-served, so that either your or p5p can create patches. If you
have an opinion on this, please let me or the p5p email list know.
The patches try to keep the external behavior as close as possible to
the current one; I've tried to give a full explanation in the commit
message.
The severity is 'critical' because we can't make the changes to blead
until these are applied. It would be good to get these into the next
5.13.x release of blead, which means it would need to be applied within
the next couple weeks. If you would rather I do that, I can, even if
you would rather keep control in general.
Thanks for maintaining Encode!
Subject: | 0002-Update-for-core-5.14-API-changes.patch |
From a8678e5edb9e8ddcdaaa291aa906c8308b6a9ec2 Mon Sep 17 00:00:00 2001
From: Karl Williamson <public@khwilliamson.com>
Date: Fri, 31 Dec 2010 11:12:51 -0700
Subject: [PATCH] Update for core 5.14 API changes
This patch updates two .xs files to accommodate the changes that are
planned in the Perl 5.14 core, while maintaining backward source
compatibility so that Encode can be used with earlier Perls.
The core is changing so that all code points that fit in a UV are legal
and do not raise warnings or errors when converted to/from utf8, except
during input and output. This means that a 0 passed in the flags fields to
the routines that do this conversion means to accept any UVs instead of
forbidding all potentially problematic ones. That means that any code
that doesn't want to accept certain categories must change to use new
flags that specify that behavior. Since Encode is used for I/O, it
needs to change so that it can continue to check for these potentially
problematic code points. The categories that are changing are the
surrogates, the non-character code points (such as 0xFFFF, 0xFFFE, ...),
the code points above the legal Unicode maximum of 0x10FFFF (called
supers), and the code points way above that maximum, above what was the
original utf8 limit of 0x7FFFFFFF.
The two affected functions are utf8n_to_uvuni() and
uvuni_to_utf8_flags().
Both .xs files are changed to have #ifdefs and #defines so they can work
on Perls which don't have the latest flags #defined.
Besides that, here are the changes and their rationales.
First, for Encode.xs, it only calls one of the affected functions,
utf8n_to_utf8(). One of the calls uses the flag UTF8_ALLOW_ANY.
That flag will continue to have the same meaning as currently, so no
change is needed here. The other two calls use the Encode #defines
UTF8_ALLOW_STRICT and UTF8_ALLOW_NONSTRICT. The effect of the first
define is to exclude the 4 categories. The effect of the other is to
allow them. The only needed changes to the file are to adjust the
definitions of these to get as close to the existing behavior as
desirable. The definition of nonstrict doesn't have to change, so the
only change needed is to UTF8_ALLOW_STRICT to disallow the 4
categories.. The pre-5.14 behavior of utf8n_to_utf8() with respect to
the 4 categories is to allow the supers without complaint, but to
disallow the other three unless a flag was set otherwise. However, it
only knew about 1 of the 66 non-characters, 0xFFFF, so it mistakenly
allowed the other 65 through without complaint. Encode realizes that
the supers are allowed through, and so has special code to disallow them
when strictness is required, but the 65 noncharacters are allowed
through. The 5.14 behavior is to know about all these. Thus the
special code to disallow supers is not needed when run under 5.14. And,
for the first time, under strict utf8ness, all the non-characters will
correctly be excluded, not just the one. Note, though, that under
non-strict rules, accepting the non-character U+FFEF, is considered a
security hole, as it can apparently be used to make a program think the
byte ordering of the machine is the opposite of what it really is. I
don't know it there have been successful attacks using this mechanism.
This patch closes this potential vulnerability under strict utf8; and
does not address it under non-strict.
Second, for Unicode.xs, it calls both the affected functions. In the
single call to utf8n_to_uvuni(), it currently uses 0 for the flags,
which causes the pre-5.14 function to not allow surrogates nor the very
large supers, nor the single non-character 0xFFFF; and to warn if any of
these conditions are found, provided UTF8 warnings are not lexically
disabled. This patch keeps this behavior except it now allows 0xFFFF,
which brings that code point into sync with the other 65 points; but all
66 will warn. If it's desired to exclude all 66, the flag
UTF8_DISALLOW_NONCHAR can be passed. Note that supers are continued to
be accepted without complaint.
The single call to uvuni_to_utf8_flags() currently uses 0 for the flags,
which causes the pre-5.14 function to accept all 4 categories, but to
warn, if enabled, on all 4. The pre-5.14 function knows about all 66
noncharacters. The patch just changes the flags to get the identical
behavior in 5.14.
---
cpan/Encode/Encode.xs | 7 ++++++-
cpan/Encode/Unicode/Unicode.xs | 19 +++++++++++++++++--
2 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/cpan/Encode/Encode.xs b/cpan/Encode/Encode.xs
index 1a672d6..ffaddb7 100644
--- a/cpan/Encode/Encode.xs
+++ b/cpan/Encode/Encode.xs
@@ -29,7 +29,12 @@
UNIMPLEMENTED(_encoded_utf8_to_bytes, I32)
UNIMPLEMENTED(_encoded_bytes_to_utf8, I32)
-#define UTF8_ALLOW_STRICT 0
+#ifdef UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+# define UTF8_ALLOW_STRICT UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+#else
+# define UTF8_ALLOW_STRICT 0
+#endif
+
#define UTF8_ALLOW_NONSTRICT (UTF8_ALLOW_ANY & \
~(UTF8_ALLOW_CONTINUATION | \
UTF8_ALLOW_NON_CONTINUATION | \
diff --git a/cpan/Encode/Unicode/Unicode.xs b/cpan/Encode/Unicode/Unicode.xs
index 9741626..61d25a2 100644
--- a/cpan/Encode/Unicode/Unicode.xs
+++ b/cpan/Encode/Unicode/Unicode.xs
@@ -18,6 +18,16 @@
#define isLoSurrogate(x) (0xDC00 <= (x) && (x) <= 0xDFFF )
#define invalid_ucs2(x) ( issurrogate(x) || 0xFFFF < (x) )
+/* For pre-5.14 source compatibility */
+#ifndef UNICODE_WARN_ILLEGAL_INTERCHANGE
+# define UNICODE_WARN_ILLEGAL_INTERCHANGE 0
+# define UTF8_DISALLOW_SURROGATE 0
+# define UTF8_WARN_SURROGATE 0
+# define UTF8_DISALLOW_FE_FF 0
+# define UTF8_WARN_FE_FF 0
+# define UTF8_WARN_NONCHAR 0
+#endif
+
#define PERLIO_BUFSIZ 1024 /* XXX value comes from PerlIOEncode_get_base */
/* Avoid wasting too much space in the result buffer */
@@ -255,7 +265,8 @@ CODE:
resultbuflen = SvLEN(result);
}
- d = uvuni_to_utf8_flags(resultbuf+SvCUR(result), ord, 0);
+ d = uvuni_to_utf8_flags(resultbuf+SvCUR(result), ord,
+ UNICODE_WARN_ILLEGAL_INTERCHANGE);
SvCUR_set(result, d - (U8 *)SvPVX(result));
}
@@ -323,7 +334,11 @@ CODE:
}
while (s < e && s+UTF8SKIP(s) <= e) {
STRLEN len;
- UV ord = utf8n_to_uvuni(s, e-s, &len, 0);
+ UV ord = utf8n_to_uvuni(s, e-s, &len, (UTF8_DISALLOW_SURROGATE
+ |UTF8_WARN_SURROGATE
+ |UTF8_DISALLOW_FE_FF
+ |UTF8_WARN_FE_FF
+ |UTF8_WARN_NONCHAR));
s += len;
if (size != 4 && invalid_ucs2(ord)) {
if (!issurrogate(ord)) {
--
1.5.6.3