
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 7785
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: jgmyers [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 3.36
Fixed in: (no value)



Subject: Warning messages when parsing questionable entities
When parsing the text:

&#56256;&#56453;

one gets warnings:

UTF-16 surrogate 0xdbc0 at [...]
UTF-16 surrogate 0xdc85 at [...]

There are two issues here.

One, while this encoding is highly questionable, it would be good if it could be interpreted the same as "􀂅" (U+100085).

Two, when there is an unpaired surrogate or an illegal character (such as "") there should be no warning. It probably should interpret all such junk as � (U+FFFD), "REPLACEMENT CHARACTER".
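The pairing asked for in this report is the standard UTF-16 surrogate arithmetic. As a minimal sketch (the helper names `is_high_surrogate`, `is_low_surrogate` and `combine_surrogates` are hypothetical and not part of HTML-Parser), the two values from the warnings, 0xDBC0 and 0xDC85, combine into U+100085 like this:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not HTML-Parser API. */

/* High (lead) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint32_t cp)
{
    return (cp & 0xFFFFFC00u) == 0xD800u;
}

/* Low (trail) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint32_t cp)
{
    return (cp & 0xFFFFFC00u) == 0xDC00u;
}

/* Combine a valid high/low pair into one supplementary code point.
 * The caller is assumed to have checked both halves first. */
static uint32_t combine_surrogates(uint32_t high, uint32_t low)
{
    return ((high - 0xD800u) << 10) + (low - 0xDC00u) + 0x10000u;
}
```

Worked through by hand: (0xDBC0 - 0xD800) << 10 is 0xF0000, 0xDC85 - 0xDC00 is 0x85, and adding 0x10000 gives 0x100085.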
From: jgmyers [...] proofpoint.com
Proposed fix
diff -u HTML-Parser-3.36-utf8/util.c HTML-Parser-3.36-work/util.c
--- HTML-Parser-3.36-utf8/util.c 2004-09-27 19:01:40.000000000 -0700
+++ HTML-Parser-3.36-work/util.c 2004-11-01 14:15:38.000000000 -0800
@@ -76,6 +76,7 @@
 #ifdef UNICODE_ENTITIES
     char buf[UTF8_MAXLEN];
     int repl_utf8;
+    int high_surrogate = 0;
 #else
     char buf[1];
 #endif
@@ -133,7 +134,30 @@
                     repl_utf8 = 0;
                 }
                 else {
-                    char *tmp = uvuni_to_utf8(buf, num);
+                    char *tmp;
+                    if ((num & 0xFFFFFC00) == 0xDC00) {
+                        if (high_surrogate != 0) {
+                            t -= 3; /* Back up past 0xFFFD */
+                            num = ((high_surrogate - 0xD800) << 10) +
+                                (num - 0xDC00) + 0x10000;
+                        } else {
+                            num = 0xFFFD;
+                        }
+                    }
+
+                    if ((num & 0xFFFFFC00) == 0xD800) {
+                        high_surrogate = num;
+                        num = 0xFFFD;
+                    }
+                    else {
+                        high_surrogate = 0;
+                    }
+
+                    if ((num >= 0xFDD0 && num <= 0xFDEF) ||
+                        ((num & 0xFFFE) == 0xFFFE)) {
+                        num = 0xFFFD;
+                    }
+                    tmp = uvuni_to_utf8(buf, num);
                     repl = buf;
                     repl_len = tmp - buf;
                     repl_utf8 = 1;
@@ -160,6 +184,9 @@
 #endif
                 }
             }
+#ifdef UNICODE_ENTITIES
+            high_surrogate = 0;
+#endif
         }
 
         if (repl) {
@@ -169,6 +196,10 @@
             t--; /* '&' already copied, undo it */
 
 #ifdef UNICODE_ENTITIES
+            if (*s != '&') {
+                high_surrogate = 0;
+            }
+
             if (!SvUTF8(sv) && repl_utf8) {
                 STRLEN len = t - SvPVX(sv);
                 if (len) {
Only in HTML-Parser-3.36-work/: util.c~
Why did you make the chars in the range (num >= 0xFDD0 && num <= 0xFDEF) get replaced? Treating ((num & 0xFFFE) == 0xFFFE) as illegal is wrong, as it matches 0x10FFFF and similar. Perl itself has the same bug.
For now I've applied this modification of your patch.
Index: util.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/util.c,v
retrieving revision 2.18
retrieving revision 2.19
diff -u -p -u -r2.18 -r2.19
--- util.c 14 Sep 2004 13:47:16 -0000 2.18
+++ util.c 8 Nov 2004 12:54:57 -0000 2.19
@@ -1,4 +1,4 @@
-/* $Id: util.c,v 2.18 2004/09/14 13:47:16 gisle Exp $
+/* $Id: util.c,v 2.19 2004/11/08 12:54:57 gisle Exp $
  *
  * Copyright 1999-2001, Gisle Aas.
  *
@@ -76,6 +76,7 @@ decode_entities(pTHX_ SV* sv, HV* entity
 #ifdef UNICODE_ENTITIES
     char buf[UTF8_MAXLEN];
     int repl_utf8;
+    int high_surrogate = 0;
 #else
     char buf[1];
 #endif
@@ -138,7 +139,30 @@ decode_entities(pTHX_ SV* sv, HV* entity
                     repl_utf8 = 0;
                 }
                 else {
-                    char *tmp = uvuni_to_utf8(buf, num);
+                    char *tmp;
+                    if ((num & 0xFFFFFC00) == 0xDC00) { /* low-surrogate */
+                        if (high_surrogate != 0) {
+                            t -= 3; /* Back up past 0xFFFD */
+                            num = ((high_surrogate - 0xD800) << 10) +
+                                (num - 0xDC00) + 0x10000;
+                            high_surrogate = 0;
+                        } else {
+                            num = 0xFFFD;
+                        }
+                    }
+                    else if ((num & 0xFFFFFC00) == 0xD800) { /* high-surrogate */
+                        high_surrogate = num;
+                        num = 0xFFFD;
+                    }
+                    else {
+                        high_surrogate = 0;
+                        /* otherwise invalid? */
+                        if (num == 0xFFFE || num == 0xFFFF || num > 0x1F0000) {
+                            num = 0xFFFD;
+                        }
+                    }
+
+                    tmp = uvuni_to_utf8(buf, num);
                     repl = buf;
                     repl_len = tmp - buf;
                     repl_utf8 = 1;
@@ -165,6 +189,9 @@ decode_entities(pTHX_ SV* sv, HV* entity
 #endif
                 }
             }
+#ifdef UNICODE_ENTITIES
+            high_surrogate = 0;
+#endif
         }
 
         if (repl) {
@@ -174,6 +201,10 @@ decode_entities(pTHX_ SV* sv, HV* entity
             t--; /* '&' already copied, undo it */
 
 #ifdef UNICODE_ENTITIES
+            if (*s != '&') {
+                high_surrogate = 0;
+            }
+
             if (!SvUTF8(sv) && repl_utf8) {
                 STRLEN len = t - SvPVX(sv);
                 if (len) {
Index: t/uentities.t
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/t/uentities.t,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -p -u -r1.6 -r1.7
--- t/uentities.t 3 Oct 2003 14:50:08 -0000 1.6
+++ t/uentities.t 8 Nov 2004 12:55:06 -0000 1.7
@@ -14,7 +14,7 @@ unless (&HTML::Entities::UNICODE_SUPPORT
     exit;
 }
 
-print "1..10\n";
+print "1..13\n";
 
 print "not " unless decode_entities("&euro") eq "\x{20AC}";
 print "ok 1\n";
@@ -25,18 +25,18 @@ print "ok 2\n";
 print "not " unless decode_entities("&#500000") eq chr(500000);
 print "ok 3\n";
 
-{
-    no warnings 'utf8'; # These are illegal unicode chars
-    print "not " unless decode_entities("&#xFFFF") eq "\x{FFFF}";
-    print "ok 4\n";
-
-    print "not " unless decode_entities("&#x10FFFF") eq chr(0x10FFFF);
-    print "ok 5\n";
+print "not " unless decode_entities("&#xFFFF") eq "\x{FFFD}";
+print "ok 4\n";
 
-    print "not " unless decode_entities("&#XFFFFFFFF") eq chr(0xFFFF_FFFF);
-    print "ok 6\n";
+{
+    no warnings 'utf8'; # workaround for perl bug
+    print "not " unless decode_entities("&#x10FFFF") eq chr(0x10FFFF);
+    print "ok 5\n";
 }
 
+print "not " unless decode_entities("&#XFFFFFFFF") eq chr(0xFFFD);
+print "ok 6\n";
+
 print "not " unless decode_entities("&#0") eq "\0" &&
     decode_entities("&#0;") eq "\0" &&
     decode_entities("&#x0") eq "\0" &&
@@ -77,3 +77,11 @@ print "not " if $err;
 print "ok 10\n";
 
+print "not " unless decode_entities("&#56256;&#56453;") eq chr(0x100085);
+print "ok 11\n";
+
+print "not " unless decode_entities("&#56256;&#56453;") eq chr(0x100085);
+print "ok 12\n";
+
+print "not " unless decode_entities("&#56256") eq chr(0xFFFD);
+print "ok 13\n";
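To make the control flow of the applied patch easier to follow, here is a rough standalone sketch of the same pairing logic. It assumes the numeric references have already been parsed into an array of values; the function name `pair_surrogates` and the array-based interface are hypothetical. The real code instead works on the output byte buffer, backing up three bytes over the UTF-8 form of U+FFFD, and also clears the pending surrogate when non-entity text intervenes, which this sketch does not model.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not the HTML-Parser implementation.
 * Rewrites cp[0..n-1] in place and returns the new length:
 *   - high surrogate followed directly by a low surrogate -> one code point
 *   - any unpaired surrogate -> U+FFFD
 */
static size_t pair_surrogates(uint32_t *cp, size_t n)
{
    size_t out = 0;
    uint32_t high = 0;  /* pending high surrogate, 0 if none */

    for (size_t i = 0; i < n; i++) {
        uint32_t c = cp[i];

        if ((c & 0xFFFFFC00u) == 0xDC00u) {         /* low surrogate */
            if (high != 0) {
                /* Overwrite the U+FFFD emitted for the pending high half. */
                cp[out - 1] = ((high - 0xD800u) << 10)
                            + (c - 0xDC00u) + 0x10000u;
                high = 0;
                continue;
            }
            c = 0xFFFDu;                            /* unpaired low half */
        } else if ((c & 0xFFFFFC00u) == 0xD800u) {  /* high surrogate */
            high = c;
            c = 0xFFFDu;                            /* provisional output */
        } else {
            high = 0;
        }
        cp[out++] = c;
    }
    return out;
}
```

Emitting U+FFFD for a high surrogate first and overwriting it only when a low surrogate follows means an unpaired high surrogate needs no extra handling at the end of the input.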
[GAAS - Mon Nov 8 08:04:12 2004]:
> Why did you make chars (num >= 0xFDD0 && num <= 0xFDEF) replaced?
>
> Making ((num & 0xFFFE) == 0xFFFE)) illegal is wrong as it
> matches 0x10FFFF and similar. Perl itself has the same bug.
The Unicode book I had was for Unicode 3.0. It looks like Unicode 3.1 does make all of these noncharacters, so I guess Perl is right after all. I'll modify the patch to match.
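For reference, the noncharacter set being discussed is U+FDD0..U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF). A minimal sketch of that test follows; the function name `is_noncharacter` is hypothetical, and this is not the code from the revised patch, which does not appear in this ticket.

```c
#include <stdint.h>

/* Hypothetical illustration of the Unicode noncharacter test.
 * Values above 0x10FFFF are outside the code space and are not
 * classified here; a caller would reject them separately. */
static int is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFFu)
        return 0;
    if (cp >= 0xFDD0u && cp <= 0xFDEFu)
        return 1;
    /* Low 16 bits equal to FFFE or FFFF: last two code points of a plane. */
    return (cp & 0xFFFEu) == 0xFFFEu;
}
```

Under this definition 0x10FFFF is a noncharacter, which is why the mask test in the original patch matches it.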