Bug #7014 for HTML-Parser: multiple bugs handling non-ASCII characters

Mon Jul 19 19:30:13 2004 Guest - Ticket created

Subject:	multiple bugs handling non-ASCII characters

HTML-Parser fails to handle non-ASCII characters in the HTML file being parsed. With the exception of decode_entities(), it neither examines nor copies the UTF8 flag. Following a Unicode entity, decode_entities() in UNICODE_ENTITIES mode fails to convert ISO-8859-1 to UTF-8, leading to a result that is not utf8::valid(). hparser.c has hash lookup code that is not UTF8 safe. The attached patch fixes all of this.
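To illustrate the class of failure described above (a result that is not utf8::valid()), here is a minimal Perl-level sketch. It is not the XS code from HTML-Parser or from the patch; it only shows, using core functions, what happens when ISO-8859-1 bytes end up in a string that carries the UTF8 flag without having been converted. The variable names are made up for the illustration, and Encode::_utf8_on() is used only to force the inconsistent state.

```perl
use strict;
use warnings;
use Encode ();

# A Latin-1 byte string: "caf" followed by byte 0xE9 (e-acute in ISO-8859-1).
my $latin1 = "caf\xe9";

# Wrong: turn the UTF8 flag on without converting the bytes.
# A lone 0xE9 is not a complete UTF-8 sequence, so the string is malformed.
my $flag_only = $latin1;
Encode::_utf8_on($flag_only);

# Right: utf8::upgrade() rewrites 0xE9 as the two bytes 0xC3 0xA9
# and sets the flag, so the string stays consistent.
my $upgraded = $latin1;
utf8::upgrade($upgraded);

printf "flag only: valid=%s\n", utf8::valid($flag_only) ? "yes" : "no";
printf "upgraded:  valid=%s\n", utf8::valid($upgraded)  ? "yes" : "no";
printf "upgraded equals original: %s\n", $upgraded eq $latin1 ? "yes" : "no";
```

The point of the sketch is that any code which appends Latin-1 bytes to a UTF8-flagged buffer has to convert them first, which is the conversion the report says decode_entities() skips after a Unicode entity.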

Message body is not shown because it is too large.

Wed Jul 21 21:49:42 2004 jgmyers [...] proofpoint.com - Correspondence added

Date:	Wed, 21 Jul 2004 17:53:52 -0700
From:	John Gardiner Myers <jgmyers [...] proofpoint.com>
To:	bug-HTML-Parser [...] rt.cpan.org
Subject:	Re: [cpan #7014] AutoReply: multiple bugs handling non-ASCII characters
RT-Send-Cc:

With the previous patch applied, one can remove one of the documented bugs.

diff -ru HTML-Parser-3.36/Parser.pm HTML-Parser-3.36-work/Parser.pm
--- HTML-Parser-3.36/Parser.pm	2004-04-01 04:05:52.000000000 -0800
+++ HTML-Parser-3.36-work/Parser.pm	2004-07-21 15:32:57.000000000 -0700
@@ -996,10 +996,6 @@
 
 =head1 BUGS
 
-Unicode strings are not parsed correctly. A workaround is to encode
-them as UTF-8 before passing them to the HTML::Parser. The C<Encode>
-module can do that.
-
 The <style> and <script> sections do not end with the first "</", but
 need the complete corresponding end tag. MSIE avoids terminating a
 <script> section if the </script> occurs inside quotes. HTML::Parser
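For readers still on an unpatched version, the workaround named in the removed BUGS text (encode the string as UTF-8 with Encode before handing it to the parser) can be sketched roughly as follows. To keep the example self-contained, title_bytes() is a hypothetical regex-based stand-in for a byte-oriented parser, not part of HTML::Parser, and the final decode step is an assumption about how one would get character strings back; the documentation text only mentions the encode step.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a byte-oriented parser: returns the bytes
# between <title> and </title>, or undef if there is no title.
sub title_bytes {
    my ($html_bytes) = @_;
    return $html_bytes =~ m{<title>(.*?)</title>}s ? $1 : undef;
}

# A character string containing U+00E9 and U+263A.
my $html = "<title>caf\x{e9} \x{263a}</title>";

# Step 1: encode to UTF-8 bytes before handing the string over.
my $html_bytes = encode('UTF-8', $html);

# Step 2: run the byte-oriented code on the encoded input.
my $raw = title_bytes($html_bytes);

# Step 3 (assumption): decode the result to get characters again.
my $title = decode('UTF-8', $raw);

printf "bytes flagged=%d, title flagged=%d, title length=%d\n",
    (utf8::is_utf8($html_bytes) ? 1 : 0),
    (utf8::is_utf8($title)      ? 1 : 0),
    length $title;
```

With a real parser the decode step would live inside each callback, which is the bookkeeping the patch is meant to make unnecessary.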

Fri Sep 03 10:14:24 2004 TOMI [...] cpan.org - Correspondence added

From:	Tom Insam

The original patch modified an auto-generated file. I've removed that change from the patch and integrated the documentation change from the previous comment. The result applies cleanly and passes tests for me on Darwin (Mac OS X 10.3).

Message body is not shown because it is too large.

Fri Sep 03 10:15:25 2004 TOMI [...] cpan.org - Correspondence added

From:	Tom Insam

Also, I have a test case.

BEGIN {
    if ($] < 5.006) {
        print "1..0 # skipped: This perl does not support Unicode\n";
        exit;
    }
}

use warnings;
use strict;
use Encode qw( is_utf8 decode );
use HTML::Parser;

print "1..2\n";

my $utf8_string = decode('utf8', "\x{c3}\x{a9}"); # e-acute
$utf8_string = "<title>$utf8_string</title>";

# this string is UTF8 at the moment.
print "not " unless Encode::is_utf8($utf8_string);
print "ok 1\n";

my $parser = HTML::Parser->new;
$parser->handler(
    text => sub {
        my (undef, $text, undef) = @_;
        # We expect the text parsed out of the HTML to still be UTF8.
        print "not " unless Encode::is_utf8($text);
        print "ok 2\n";
    }
);
$parser->parse($utf8_string);

Tue Nov 02 13:53:00 2004 Guest - Correspondence added

Subject:	Revised fix
From:	jgmyers [...] proofpoint.com

The previous patch had an uninitialized variable which would in some situations cause the result to be gratuitously upgraded to utf8.
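As a rough illustration of what "gratuitously upgraded" means at the Perl level (this is not code from the patch, and the variable names are invented for the example): an upgraded string holds the same characters as the original byte string, but its internal representation and UTF8 flag differ, which callers that inspect the flag or the raw byte length can observe.

```perl
use strict;
use warnings;

my $bytes    = "caf\xe9";      # four bytes, UTF8 flag off
my $upgraded = $bytes;
utf8::upgrade($upgraded);      # same characters, UTF-8 internal form, flag on

# The two strings compare equal as character strings...
my $same_chars = $bytes eq $upgraded;

# ...but their internal byte lengths and flags differ.
my $raw_len_bytes    = do { use bytes; length $bytes };
my $raw_len_upgraded = do { use bytes; length $upgraded };

printf "equal=%s flags=%d/%d raw lengths=%d/%d\n",
    ($same_chars ? "yes" : "no"),
    (utf8::is_utf8($bytes)    ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0),
    $raw_len_bytes, $raw_len_upgraded;
```

A regression check for this kind of bug could therefore assert that text reported for a plain byte-string input does not come back with the UTF8 flag set.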

Message body is not shown because it is too large.

Wed Nov 17 09:49:18 2004 GAAS [...] cpan.org - Correspondence added

I have now uploaded HTML-Parser-3.39_90 with the proposed patch in it. Please give it a spin.

Wed Nov 17 14:05:35 2004 Guest - Correspondence added

From:	jgmyers [...] proofpoint.com

Remove completed TODO item.

diff -ru HTML-Parser-3.3990-orig/TODO HTML-Parser-3.3990/TODO
--- HTML-Parser-3.3990-orig/TODO	2003-08-15 09:47:03.000000000 -0700
+++ HTML-Parser-3.3990/TODO	2004-11-17 11:03:45.000000000 -0800
@@ -3,8 +3,6 @@
  - limit the length of markup elements that never end. Perhaps by
    configurable limits on the length that markup can have and still
    be recongnized. Report stuff as 'text' when this happens?
- - unicode support (when parsing Unicode strings the strings reported
-   in callbacks should also be Unicode strings).
  - remove 255 char limit on literal argspec strings
  - implement backslash escapes in literal argspec string
  - <![%app1;[...]]> (parameter entities)
Only in HTML-Parser-3.3990: TODO~

Mon Nov 29 08:51:54 2004 GAAS [...] cpan.org - Status changed from 'new' to 'resolved'