
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 7014
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: jgmyers [...] proofpoint.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 3.36
Fixed in: (no value)



Subject: multiple bugs handling non-ASCII characters
HTML-Parser fails to handle non-ASCII characters in the HTML file being parsed. With the exception of decode_entities(), it neither examines nor copies the UTF8 flag. After decoding a Unicode entity, decode_entities() in UNICODE_ENTITIES mode fails to convert the ISO-8859-1 text that follows to UTF-8, leading to a result that is not utf8::valid(). hparser.c has hash lookup code that is not UTF-8 safe. The attached patch fixes all of this.
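
For readers unfamiliar with the UTF8 flag and utf8::valid() mentioned in this report, here is a minimal sketch using only core Perl functions. It does not call HTML::Parser and does not reproduce the bug, which lives in the module's C code; it only shows what Perl itself does when an ISO-8859-1 string is appended to a flagged string. The helper name describe() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report a string's UTF8 flag, validity, and length.
sub describe {
    my ($s) = @_;
    return sprintf "flag=%d valid=%d length=%d",
        (utf8::is_utf8($s) ? 1 : 0),
        (utf8::valid($s)   ? 1 : 0),
        length $s;
}

my $latin1 = "\xe9";           # e-acute stored as one ISO-8859-1 byte, flag off
my $wide   = "\x{263A}";       # code point above 255, so the flag is on
my $joined = $wide . $latin1;  # Perl upgrades $latin1 during concatenation

print describe($latin1), "\n";
print describe($wide),   "\n";
print describe($joined), "\n";
```

Code that appends raw ISO-8859-1 bytes to a flagged buffer without this upgrade step is the kind of mismatch the report describes.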

Message body is not shown because it is too large.

Date: Wed, 21 Jul 2004 17:53:52 -0700
From: John Gardiner Myers <jgmyers [...] proofpoint.com>
To: bug-HTML-Parser [...] rt.cpan.org
Subject: Re: [cpan #7014] AutoReply: multiple bugs handling non-ASCII characters
RT-Send-Cc:
With the previous patch applied, one can remove one of the documented bugs.

diff -ru HTML-Parser-3.36/Parser.pm HTML-Parser-3.36-work/Parser.pm
--- HTML-Parser-3.36/Parser.pm 2004-04-01 04:05:52.000000000 -0800
+++ HTML-Parser-3.36-work/Parser.pm 2004-07-21 15:32:57.000000000 -0700
@@ -996,10 +996,6 @@
 =head1 BUGS
 
-Unicode strings are not parsed correctly. A workaround is to encode
-them as UTF-8 before passing them to the HTML::Parser. The C<Encode>
-module can do that.
-
 The <style> and <script> sections do not end with the first "</", but
 need the complete corresponding end tag. MSIE avoids terminating a
 <script> section if the </script> occurs inside quotes. HTML::Parser
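
The workaround named in the removed BUGS text (encode to UTF-8 before parsing) might look like the following minimal sketch. It assumes HTML::Parser and Encode are installed; the function name extract_text() is hypothetical, and the 'text' argspec is chosen deliberately so that no entity decoding happens on the encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::Parser ();

# Hypothetical helper: return the text content of $html (a character string).
sub extract_text {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'text' passes the raw segment; decode it back to characters here.
        text_h => [ sub { push @chunks, decode('UTF-8', $_[0]) }, 'text' ],
    );
    $parser->unbroken_text(1);               # avoid splitting a text run
    $parser->parse(encode('UTF-8', $html));  # the parser only sees bytes
    $parser->eof;
    return join '', @chunks;
}

binmode STDOUT, ':encoding(UTF-8)';
print extract_text("<title>caf\x{e9} \x{263A}</title>"), "\n";
```

With the patch from this ticket applied, the encode and decode steps should no longer be needed for Unicode input.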
From: Tom Insam
The original patch modified an auto-generated file. I've removed that change from the patch and integrated the documentation patch from the previous comment. The result applies cleanly and passes tests for me on Darwin (Mac OS X 10.3).

Message body is not shown because it is too large.

From: Tom Insam
Also, I have a test case.
BEGIN {
    if ($] < 5.006) {
        print "1..0 # skipped: This perl does not support Unicode\n";
        exit;
    }
}

use warnings;
use strict;
use Encode qw( is_utf8 decode );
use HTML::Parser;

print "1..2\n";

my $utf8_string = decode('utf8', "\x{c3}\x{a9}"); # e-acute
$utf8_string = "<title>$utf8_string</title>";

# this string is UTF8 at the moment.
print "not " unless Encode::is_utf8($utf8_string);
print "ok 1\n";

my $parser = HTML::Parser->new;
$parser->handler(
    text => sub {
        my (undef, $text, undef) = @_;
        # We expect the text parsed out of the HTML to still be UTF8.
        print "not " unless Encode::is_utf8($text);
        print "ok 2\n";
    }
);
$parser->parse($utf8_string);
Subject: Revised fix
From: jgmyers [...] proofpoint.com
The previous patch had an uninitialized variable which would in some situations cause the result to be gratuitously upgraded to utf8.

Message body is not shown because it is too large.

I have now uploaded HTML-Parser-3.39_90 with the proposed patch in it. Please give it a spin.
From: jgmyers [...] proofpoint.com
Remove completed TODO item.
diff -ru HTML-Parser-3.3990-orig/TODO HTML-Parser-3.3990/TODO
--- HTML-Parser-3.3990-orig/TODO 2003-08-15 09:47:03.000000000 -0700
+++ HTML-Parser-3.3990/TODO 2004-11-17 11:03:45.000000000 -0800
@@ -3,8 +3,6 @@
  - limit the length of markup elements that never end. Perhaps by
    configurable limits on the length that markup can have and still
    be recongnized. Report stuff as 'text' when this happens?
- - unicode support (when parsing Unicode strings the strings reported
-   in callbacks should also be Unicode strings).
  - remove 255 char limit on literal argspec strings
  - implement backslash escapes in literal argspec string
  - <![%app1;[...]]> (parameter entities)
Only in HTML-Parser-3.3990: TODO~