Bug #15068 for HTML-Parser: HTML::Parser can't handle certain large characters

Fri Oct 14 23:04:26 2005 Guest - Ticket created

Subject:

HTML::Parser can't handle certain large characters

HTML::Parser apparently has trouble with some strings that have the UTF-8 flag set if the UTF-8 expansion contains the byte 0xA0. I believe this is because 0xA0 is marked as a space in hctype.h, and at several points in the code space characters are stepped over. Unfortunately, when processing UTF-8 data, this leads to a partial UTF-8 character being passed along to other methods.

This problem can be fixed by modifying hctype.h so that character 160 is not a space, but I'm uncertain of the other consequences of that change.

The following code demonstrates the problem. Note that the only character it has a problem with is \x{0420}, which includes an 0xA0 in its UTF-8 expansion.

#!perl
use HTML::Parser;
use strict;

my $prsr = HTML::Parser->new;
my $htmltxt = <<EOF;
<html lang="en">
<head>
<title>Minimal HTML Document</title>
</head>
<body>
This is a Russian letter: \x{041E}
This is another Russian letter: \x{041F}
And another: \x{0420}
And another: \x{0421}
And another: \x{0422}
</body>
</html>
EOF

# Feed the document one character at a time and report which
# character was being parsed when a warning was raised.
for my $c (split(//,$htmltxt)) {
    local $SIG{__WARN__} = sub {
        printf STDERR 'Character %04x%s',ord($c),":\n";
        print STDERR @_;
    };
    $prsr->parse($c);
}
$prsr->eof;
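To check the claim that only \x{0420} contains 0xA0 in its UTF-8 expansion, the bytes of each test character can be listed. This is a minimal sketch using the core Encode module; the helper name utf8_hex is made up for illustration and is not part of HTML::Parser.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a code point
# as space-separated uppercase hex, e.g. "D0 A0".
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = encode('UTF-8', chr($cp));
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# List the five Cyrillic letters used in the test document and
# flag any whose encoding contains the byte 0xA0.
for my $cp (0x041E .. 0x0422) {
    my $hex  = utf8_hex($cp);
    my $flag = $hex =~ /\bA0\b/ ? '  <-- contains 0xA0' : '';
    printf "U+%04X => %s%s\n", $cp, $hex, $flag;
}
```

Of the five letters, only U+0420 (encoded as D0 A0) is flagged, which matches the single character the parser warns about.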

Mon Oct 24 06:11:10 2005 GAAS [...] cpan.org - Correspondence added

This problem is now fixed in CVS. \xA0 is no longer considered space.
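The effect of this change can be illustrated with a simplified model. The sketch below is not HTML::Parser's actual C code; trim_trailing_space_bytes is a hypothetical function that steps over trailing space bytes, with a switch for whether 0xA0 counts as a space.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical model of byte-wise space skipping: drop trailing
# bytes that are classified as space. $a0_is_space selects whether
# 0xA0 belongs to that class (the old behaviour) or not (the fix).
sub trim_trailing_space_bytes {
    my ($bytes, $a0_is_space) = @_;
    my %space = map { $_ => 1 } (0x09, 0x0A, 0x0C, 0x0D, 0x20);
    $space{0xA0} = 1 if $a0_is_space;

    my $end = length $bytes;
    $end-- while $end > 0 && $space{ ord substr($bytes, $end - 1, 1) };
    return substr($bytes, 0, $end);
}

my $bytes = encode('UTF-8', "\x{0420}");    # the two bytes D0 A0

# With 0xA0 treated as a space, the continuation byte is stripped
# and only the lead byte D0 remains: a partial UTF-8 character.
my $old = trim_trailing_space_bytes($bytes, 1);

# With 0xA0 no longer a space, both bytes survive intact.
my $new = trim_trailing_space_bytes($bytes, 0);

printf "old: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $old;
printf "new: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $new;
```

In this model the old classification leaves a lone D0 byte, while the new one keeps the full D0 A0 sequence.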

Mon Oct 24 08:34:28 2005 GAAS [...] cpan.org - Status changed from 'new' to 'resolved'