This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 15068
Status: resolved
Priority: 0
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: martin [...] snowplow.org
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 3.45
Fixed in: (no value)



Subject: HTML::Parser can't handle certain large characters
HTML::Parser apparently has trouble with some strings that have the UTF-8 flag set when the UTF-8 encoding of a character contains the byte 0xA0. I believe this is caused by the fact that 0xA0 is marked as a space in hctype.h, and that at several points in the code space characters are stepped over. When processing UTF-8 data, this leads to a partial UTF-8 character being passed along to other methods.

This problem can be fixed by modifying hctype.h so that character 160 is not a space, but I'm uncertain of the other consequences of that change.

The following code demonstrates the problem. Note that the only character it has a problem with is \x{0420}, whose UTF-8 encoding includes the byte 0xA0.

#!perl
use HTML::Parser;
use strict;

my $prsr = HTML::Parser->new;
my $htmltxt = <<EOF;
<html lang="en">
<head>
<title>Minimal HTML Document</title>
</head>
<body>
<p>This is a Russian letter: \x{041E}</p>
<p>This is another Russian letter: \x{041F}</p>
<p>And another: \x{0420}</p>
<p>And another: \x{0421}</p>
<p>And another: \x{0422}</p>
</body>
</html>
EOF

for my $c (split(//,$htmltxt)) {
    local $SIG{__WARN__} = sub {
        printf STDERR 'Character %04x%s',ord($c),":\n";
        print STDERR @_;
    };
    $prsr->parse($c);
}
$prsr->eof;
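To see why only one of the five letters in the report is affected, it may help to look at the UTF-8 bytes of each code point. The following is a minimal sketch and is not part of the original report or of HTML::Parser. The helper name utf8_hex_bytes is hypothetical, and the script uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a code point
# as a list of two-digit uppercase hex strings.
sub utf8_hex_bytes {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

# The five Cyrillic letters used in the report's test document.
for my $cp (0x041E .. 0x0422) {
    my @hex  = utf8_hex_bytes($cp);
    my $flag = (grep { $_ eq 'A0' } @hex) ? '  <-- contains 0xA0' : '';
    printf "U+%04X => %s%s\n", $cp, join(' ', @hex), $flag;
}
```

Of these five code points, only U+0420 (encoded as D0 A0) has 0xA0 as one of its bytes, which is consistent with it being the only character that triggers the warning.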
This problem is now fixed in CVS. \xA0 is no longer considered a space character.