Subject: HTML::Parser can't handle certain large characters
HTML::Parser apparently has trouble with some strings that have the UTF-8 flag set when the UTF-8 encoding of a character contains the byte 0xA0. I believe the cause is that 0xA0 is marked as a space in hctype.h, and that the code steps over space characters at several points. When the input is UTF-8, this splits a multi-byte sequence, so a partial UTF-8 character is passed along to other methods. The problem can be fixed by modifying hctype.h so that character 160 (0xA0) is not treated as a space, but I'm uncertain of the other consequences of that change.
The following code demonstrates the problem. Note that the only character it has a problem with is \x{0420}, whose UTF-8 encoding (0xD0 0xA0) includes the byte 0xA0.
#!perl
use strict;
use HTML::Parser;

my $prsr = HTML::Parser->new;
my $htmltxt = <<EOF;
<html lang="en">
<head>
<title>Minimal HTML Document</title>
</head>
<body>
<p>This is a Russian letter: \x{041E}</p>
<p>This is another Russian letter: \x{041F}</p>
<p>And another: \x{0420}</p>
<p>And another: \x{0421}</p>
<p>And another: \x{0422}</p>
</body>
</html>
EOF
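# Feed the document to the parser one character at a time so that any
# warning can be attributed to the character that triggered it.
for my $c (split(//, $htmltxt)) {
    local $SIG{__WARN__} = sub {
        printf STDERR "Character %04x:\n", ord($c);
        print STDERR @_;
    };
    $prsr->parse($c);
}
$prsr->eof;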
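
The claim that \x{0420} is the only one of the five test characters whose UTF-8 encoding contains 0xA0 can be checked independently of HTML::Parser. The following is a minimal sketch using the core Encode module; the helper name has_a0_byte is made up for this illustration and is not part of HTML::Parser.

```perl
#!perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of HTML::Parser): returns 1 if the UTF-8
# encoding of the given code point contains the byte 0xA0, otherwise 0.
sub has_a0_byte {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return index($bytes, "\xA0") >= 0 ? 1 : 0;
}

# Print the UTF-8 bytes of each character used in the test document.
for my $cp (0x041E .. 0x0422) {
    my $hex = join ' ',
        map { sprintf '%02X', ord } split //, encode('UTF-8', chr($cp));
    printf "U+%04X => %s%s\n", $cp, $hex,
        has_a0_byte($cp) ? '  <-- contains 0xA0' : '';
}
```

Of the five code points, only U+0420 (bytes D0 A0) should be flagged, which matches the single character that triggers the warning in the script above.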