Subject: | HTML-Parser does not recognise <meta http-equiv=""> for charsets |
HTML::Parser does not appear to handle non-Western encodings correctly
when the encoding is declared via a <meta> tag.
A good way to reproduce this is with HTML in the ISO-2022-JP charset - see
the attached sample HTML.
The issue is that ISO-2022-JP, when in one of its Japanese modes, may
contain a byte of value 60 (ASCII '<') as part of a 2-byte character.
If the parser is not charset-aware, the Japanese text is silently
swallowed into a broken tag name. This probably continues until the next
byte of value 62 ('>') or end of line, at which point a sane HTML parser
would presumably decide it has been fed seriously mangled tag soup and
reset for the next line. That is the observed behaviour of HTML::Parser.
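
To make the byte-level problem concrete, here is a minimal sketch in
Python (used only as a language-neutral illustration, not as a statement
about HTML::Parser internals). The helper name markup_bytes and the sample
word are assumptions made for this example and are not taken from the
attached file.

```python
def markup_bytes(text, charset="iso2022_jp"):
    """Encode text and return (encoded, offsets), where offsets lists the
    positions of bytes equal to ASCII '<' (0x3C) or '>' (0x3E)."""
    encoded = text.encode(charset)
    offsets = [i for i, b in enumerate(encoded) if b in (0x3C, 0x3E)]
    return encoded, offsets


# Hypothetical sample: the katakana word "zero" (U+30BC U+30ED).
# The text itself contains no angle brackets at all.
sample = "\u30bc\u30ed"
encoded, offsets = markup_bytes(sample)

print(encoded)   # raw ISO-2022-JP bytes, including the escape sequences
print(offsets)   # positions a byte-oriented parser could mistake for markup
```

A parser scanning these bytes without decoding them first would see a
'<' that is really the second byte of a 2-byte character.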
The main use case for this would be parsing HTML whose charset is not
known ahead of time.
Behaviour demonstrated on perl 5.8.4 on Red Hat Linux AS 3.0.
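
For the use case of HTML in an unknown charset, the behaviour being asked
for could be sketched roughly as follows: look for a charset declaration
in the raw bytes, decode, and only then hand the text to the parser. This
is a Python sketch using the standard library, not HTML::Parser code. The
names sniff_charset, TextCollector and parse_html_bytes, the 1024-byte
prescan window and the iso-8859-1 fallback are all assumptions made for
illustration. It also assumes the <meta> tag appears in plain ASCII before
any multi-byte text.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: a charset=... value anywhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def sniff_charset(raw, default="iso-8859-1"):
    """Return the charset named in a <meta> tag near the start of raw."""
    match = META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii") if match else default


class TextCollector(HTMLParser):
    """Collects non-blank text nodes so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_html_bytes(raw):
    """Decode raw using the declared charset, then parse the decoded text."""
    charset = sniff_charset(raw)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to a single-byte decoding.
        text = raw.decode("iso-8859-1")
    parser = TextCollector()
    parser.feed(text)
    parser.close()
    return charset, parser.chunks
```

Because the parser only ever sees decoded characters, a 0x3C byte inside
a 2-byte character can no longer be mistaken for the start of a tag.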
Subject: | hy-decode2.html |
test$B%F%9%H(B
$B5!
$BH>3Q%+%?%+%J(B