Bug #19478 for HTML-Parser: HTML-Parser does not recognise <meta http-equiv=""> for charsets

Thu May 25 11:07:54 2006 Guest - Ticket created

Subject:

HTML-Parser does not recognise <meta http-equiv=""> for charsets

HTML::Parser does not appear to handle non-Western encodings when the encoding is specified via a <meta> tag. A good way to reproduce this is with HTML in the ISO-2022-JP charset; see the attached sample HTML. The issue is that ISO-2022-JP, when in one of its Japanese modes, may contain a byte of value 60 (ASCII '<') as part of a 2-byte character. If the parser is not charset-aware, Japanese text is silently swallowed into a broken tag name. This probably continues until the next byte of value 62 ('>') or the end of the line, where a sane HTML parser would decide it has been fed seriously mangled tag soup and reset for the next line; that is the observed behaviour of HTML::Parser. The main use case is parsing HTML when it is not known ahead of time which charset the HTML is written in. Behaviour demonstrated on perl 5.8.4 on Red Hat Linux AS 3.0.
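The byte collision described above can be checked directly. What follows is a minimal sketch in Python (not Perl, and not using HTML::Parser). The sample character "ぜ" is an assumption chosen for illustration and is not taken from the attached file.

```python
# The hiragana "ぜ" is an illustrative choice, not taken from the attachment.
text = "ぜ"

# Encode to ISO-2022-JP: ESC $ B switches to JIS X 0208, ESC ( B back to ASCII.
encoded = text.encode("iso2022_jp")

# Strip the two 3-byte escape sequences to leave the 2-byte character itself.
payload = encoded[3:-3]

has_lt_in_text = "<" in text        # the decoded text contains no '<'
has_lt_in_bytes = 0x3C in encoded   # the raw bytes do contain byte value 60

print(encoded, payload, has_lt_in_text, has_lt_in_bytes)
```

A parser that scans the raw bytes for byte value 60 would therefore see a tag opening in the middle of this character, even though the text itself contains no markup.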

Subject:

hy-decode2.html

test$B%F%9%H(B

$B5!

$BH>3Q%+%?%+%J(B

Mon Nov 17 04:59:13 2008 GAAS [...] cpan.org - Correspondence added

The text to be parsed needs to be decoded before it is passed to HTML::Parser. Use the Encode module to do that.
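As an illustration of the decode-first approach, here is a minimal sketch in Python rather than Perl. Python's `html.parser.HTMLParser` stands in for HTML::Parser, and the `TextCollector` class and the sample markup are hypothetical. In Perl the corresponding step would presumably be a call to `Encode::decode` on the raw octets before handing the result to the parser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records start tags and text chunks."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


# Raw ISO-2022-JP bytes as they might arrive from a file or socket.
raw = "<p>ぜんぶ</p>".encode("iso2022_jp")

# Decode to characters first, then parse; the parser never sees byte 60
# from inside a two-byte character.
parser = TextCollector()
parser.feed(raw.decode("iso2022_jp"))
parser.close()

print(parser.tags, "".join(parser.text))
```

This sketch assumes the charset is already known. When it is not, the caller still has to determine it (for example from an HTTP header or a <meta> tag) before decoding.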

Mon Nov 17 04:59:16 2008 The RT System itself - Status changed from 'new' to 'open'

Mon Nov 17 04:59:18 2008 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'