This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 19478
Status: rejected
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: Ben.Evans [...] morganstanley.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 3.51
Fixed in: (no value)



Subject: HTML-Parser does not recognise <meta http-equiv=""> for charsets
HTML::Parser does not seem to be compatible with non-Western encodings when the encoding is specified via a <meta> tag. A good way to reproduce this is with HTML in the ISO-2022-JP charset (see the attached sample HTML).

The issue is that ISO-2022-JP, when in one of its Japanese modes, may contain a byte of value 60 (ASCII '<') as part of a 2-byte character. If the parser is not charset-aware, Japanese text is silently swallowed into a broken tag name. This probably continues until the next byte of value 62 ('>') or the end of the line, where a sane HTML parser would likely decide it has been fed seriously mangled tag soup and reset for the next line. That is the observed behaviour of HTML::Parser.

The main use case is HTML parsing when it is not known ahead of time which charset the HTML is written in.

Behaviour demonstrated on perl 5.8.4 on Red Hat Linux AS 3.0.
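
The byte-level claim can be checked with a small sketch, assuming Perl's Encode module and its `iso-2022-jp` encoding. The sample character (U+30FC, the katakana prolonged sound mark) is chosen here for illustration and is not taken from the attached file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text: U+30FC, the katakana prolonged sound mark.
my $char  = "\x{30FC}";
my $bytes = encode('iso-2022-jp', $char);

# Print each byte in hex so the escape sequences and the payload are visible.
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";

# 1 when the encoded form contains byte 60 (0x3C, ASCII '<'), else 0.
my $has_lt = index($bytes, '<') >= 0 ? 1 : 0;
print "contains 0x3C: $has_lt\n";
```

A parser that scans these raw bytes for '<' has no way to tell this payload byte from the start of a tag.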
Subject: hy-decode2.html

test$B%F%9%H(B

$B5!

$BH>3Q%+%?%+%J(B

 

The text to be parsed needs to be decoded before it's passed to HTML::Parser. Use the Encode module to achieve that.
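
A minimal sketch of that advice, assuming the caller already knows the charset label (here `iso-2022-jp`). The helper name `collect_text` and the sample markup are hypothetical and exist only for this illustration. Discovering the charset from a <meta> tag is a separate step that this sketch does not attempt.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::Parser ();

# Hypothetical sample: a paragraph holding three katakana characters.
my $original = "<p>\x{30C7}\x{30FC}\x{30BF}</p>";

# Raw ISO-2022-JP bytes, standing in for data read from a file or socket.
my $bytes = encode('iso-2022-jp', $original);

# Hypothetical helper: parse the input and return the concatenated text.
sub collect_text {
    my ($input) = @_;
    my $text = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $p->parse($input);
    $p->eof;
    return $text;
}

# Decode the bytes into a Perl character string first, then parse.
my $chars = decode('iso-2022-jp', $bytes);
my $text  = collect_text($chars);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

After decoding, the parser sees characters rather than bytes, so a '<' only appears where the markup really contains one.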