Subject: multiple bugs handling non-ASCII characters
HTML-Parser does not correctly handle non-ASCII characters in the HTML file being parsed. Apart from decode_entities(), it neither examines nor copies the UTF8 flag. After decoding a Unicode entity, decode_entities() in UNICODE_ENTITIES mode fails to convert ISO-8859-1 characters to UTF-8, so the result is not utf8::valid(). The hash lookup code in hparser.c is also not UTF8 safe.
The attached patch fixes all this.
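To illustrate the ISO-8859-1 to UTF-8 conversion that decode_entities() skips, here is a minimal sketch in C. It is not taken from the attached patch or from the HTML-Parser sources; latin1_to_utf8 is a hypothetical helper name, and the caller is assumed to supply an output buffer of at least 2 * len bytes. The point is only that once a string is flagged UTF8, every ISO-8859-1 byte at or above 0x80 has to become a two-byte sequence; a bare high byte left in place is malformed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of HTML-Parser): re-encode the
 * ISO-8859-1 bytes src[0..len) as UTF-8 into dst and return the
 * number of bytes written.  dst must hold at least 2 * len bytes.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t len,
                      unsigned char *dst)
{
    size_t out = 0;
    size_t i;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            /* ASCII is identical in both encodings. */
            dst[out++] = c;
        } else {
            /* U+0080..U+00FF: lead byte 110xxxxx, then 10xxxxxx. */
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

For example, the ISO-8859-1 byte 0xE9 (e with acute accent) becomes the two bytes 0xC3 0xA9, while ASCII bytes pass through unchanged.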