Bug #96399 for HTML-HTML5-Parser: UTF-8 character confuses the parser


This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information

The Basics

Id:	96399
Status:	new
Priority:	0
Queue:	HTML-HTML5-Parser

People

Owner:	Nobody in particular
Requestors:	vincent [...] vinc17.net
Cc:
AdminCc:

Bug Information

Severity:	Important
Broken in:	0.301
Fixed in:	(no value)

History

Thu Jun 12 06:54:42 2014 vincent [...] vinc17.net - Ticket created

Subject:

UTF-8 character confuses the parser

Bug I've reported on https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946

Consider the following HTML file:

    <?xml version="1.0" encoding="utf-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>title</title>
    </head>
    <body>
    <p>↓</p>
    </body>
    </html>

On this file, the following script

    #!/usr/bin/env perl

    use strict;
    use HTML::HTML5::Parser;

    use utf8;                             # for the characters in the script.
    use open ':encoding(UTF-8)';          # for the file arguments.
    binmode STDIN, ':encoding(UTF-8)';    # for stdin.
    binmode STDOUT, ':encoding(UTF-8)';   # for stdout.

    @ARGV == 1 or die "Usage: $0 <file.html>\n";

    my $parser = HTML::HTML5::Parser->new;
    my $doc = $parser->parse_file($ARGV[0]);
    print "Charset: '", $parser->charset($doc), "'\n";
    print $doc->toString();

outputs:

    Charset: ''
    <?xml version="1.0" encoding="windows-1252"?>
    <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

If I replace the ↓ (U+2193 DOWNWARDS ARROW) by é (U+00E9 LATIN SMALL LETTER E WITH ACUTE), then the encoding is correctly detected.

Wed Oct 22 08:18:56 2014 vincent [...] vinc17.net - Correspondence added

From:

vincent [...] vinc17.net

As a consequence of this bug, html2xhtml doesn't work at all when applied on a file. No problems when the HTML document is provided in the standard input, though.

For instance, with test.html as:

    <!DOCTYPE html>
    <html><body><p>Test €</p></body></html>

I get:

    $ html2xhtml test.html
    <?xml version="1.0" encoding="windows-1252"?>
    <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

    $ html2xhtml < test.html
    <?xml version="1.0" encoding="utf-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test €</p>
    </body></html>

and with test.html as:

    <!DOCTYPE html>
    <html><body><p>Test é</p></body></html>

    $ html2xhtml test.html
    <?xml version="1.0" encoding="utf-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test �</p>
    </body></html>

    $ html2xhtml < test.html
    <?xml version="1.0" encoding="utf-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml"><head/><body><p>Test é</p>
    </body></html>

parse_file is used in the former test (like in my original bug report), and parse_string is used in the latter test. Thus it seems that it's parse_file that is broken.
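
Since the standard-input path (parse_string) gives correct results in the tests above, one possible workaround is to read the file in the calling script and hand its contents to parse_string, bypassing parse_file. The following is only a minimal sketch, not a fix for the module. The helper names read_raw_file and parse_file_via_string are hypothetical, and the sketch assumes that parse_string treats the raw bytes of the file the same way it treats the bytes read from standard input.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the file contents as raw, undecoded bytes,
# i.e. the same kind of data the parser receives on standard input.
sub read_raw_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path)
        or die "Cannot open $path: $!\n";
    local $/;                  # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Hypothetical helper: parse a file through parse_string instead of
# parse_file. Assumption: parse_string detects the encoding of the raw
# bytes, as it appears to do in the stdin tests reported above.
sub parse_file_via_string {
    my ($path) = @_;
    require HTML::HTML5::Parser;   # loaded here so read_raw_file works alone
    my $parser = HTML::HTML5::Parser->new;
    return $parser->parse_string(read_raw_file($path));
}

# Only run when a file name is given on the command line.
if (@ARGV == 1) {
    binmode STDOUT, ':encoding(UTF-8)';
    print parse_file_via_string($ARGV[0])->toString();
}
```

Saved under a name of your choice and run with test.html as its only argument, this should take the same code path as "html2xhtml < test.html". Whether the result matches on every input has not been verified here.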