Subject: UTF-8 character confuses the parser
I originally reported this bug at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=750946
Consider the following HTML file:
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>title</title>
</head>
<body>
<p>↓</p>
</body>
</html>
Given this file, the following script:
#!/usr/bin/env perl
use strict;
use HTML::HTML5::Parser;
use utf8; # for the characters in the script.
use open ':encoding(UTF-8)'; # for the file arguments.
binmode STDIN, ':encoding(UTF-8)'; # for stdin.
binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
@ARGV == 1 or die "Usage: $0 <file.html>\n";
my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_file($ARGV[0]);
print "Charset: '", $parser->charset($doc), "'\n";
print $doc->toString();
outputs the following, with an empty charset, a windows-1252 declaration, and empty head and body elements:
Charset: ''
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
If I replace the ↓ (U+2193 DOWNWARDS ARROW) with é (U+00E9 LATIN SMALL LETTER E WITH ACUTE), then the encoding is correctly detected.
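To make the difference between the two characters concrete, here is a minimal sketch that prints each one's code point and UTF-8 bytes, and whether it can be encoded in windows-1252 (the encoding the parser fell back to). It uses only the core Encode module. The helper name describe_char is hypothetical and not part of HTML::HTML5::Parser. This only shows how the characters differ and is not a confirmed diagnosis of the bug.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                       # the character literals below are UTF-8 in this source
use Encode qw(encode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of HTML::HTML5::Parser): describe one character.
sub describe_char {
    my ($char) = @_;

    # Hex dump of the character's UTF-8 byte sequence.
    my $utf8_hex = join ' ',
        map { sprintf('%02X', ord($_)) } split //, encode('UTF-8', $char);

    # With FB_QUIET, encode() stops at the first unencodable character and
    # leaves the unprocessed remainder in its (modified) input argument.
    my $rest = $char;
    encode('cp1252', $rest, Encode::FB_QUIET());
    my $in_cp1252 = length($rest) == 0 ? 'yes' : 'no';

    return sprintf 'U+%04X utf8=[%s] cp1252=%s', ord($char), $utf8_hex, $in_cp1252;
}

print describe_char($_), "\n" for "é", "↓";
```

This should print:

U+00E9 utf8=[C3 A9] cp1252=yes
U+2193 utf8=[E2 86 93] cp1252=no

So é fits in a single byte in windows-1252 (and in Latin-1), while ↓ has no windows-1252 representation and is a wide character once decoded. That may be relevant to why only the latter triggers the problem.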