Subject: A bug in the Perl module
Date: Wed, 23 Nov 2016 10:38:12 +0200
To: bug-HTML-HTML5-Parser [...] rt.cpan.org
From: Dmitry Korolyov <d.koroliov [...] gmail.com>
Good day. First of all, I would like to thank you for this useful and
indispensable module. However, I think I have found a bug in it (though I'm
not sure whether it is really a bug or my own mistake).
The module seems to handle HTML entities incorrectly, or at least one entity:
&nbsp;. When I parse a string (whether read from a file or passed directly
from a variable), the module converts &nbsp; to the character itself, but
apparently in the ISO-8859-1 encoding, which is then handled as UTF-8 by the
module itself. So when I get the parsed string, I have 'Â ' instead of the
nbsp character.
Here is an example script:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HTML5::Parser qw();
my $raw_str = '<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>a bug report</title>
</head>
<body>
<div>
error&nbsp;conditions</div>
</body>
</html>';
my $parsed_str =
    HTML::HTML5::Parser->new->parse_string($raw_str, {encoding => 'utf-8'});
open(my $fh, '>:encoding(UTF-8)', 'bug-html5-parser.html')
    or die "Cannot open bug-html5-parser.html: $!";
print $fh $parsed_str;
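
The 'Â ' symptom can also be sketched without the parser, using only the core
Encode module. This is a minimal illustration under an assumption, not a
confirmed diagnosis of the module: it assumes that the UTF-8 bytes of the nbsp
character are read back as ISO-8859-1 somewhere between parsing and output.
The variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 is the no-break space that the &nbsp; entity stands for.
my $nbsp = "\x{A0}";

# Encoded as UTF-8, that character becomes the two bytes 0xC2 0xA0.
my $utf8_bytes = encode('UTF-8', $nbsp);

# Assumption: those bytes get interpreted as ISO-8859-1 at some point.
# In ISO-8859-1, 0xC2 is 'Â' (U+00C2) and 0xA0 is the no-break space.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Show the resulting code points.
print join(' ', map { sprintf 'U+%04X', ord } split //, $misread), "\n";
# prints: U+00C2 U+00A0
```

If this is what happens, one character has become two, which matches the
'Â ' that I see in the output file.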
--
I am using Perl v5.20.2 on the Ubuntu 15.04 distro.
Thank you,
D.Koroliov.