Bug #118913 for HTML-HTML5-Parser: A bug in the Perl module

Subject:	A bug in the Perl module
Date:	Wed, 23 Nov 2016 10:38:12 +0200
To:	bug-HTML-HTML5-Parser [...] rt.cpan.org
From:	Dmitry Korolyov <d.koroliov [...] gmail.com>

Good day.

First of all, I would like to thank you for this useful and indispensable module. However, I think I have found a bug in it (though I'm not sure whether this is really a bug or my own mistake). The module seems to handle HTML entities incorrectly, or at least one entity: &nbsp;

When I parse a string (whether read from a file or passed directly from a variable), the module converts &nbsp; to the character itself, but in the ISO-8859-1 encoding, which is then handled as UTF-8 by the module itself. So in the parsed string I get 'Â ' instead of the no-break space character.

Here is an example script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::HTML5::Parser qw();

    my $raw_str = '<!doctype html>
    <html>
    <head>
    <meta charset="utf-8">
    <title>a bug report</title>
    </head>
    <body>
    <div>&nbsp;error conditions</div>
    </body>
    </html>';

    my $parsed_str = HTML::HTML5::Parser->new->parse_string($raw_str, {encoding => 'utf-8'});
    open (my $fh, '>:encoding(UTF-8)', 'bug-html5-parser.html');
    print $fh $parsed_str;

--
I have Perl v5.20.2 on the Ubuntu 15.04 distro.

Thank you,
D.Koroliov.
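
For reference, the 'Â ' symptom can be reproduced without the parser at all. The sketch below uses only the core Encode module and shows one way such output can arise: a no-break space that is already encoded as UTF-8 bytes gets encoded a second time. Whether this is what actually happens inside the module or in the script above is an assumption, not something the report confirms; the helper name hex_bytes is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# U+00A0 NO-BREAK SPACE as a Perl character string.
my $nbsp = "\x{00A0}";

# Encoding once gives the correct two-byte UTF-8 sequence.
my $once = encode('UTF-8', $nbsp);

# Reading those two bytes as ISO-8859-1 characters and encoding again
# yields four bytes, which a UTF-8 viewer shows as "Â" plus a no-break space.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

print "encoded once:  ", hex_bytes($once),  "\n";   # C2 A0
print "encoded twice: ", hex_bytes($twice), "\n";   # C3 82 C2 A0
```

If the bytes in the output file match the second line, the character was encoded twice somewhere between parsing and writing, for example by printing an already encoded byte string through an :encoding(UTF-8) file handle.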