This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information
The Basics
Id: 118913
Status: new
Priority: 0/
Queue: HTML-HTML5-Parser

People
Owner: perl [...] toby.ink
Requestors: d.koroliov [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: A bug in the Perl module
Date: Wed, 23 Nov 2016 10:38:12 +0200
To: bug-HTML-HTML5-Parser [...] rt.cpan.org
From: Dmitry Korolyov <d.koroliov [...] gmail.com>
Good day.

First of all, I would like to thank you for this useful and indispensable module. However, I have found what looks like a bug in it (though I am not sure whether it is really a bug or my own mistake).

The module seems to handle HTML entities incorrectly, or at least one entity: &nbsp;. When I parse a string (whether read from a file or taken directly from a variable), the module converts &nbsp; to the character itself, but apparently in the ISO-8859-1 encoding, and the result is then treated as UTF-8. As a consequence, the parsed output contains 'Â ' in place of each no-break space character.

Here is an example script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::HTML5::Parser qw();

    my $raw_str = '<!doctype html>
    <html>
    <head>
    <meta charset="utf-8">
    <title>a bug report</title>
    </head>
    <body>
    <div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;error conditions</div>
    </body>
    </html>';

    my $parsed_str = HTML::HTML5::Parser->new->parse_string($raw_str, {encoding => 'utf-8'});

    # Write the parsed result through a UTF-8 encoding layer.
    open(my $fh, '>:encoding(UTF-8)', 'bug-html5-parser.html')
        or die "Cannot open bug-html5-parser.html: $!";
    print $fh $parsed_str;
    close($fh);

I have Perl v5.20.2 on an Ubuntu 15.04 distro.

Thank you,
D.Koroliov.
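
The symptom described above ('Â ' in place of a no-break space) is what appears when the UTF-8 bytes of U+00A0 are read as ISO-8859-1 characters and then encoded to UTF-8 a second time. The following is a minimal sketch of that effect using only the core Encode module. It does not call HTML::HTML5::Parser and is not a diagnosis of where the extra encoding step occurs in the script above; the helper hex_bytes is a hypothetical name introduced for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as lowercase hex.
sub hex_bytes {
    my ($bytes) = @_;
    return unpack('H*', $bytes);
}

# U+00A0 (NO-BREAK SPACE) as a Perl character string.
my $nbsp = "\x{00A0}";

# Encoding once yields the two UTF-8 bytes 0xC2 0xA0.
my $once = encode('UTF-8', $nbsp);

# Misreading those two bytes as ISO-8859-1 produces two characters:
# U+00C2 (A with circumflex) followed by U+00A0.
my $misread = decode('ISO-8859-1', $once);

# Encoding the misread string again yields four bytes, which a UTF-8
# viewer shows as 'Â' followed by a no-break space.
my $twice = encode('UTF-8', $misread);

print "encoded once:  ", hex_bytes($once),  "\n";
print "encoded twice: ", hex_bytes($twice), "\n";
```

If the output file from the report contains the byte sequence c3 82 c2 a0 for each entity, that would be consistent with the text having been encoded as UTF-8 twice somewhere between parsing and writing.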