Subject: Fwd: Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Date: Wed, 22 Oct 2014 16:28:57 +0200
To: bug-html-html5-parser [...] rt.cpan.org
From: Jonas Smedegaard <dr [...] jones.dk>
Hi,
Someone in Debian ran into the issue below, which looks like a bug in
your Perl module:
Forwarded message from Vincent Lefevre (2014-06-08 21:03:03):
> Package: libhtml-html5-parser-perl
> Version: 0.301-1
> Severity: important
>
> (with possible data loss as a consequence)
>
> Consider the following HTML file:
>
> <?xml version="1.0" encoding="utf-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> <title>title</title>
> </head>
> <body>
> <p>↓</p>
> </body>
> </html>
>
> On this file, the following script
>
> #!/usr/bin/env perl
>
> use strict;
> use HTML::HTML5::Parser;
>
> use utf8; # for the characters in the script.
> use open ':encoding(UTF-8)'; # for the file arguments.
> binmode STDIN, ':encoding(UTF-8)'; # for stdin.
> binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
>
> @ARGV == 1 or die "Usage: $0 <file.html>\n";
>
> my $parser = HTML::HTML5::Parser->new;
> my $doc = $parser->parse_file($ARGV[0]);
> print "Charset: '", $parser->charset($doc), "'\n";
> print $doc->toString();
>
> outputs:
>
> Charset: ''
> <?xml version="1.0" encoding="windows-1252"?>
> <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
>
> If I replace the ↓ (U+2193 DOWNWARDS ARROW) by é (U+00E9 LATIN SMALL
> LETTER E WITH ACUTE), then I get:
>
> Charset: 'utf-8'
> <?xml version="1.0" encoding="utf-8"?>
> <!--?xml version="1.0" encoding="utf-8"?-->
> <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head>
> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
> <title>title</title>
> </head>
> <body>
> <p>�</p>
>
>
> </body></html>
>
> which is also incorrect, but at least the charset is correct.
>
> -- System Information:
> Debian Release: jessie/sid
> APT prefers unstable
> APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
> Architecture: amd64 (x86_64)
> Foreign Architectures: i386
>
> Kernel: Linux 3.11-2-amd64 (SMP w/2 CPU cores)
> Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/dash
>
> Versions of packages libhtml-html5-parser-perl depends on:
> ii libhtml-html5-entities-perl 0.003-2
> ii libio-html-perl 1.00-1
> ii libtry-tiny-perl 0.22-1
> ii liburi-perl 1.60-1
> ii libxml-libxml-perl 2.0116+dfsg-1
> ii perl 5.18.2-4
> ii perl-modules [libhttp-tiny-perl] 5.18.2-4
>
> libhtml-html5-parser-perl recommends no packages.
>
> Versions of packages libhtml-html5-parser-perl suggests:
> pn libxml-libxml-devel-setlinenumber-perl <none>
>
> -- no debconf information
>
> _______________________________________________
> pkg-perl-maintainers mailing list
> pkg-perl-maintainers@lists.alioth.debian.org
> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-perl-maintainers
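
For reference, a minimal sketch (not part of the original report, and not a confirmed diagnosis) of one way the two test characters differ: é (U+00E9) has a code point that fits in a single byte, while ↓ (U+2193) does not. Whether this difference is what triggers the two behaviours in the parser is an assumption that would need checking against the module's source. The helper name describe_char is made up for illustration; only core Perl and the Encode module are used.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                            # the literals below are UTF-8 in this source
use Encode qw(encode);

binmode STDOUT, ':encoding(UTF-8)';  # print the characters themselves correctly

# Hypothetical helper: report the code point, the UTF-8 byte sequence (hex),
# and whether the code point is small enough to fit in a single byte.
sub describe_char {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return {
        codepoint => sprintf('U+%04X', ord $char),
        utf8_hex  => join(' ', map { sprintf '%02X', ord } split //, $bytes),
        fits_byte => ord($char) <= 0xFF ? 1 : 0,
    };
}

# The two characters used in the report.
for my $char ('↓', 'é') {
    my $info = describe_char($char);
    printf "%s  %s  UTF-8 bytes: %s  fits in one byte: %s\n",
        $char, $info->{codepoint}, $info->{utf8_hex},
        $info->{fits_byte} ? 'yes' : 'no';
}
```

Run as-is, it prints one line per character with its code point and UTF-8 byte sequence (E2 86 93 for ↓, C3 A9 for é).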
--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/
[x] quote me freely [ ] ask before reusing [ ] keep private