
This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information
The Basics
Id: 99730
Status: open
Priority: 0/
Queue: HTML-HTML5-Parser

People
Owner: Nobody in particular
Requestors: dr [...] jones.dk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Fwd: Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Date: Wed, 22 Oct 2014 16:28:57 +0200
To: bug-html-html5-parser [...] rt.cpan.org
From: Jonas Smedegaard <dr [...] jones.dk>
Hi,

Someone in Debian ran into the issue below, which seems like a bug in your Perl module:

Forwarded message from Vincent Lefevre (2014-06-08 21:03:03):
> Package: libhtml-html5-parser-perl
> Version: 0.301-1
> Severity: important
>
> (with possible data loss as a consequence)
>
> Consider the following HTML file:
>
> <?xml version="1.0" encoding="utf-8"?>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> <title>title</title>
> </head>
> <body>
> <p>↓</p>
> </body>
> </html>
>
> On this file, the following script
>
> #!/usr/bin/env perl
>
> use strict;
> use HTML::HTML5::Parser;
>
> use utf8; # for the characters in the script.
> use open ':encoding(UTF-8)'; # for the file arguments.
> binmode STDIN, ':encoding(UTF-8)'; # for stdin.
> binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
>
> @ARGV == 1 or die "Usage: $0 <file.html>\n";
>
> my $parser = HTML::HTML5::Parser->new;
> my $doc = $parser->parse_file($ARGV[0]);
> print "Charset: '", $parser->charset($doc), "'\n";
> print $doc->toString();
>
> outputs:
>
> Charset: ''
> <?xml version="1.0" encoding="windows-1252"?>
> <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
>
> If I replace the ↓ (U+2193 DOWNWARDS ARROW) by é (U+00E9 LATIN SMALL
> LETTER E WITH ACUTE), then I get:
>
> Charset: 'utf-8'
> <?xml version="1.0" encoding="utf-8"?>
> <!--?xml version="1.0" encoding="utf-8"?-->
> <html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head>
> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
> <title>title</title>
> </head>
> <body>
> <p>�</p>
>
>
> </body></html>
>
> which is also incorrect, but at least the charset is correct.
>
> -- System Information:
> Debian Release: jessie/sid
> APT prefers unstable
> APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
> Architecture: amd64 (x86_64)
> Foreign Architectures: i386
>
> Kernel: Linux 3.11-2-amd64 (SMP w/2 CPU cores)
> Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/dash
>
> Versions of packages libhtml-html5-parser-perl depends on:
> ii libhtml-html5-entities-perl 0.003-2
> ii libio-html-perl 1.00-1
> ii libtry-tiny-perl 0.22-1
> ii liburi-perl 1.60-1
> ii libxml-libxml-perl 2.0116+dfsg-1
> ii perl 5.18.2-4
> ii perl-modules [libhttp-tiny-perl] 5.18.2-4
>
> libhtml-html5-parser-perl recommends no packages.
>
> Versions of packages libhtml-html5-parser-perl suggests:
> pn libxml-libxml-devel-setlinenumber-perl <none>
>
> -- no debconf information
>
> _______________________________________________
> pkg-perl-maintainers mailing list
> pkg-perl-maintainers@lists.alioth.debian.org
> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-perl-maintainers
--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/

[x] quote me freely [ ] ask before reusing [ ] keep private

From: vincent [...] vinc17.net
Note that I already reported the bug: https://rt.cpan.org/Public/Bug/Display.html?id=96399 which now has additional details (and the Debian bug was already forwarded to this bug).
Subject: Re: [rt.cpan.org #99730] Fwd: Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Date: Thu, 23 Oct 2014 13:08:02 +0200
To: bug-HTML-HTML5-Parser [...] rt.cpan.org
From: Jonas Smedegaard <dr [...] jones.dk>
Quoting vincent@vinc17.net via RT (2014-10-23 09:45:24)
> <URL: https://rt.cpan.org/Ticket/Display.html?id=99730 >
>
> Note that I already reported the bug:
>
> https://rt.cpan.org/Public/Bug/Display.html?id=96399
>
> which now has additional details (and the Debian bug was already
> forwarded to this bug).
Oh, silly me - I thought I'd double-checked that, but evidently not :-P

Sorry Toby for the noise,

 - Jonas

--
* Jonas Smedegaard - idealist & Internet-arkitekt
* Tlf.: +45 40843136 Website: http://dr.jones.dk/

[x] quote me freely [ ] ask before reusing [ ] keep private