
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 17901
Status: rejected
Priority: 0
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: ralphbolton [...] mail2sexy.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: 3.50



Subject: HTML::Entities misses at least one Unicode (high bit) Character
I think I've found a problem which causes HTML::Entities to miss an entity when encoding (both numeric and normal). I've attached a TGZ that includes a small snippet of malformed UTF8 and a small test that demonstrates the problem. Here's how I'd show it:

% tar xvf missedentity.tgz
% ./go.pl > out
% vi out

The "out" file will contain:

Einar [Aacute]gú Frið

Of course, the [Aacute] should have been encoded. I know this is easy to say, and very annoying, but given this entity is missing, how many others may also be missing?

My system details:
Redhat Fedora 4
Perl 5.8.6
HTML::Parser 3.50
HTML::Entities 1.32
Subject: missedentity.tgz
application/x-gzip 451b


The file you are reading is Latin-1, not UTF-8. If you change your open() statement to reflect this, the result is as expected.
--- go.pl.orig 2006-03-21 12:46:24.000000000 +0100
+++ go.pl 2006-03-21 12:46:40.000000000 +0100
@@ -5,7 +5,7 @@
 use strict;
 use warnings;
-unless(open(FILE,"<:utf8","dodgytext"))
+unless(open(FILE,"<:encoding(latin1)","dodgytext"))
 {
     die "Could not open file: $!\n";
 }
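
For anyone without the attachment, the effect of the patched open() can be sketched with an in-memory file. This is only an illustration under assumptions: the sample string and the variable names are made up here, not taken from missedentity.tgz or go.pl. It reads Latin-1 bytes through the :encoding(latin1) layer and then passes the decoded text to encode_entities and encode_entities_numeric.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities encode_entities_numeric);

# Hypothetical stand-in for the "dodgytext" file: raw Latin-1 bytes, where
# 0xC1 is A-acute and 0xFA is u-acute. This byte sequence is not valid UTF-8.
my $bytes = "Einar \xC1g\xFAst\n";

# Read the bytes through the Latin-1 layer, as in the patched open() call,
# using an in-memory filehandle instead of a file on disk.
open(my $fh, '<:encoding(latin1)', \$bytes)
    or die "Could not open in-memory file: $!\n";
my $line = <$fh>;
close($fh);
chomp $line;

# Once the text is decoded correctly, both encoders see real characters.
my $named   = encode_entities($line);
my $numeric = encode_entities_numeric($line);

print "$named\n";    # Einar &Aacute;g&uacute;st
print "$numeric\n";  # Einar &#xC1;g&#xFA;st
```

The point is that the input layer has to match the file's actual encoding; the entity tables themselves are not involved in the problem.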