Bug #17901 for HTML-Parser: HTML::Entities misses at least one Unicode (high bit) Character

Tue Feb 28 15:17:46 2006 Guest - Ticket created

Subject:

HTML::Entities misses at least one Unicode (high bit) Character

I think I've found a problem which causes HTML::Entities to miss an entity when encoding (both numeric and normal). I've attached a TGZ that includes a small snippet of malformed UTF8 and a small test that demonstrates the problem. Here's how I'd show it:

% tar xvf missedentity.tgz
% ./go.pl > out
% vi out

The "out" file will contain:

Einar [Aacute]gú Frið

Of course, the [Aacute] should have been encoded. I know this is easy to say, and very annoying, but given this entity is missing, how many others may also be missing?

My system details:

Redhat Fedora 4
Perl 5.8.6
HTML::Parser 3.50
HTML::Entities 1.32

Subject:

missedentity.tgz

application/x-gzip 451b


Tue Mar 21 06:48:42 2006 GAAS [...] cpan.org - Correspondence added

The file you are reading is Latin-1, not UTF-8. If you change your open() statement to reflect this, the result is as expected.

--- go.pl.orig 2006-03-21 12:46:24.000000000 +0100
+++ go.pl 2006-03-21 12:46:40.000000000 +0100
@@ -5,7 +5,7 @@
 use strict;
 use warnings;
 
-unless(open(FILE,"<:utf8","dodgytext"))
+unless(open(FILE,"<:encoding(latin1)","dodgytext"))
 {
     die "Could not open file: $!\n";
 }
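For anyone without the attachment, here is a minimal self-contained sketch of the same idea. It is not the original go.pl: the sample bytes are an assumption reconstructed from the output quoted in the report, and an in-memory filehandle stands in for the "dodgytext" file. It reads Latin-1 bytes through the :encoding(latin1) layer and then calls encode_entities.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Assumed sample data: Latin-1 bytes for the name in the report.
# 0xC1 = A-acute, 0xFA = u-acute, 0xF0 = eth.
my $bytes = "Einar \xC1g\xFA Fri\xF0\n";

# Read through the Latin-1 layer so each byte is decoded to the
# character it represents. An in-memory handle replaces the real file.
open(my $fh, "<:encoding(latin1)", \$bytes)
    or die "Could not open in-memory file: $!\n";
my $line = <$fh>;
close($fh);
chomp $line;

# In non-void context encode_entities returns an encoded copy.
my $encoded = encode_entities($line);
print "$encoded\n";    # Einar &Aacute;g&uacute; Fri&eth;
```

With the input decoded as Latin-1, all three high-bit characters come out as named entities, which matches the "as expected" result described above.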

Tue Mar 21 06:48:44 2006 The RT System itself - Status changed from 'new' to 'open'

Tue Mar 21 06:48:45 2006 GAAS [...] cpan.org - Status changed from 'open' to 'rejected'