Subject: HTML::Strip doesn't handle latin1 encoded strings correctly
Date: Wed, 17 Dec 2014 15:58:05 +0100
To: bug-HTML-Strip [...] rt.cpan.org
From: "Klaus S. Madsen" <ksm [...] jobindex.dk>
Hi,
If HTML::Strip is passed a non-decoded string, it appears to handle it
as if it were UTF-8 encoded. Ordinarily perl will assume that a
non-decoded string is in latin1.
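To make the distinction concrete, here is a minimal sketch using only core modules (it is not part of the attached script). It assumes that "decoded" means a string passed through Encode::decode, and shows that perl treats the byte string and the decoded string as the same characters; only the internal UTF8 flag differs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xf8" is the character ø (code point 248), stored here as a single byte.
my $latin1 = "<p>\xf8</p>";

# Decoding from latin1 gives the same characters, now stored with perl's
# internal UTF8 flag switched on.
my $decoded = decode('iso-8859-1', $latin1);

printf "latin1 has UTF8 flag:  %s\n", utf8::is_utf8($latin1)  ? "yes" : "no";
printf "decoded has UTF8 flag: %s\n", utf8::is_utf8($decoded) ? "yes" : "no";
printf "strings compare equal: %s\n", $latin1 eq $decoded     ? "yes" : "no";
printf "lengths: %d and %d\n", length($latin1), length($decoded);
```

Since perl itself considers the two strings equal, a module that processes them should ideally give the same result for both.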
The attached script illustrates the problem. If run from a UTF-8
terminal, the output is more readable if the script is run with "perl
-CO ./test_html_strip.pl", so that perl automatically UTF-8 encodes the
output.
The script tries to strip the tags from the following two strings:
"<p>ø</p>" and "<p>æ</p>" (the strings in the script escape these two
characters just to prevent any encoding confusion).
It strips each string twice: once as a latin1 encoded string, and once
as a decoded string. If HTML::Strip behaved like perl with regard to
encoding, there would be no difference in the output between the latin1
encoded string and the decoded string. The program's output shows that
there is a difference.
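For readers without access to the attachment, the following is a hypothetical reconstruction of such a test, not the attached script itself. It assumes the decoded variants are produced with Encode::decode, and it uses only the new, parse and eof methods from the HTML::Strip synopsis. The labels mirror the output quoted in this report.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Strip;

# Hypothetical reconstruction of the test described above.
# \xf8 is ø and \xe6 is æ; both are escaped to avoid source-encoding confusion.
my @inputs = ("<p>\xf8</p>", "<p>\xe6</p>");

for my $latin1 (@inputs) {
    # Same characters as $latin1, but stored as a decoded (UTF8-flagged) string.
    my $decoded = decode('iso-8859-1', $latin1);

    for my $case (['latin1 encoded', $latin1], ['decoded', $decoded]) {
        my ($label, $html) = @$case;

        # Use a fresh parser for every run so no state carries over.
        my $hs = HTML::Strip->new();
        print "Testing $label string: $html\n";
        my $text = $hs->parse($html);
        $hs->eof;
        print "Output: $text\n";
    }
}
```

As with the attached script, running this with "perl -CO" makes the output readable on a UTF-8 terminal.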
On my Ubuntu 14.10 system with HTML::Strip 2.08 installed, the output
from the script is the following:
Testing latin1 encoded string: <p>ø</p>
[WARN] invalid utf8 char ord=248
Output: ø
Testing decoded string: <p>ø</p>
Output: ø
Testing latin1 encoded string: <p>æ</p>
Output: æ</p>
Testing decoded string: <p>æ</p>
Output: æ
Btw. the recent work on making HTML::Strip handle decoded strings is
very much appreciated!
--
Klaus S. Madsen, Udvikler, ksm@jobindex.dk
Jobindex A/S, Holger Danskes Vej 91, 2000 Frederiksberg
Tlf +45 38 32 33 55, Dir +45 38 32 33 70
http://www.jobindex.dk/