
This queue is for tickets about the HTML-Strip CPAN distribution.

Report information
The Basics
Id: 100969
Status: resolved
Priority: 0
Queue: HTML-Strip

People
Owner: KILINRAX [...] cpan.org
Requestors: ksm [...] jobindex.dk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 2.09



Subject: HTML::Strip doesn't handle latin1 encoded strings correctly
Date: Wed, 17 Dec 2014 15:58:05 +0100
To: bug-HTML-Strip [...] rt.cpan.org
From: "Klaus S. Madsen" <ksm [...] jobindex.dk>
Hi,

If HTML::Strip is passed a non-decoded string, it appears to handle it as if it were UTF-8 encoded. Ordinarily perl will assume that a non-decoded string is in latin1.

The attached script illustrates the problem. If run from a UTF-8 terminal, the output is more readable if the script is run with "perl -CO ./test_html_strip.pl", so that perl automatically UTF-8 encodes the output.

The script tries to strip the tags from the following two strings: "<p>ø</p>" and "<p>æ</p>" (the strings in the script escape these two characters just to prevent any encoding confusion).

It strips each string twice, once as a latin1 encoded string and once as a decoded string. If HTML::Strip behaved like perl with regard to encoding, there would be no difference in the output between the latin1 encoded string and the decoded string. The program's output shows that there is one.

On my Ubuntu 14.10 system with HTML::Strip 2.08 installed, the output from the script is the following:

    Testing latin1 encoded string: <p>ø</p>
    [WARN] invalid utf8 char ord=248
    Output: ø
    Testing decoded string: <p>ø</p>
    Output: ø
    Testing latin1 encoded string: <p>æ</p>
    Output: æ</p>
    Testing decoded string: <p>æ</p>
    Output: æ

Btw. the recent work on making HTML::Strip handle decoded strings is very much appreciated!

--
Klaus S. Madsen, Udvikler, ksm@jobindex.dk
Jobindex A/S, Holger Danskes Vej 91, 2000 Frederiksberg
Tlf +45 38 32 33 55, Dir +45 38 32 33 70
http://www.jobindex.dk/
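One possible reading of the output in the report, offered as an assumption rather than a confirmed diagnosis of HTML::Strip 2.08: if the latin1 bytes are scanned as UTF-8, the byte for ø (0xF8, ord 248) is not a valid UTF-8 lead byte, which would match the warning, while the byte for æ (0xE6) looks like the start of a three-byte sequence, so the following "<" and "/" could be consumed as part of one character and the closing tag would never be seen. The Python sketch below only demonstrates those byte-level facts; the helper name `utf8_sequence_length` is hypothetical and is not part of HTML::Strip.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by a UTF-8 lead byte, or 0 if the
    byte cannot start a valid UTF-8 sequence."""
    if lead_byte < 0x80:
        return 1  # plain ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    return 0  # continuation bytes, overlong leads, and 0xF5..0xFF


# The two test strings from the report, encoded as latin1 bytes.
samples = {
    "o-slash": "<p>\u00f8</p>".encode("latin-1"),
    "ae": "<p>\u00e6</p>".encode("latin-1"),
}

for name, raw in samples.items():
    lead = raw[3]  # the single non-ASCII byte, right after "<p>"
    print(name, hex(lead), "implied UTF-8 length:", utf8_sequence_length(lead))
```

Neither byte string is valid UTF-8, so any scanner that assumes UTF-8 input has to either warn or misread the bytes that follow.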

Message body is not shown because sender requested not to inline it.

On Wed Dec 17 09:58:18 2014, ksm@jobindex.dk wrote:
> Hi,
>
> If HTML::Strip is passed a non-decoded string, it appears to handle it as
> if it were UTF-8 encoded. Ordinarily perl will assume that a non-decoded
> string is in latin1.
>
> The attached script illustrates the problem. If run from a UTF-8
> terminal, the output is more readable if the script is run with "perl
> -CO ./test_html_strip.pl", so that perl automatically UTF-8 encodes the
> output.
>
> The script tries to strip the tags from the following two strings:
> "<p>ø</p>" and "<p>æ</p>" (the strings in the script escape these two
> characters just to prevent any encoding confusion).
>
> It strips each string twice, once as a latin1 encoded string and once
> as a decoded string. If HTML::Strip behaved like perl with regard to
> encoding, there would be no difference in the output between the latin1
> encoded string and the decoded string. The program's output shows that
> there is one.
>
> On my Ubuntu 14.10 system with HTML::Strip 2.08 installed, the output
> from the script is the following:
>
> Testing latin1 encoded string: <p>ø</p>
> [WARN] invalid utf8 char ord=248
> Output: ø
> Testing decoded string: <p>ø</p>
> Output: ø
> Testing latin1 encoded string: <p>æ</p>
> Output: æ</p>
> Testing decoded string: <p>æ</p>
> Output: æ
The version that's up wasn't designed to work with latin1. It should be possible for me to rewrite it to test the input string for unicode-ness, and handle non-ASCII as latin-1 or UTF-8, depending on the result. I should probably update the docs, too.

You might have to wait a couple of weeks for a new release, unfortunately, given the time of year.
> Btw. the recent work on making HTML::Strip handle decoded strings is
> very much appreciated!
You're very welcome, glad it's proving useful!
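The approach described in this reply (check whether the input is a decoded string, and otherwise treat non-ASCII bytes as latin-1) can be sketched by analogy. This is not the HTML::Strip implementation: Python's `bytes`/`str` split stands in for perl's decoded/non-decoded distinction, the regex is a deliberately naive tag stripper, and the names `to_text` and `strip_tags` are hypothetical.

```python
import re


def to_text(data):
    """Normalise input to text before any tag handling.

    Assumption for this sketch: undecoded input (bytes) is latin-1, which
    mirrors how perl treats a non-decoded string. latin-1 maps every byte
    to exactly one code point, so this decode cannot fail.
    """
    if isinstance(data, bytes):
        return data.decode("latin-1")
    return data  # already decoded text


def strip_tags(data):
    """Naive tag removal, operating on characters rather than raw bytes."""
    return re.sub(r"<[^>]*>", "", to_text(data))


# Both forms of the same string now give the same result.
print(strip_tags(b"<p>\xe6</p>") == strip_tags("<p>\u00e6</p>"))
```

The design point is that the encoding decision happens once, at the boundary, so the stripping logic never has to guess how many bytes make up a character.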
RT-Send-CC: ksm [...] jobindex.dk
On Mon Dec 22 12:26:37 2014, KILINRAX wrote:
> On Wed Dec 17 09:58:18 2014, ksm@jobindex.dk wrote:
> > If HTML::Strip is passed a non-decoded string, appears to handle it as
> > if it is UTF8-encoded. Ordinarily perl will assume that a non-decoded
> > string is in latin1.
>
> The version that's up wasn't designed to work with latin1. It should
> be possible for me to rewrite it to test the input string for
> unicode-ness, and handle non-ascii as latin-1 or utf-8, depending.
Version 2.09 is now up, which fixes this behaviour.