Bug #100969 for HTML-Strip: HTML::Strip doesn't handle latin1 encoded strings correctly

Wed Dec 17 09:58:18 2014 ksm [...] jobindex.dk - Ticket created

Subject:	HTML::Strip doesn't handle latin1 encoded strings correctly
Date:	Wed, 17 Dec 2014 15:58:05 +0100
To:	bug-HTML-Strip [...] rt.cpan.org
From:	"Klaus S. Madsen" <ksm [...] jobindex.dk>

Hi,

If HTML::Strip is passed a non-decoded string, it appears to handle it as if it were UTF-8 encoded. Ordinarily perl will assume that a non-decoded string is in latin1.

The attached script illustrates the problem. If run from a UTF-8 terminal, the output is more readable if the script is run with "perl -CO ./test_html_strip.pl", so that perl automatically UTF-8 encodes the output.

The script tries to strip the tags from the following two strings: "ø" and "æ" (the strings in the script escape these two characters just to prevent any encoding confusion).

It strips each string twice, once as the latin1 encoded string and once as the decoded string. If HTML::Strip behaved like perl with regard to encoding, there shouldn't be any differences in the output between the latin1 encoded string and the decoded string, but the program's output shows that there are.

On my Ubuntu 14.10 system with HTML::Strip 2.08 installed, the output from the script is the following:

Testing latin1 encoded string: ø
[WARN] invalid utf8 char ord=248
Output: ø
Testing decoded string: ø
Output: ø
Testing latin1 encoded string: æ
Output: æ
Testing decoded string: æ
Output: æ

Btw. the recent work on making HTML::Strip handle decoded strings is very much appreciated!

--
Klaus S. Madsen, Udvikler, ksm@jobindex.dk
Jobindex A/S, Holger Danskes Vej 91, 2000 Frederiksberg
Tlf +45 38 32 33 55, Dir +45 38 32 33 70
http://www.jobindex.dk/

Message body is not shown because sender requested not to inline it.
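
The attachment itself is not reproduced here. As a rough sketch of the Perl behaviour the report relies on (this is not the reporter's script), the following uses only core functions to show that a non-decoded latin1 string and its upgraded, UTF8-flagged counterpart are the same text as far as perl is concerned. The helper name `describe` and the sample input `<b>\xf8</b>` are assumptions made for illustration. The HTML::Strip part runs only if the module happens to be installed, and what it prints depends on the installed version.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise a string without printing non-ASCII bytes.
sub describe {
    my ($s) = @_;
    return sprintf 'utf8_flag=%d length=%d ords=%s',
        (utf8::is_utf8($s) ? 1 : 0), length($s),
        join(',', map { ord } split //, $s);
}

# "\xf8" is U+00F8 (ø). With the UTF8 flag off, perl treats it as latin1.
my $latin1 = "<b>\xf8</b>";    # assumed sample input, not taken from the ticket

# Same characters, but stored internally as UTF-8 (UTF8 flag on).
my $upgraded = $latin1;
utf8::upgrade($upgraded);

print 'latin1:   ', describe($latin1),   "\n";
print 'upgraded: ', describe($upgraded), "\n";
print 'equal:    ', ($latin1 eq $upgraded ? 'yes' : 'no'), "\n";

# Optional: feed both forms to HTML::Strip when it is available.
if (eval { require HTML::Strip; 1 }) {
    for my $input ($latin1, $upgraded) {
        my $hs  = HTML::Strip->new();
        my $out = $hs->parse($input);
        $hs->eof;
        print 'stripped: ', describe($out), "\n";
    }
}
```

If the two "stripped" lines differ on a given installation, that installation shows the kind of discrepancy the report describes.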

Mon Dec 22 12:26:37 2014 KILINRAX [...] cpan.org - Correspondence added

On Wed Dec 17 09:58:18 2014, ksm@jobindex.dk wrote:
> Hi,
>
> If HTML::Strip is passed a non-decoded string, appears to handle it as
> if it is UTF8-encoded. Ordinarily perl will assume that a non-decoded
> string is in latin1.
>
> The attached script will illustrate the problem. If run from an UTF-8
> terminal, the output is more readable if the script is run with "perl
> -CO ./test_html_strip.pl", so that perl automatically UTF-8 encodes the
> output.
>
> The script tries to strip the tags from the following two strings:
> "ø" and "æ" (the strings in the script escape these two
> characters just to prevent any encoding confusion).
>
> It strips the strings two times each, once for the latin1 encoded
> string, and once for the decoded string. If HTML::Strip behaved like
> perl with regards to encoding, there shouldn't have been any differences
> in the output, between the latin1 encoded string and the decoded string,
> as illustrated by the programs output.
>
> On my Ubuntu 14.10 system with HTML::Strip 2.08 installed, the output
> from the script is the following:
>
> Testing latin1 encoded string: ø
> [WARN] invalid utf8 char ord=248
> Output: ø
> Testing decoded string: ø
> Output: ø
> Testing latin1 encoded string: æ
> Output: æ
> Testing decoded string: æ
> Output: æ

The version that's up wasn't designed to work with latin1. It should be possible for me to rewrite it to test the input string for unicode-ness, and handle non-ascii as latin-1 or utf-8, depending. I should probably update the docs, too.

You might have to wait a couple of weeks for a new release, unfortunately, given the time of year.

> Btw. the recent work on making HTML::Strip handle decoded strings is > very much appreciated!

You're very welcome, glad it's proving useful!
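
The approach outlined in the reply above (check the input's unicode-ness, and treat a string without the UTF8 flag as latin1) could look roughly like this in caller code. This is only an illustrative sketch, not the change made inside HTML::Strip. The function name `as_character_string` is hypothetical, and the sketch assumes that flag-off input really is latin1 rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Strip.
sub as_character_string {
    my ($text) = @_;                        # works on a copy of the caller's string
    return $text if utf8::is_utf8($text);   # already flagged: leave untouched
    utf8::upgrade($text);                   # store flag-off latin1 text as UTF-8 internally
    return $text;
}

my $normalised = as_character_string("caf\xe6");
printf "flag=%d length=%d\n",
    (utf8::is_utf8($normalised) ? 1 : 0), length $normalised;
```

The result could then be passed to `parse`. Whether that sidesteps the problem on a particular release is not confirmed by this ticket.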

Mon Dec 22 12:26:37 2014 The RT System itself - Status changed from 'new' to 'open'

Mon Dec 22 12:27:27 2014 KILINRAX [...] cpan.org - Taken

Tue Jan 06 06:18:01 2015 KILINRAX [...] cpan.org - Correspondence added

RT-Send-CC: ksm [...] jobindex.dk

On Mon Dec 22 12:26:37 2014, KILINRAX wrote:
> On Wed Dec 17 09:58:18 2014, ksm@jobindex.dk wrote:
> > If HTML::Strip is passed a non-decoded string, appears to handle it as
> > if it is UTF8-encoded. Ordinarily perl will assume that a non-decoded
> > string is in latin1.
>
> The version that's up wasn't designed to work with latin1. It should
> be possible for me to rewrite it to test the input string for
> unicode-ness, and handle non-ascii as latin-1 or utf-8, depending.

Version 2.09 is now up, which fixes this behaviour.

Tue Jan 06 06:18:04 2015 KILINRAX [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Jan 06 06:18:05 2015 KILINRAX [...] cpan.org - Fixed in 2.09 added