Bug #42834 for HTML-Strip: HTML::Strip breaks UTF-8

Tue Jan 27 11:57:46 2009 eugenek [...] 45-98.org - Ticket created

Subject:

HTML::Strip breaks UTF-8

HTML::Strip breaks UTF-8. See the attached file. 1.html contains correct UTF-8 text in Russian. Just run 1.pl and it outputs broken UTF-8. The result is the same with perl v5.8.5 on RHEL 4 and v5.8.8 on Etch.

Subject:

html_strip_bug_utf8.zip

Download html_strip_bug_utf8.zip
application/zip 1.8k


Tue Jan 27 12:13:58 2009 eugenek [...] 45-98.org - Correspondence added

From:

eugenek [...] 45-98.org

Fixed test case. Looks like "—" thing is the reason of this bug!

Download more_test.zip
application/zip 1.2k


Thu Jul 16 12:20:41 2009 pat [...] aers.ca - Correspondence added

From:

pat [...] aers.ca

On Tue Jan 27 12:13:58 2009, gnudist wrote:

> Fixed test case. Looks like "—" thing is the reason of this bug!

I am still seeing "broken" UTF-8. Or, more specifically, double-encoded UTF-8. In the attached example, there are two 3-byte UTF-8 characters, and they both turn into 6-byte sequences on return.

Original: E2 80 99 (RIGHT SINGLE QUOTATION MARK)
Returns as: C3 A2 C2 80 C2 99

Original: E2 80 9D (RIGHT DOUBLE QUOTATION MARK)
Returns as: C3 A2 C2 80 C2 9D
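The byte pattern reported above is what you get when UTF-8 octets are treated as Latin-1 characters and encoded to UTF-8 a second time. A minimal sketch that reproduces the first sequence with core Perl only; it does not call HTML::Strip, and the variable names are illustrative:

```perl
use strict;
use warnings;

# The three UTF-8 octets of U+2019, held in a string whose UTF-8 flag is off.
my $bytes = "\xE2\x80\x99";

# utf8::encode treats each octet as one character (E2, 80, 99) and
# encodes each of them to UTF-8 again, which doubles the length.
my $double = $bytes;
utf8::encode($double);

# Render the result as space-separated hex octets.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $double;
print "$hex\n";    # C3 A2 C2 80 C2 99
```

Each octet at or above 0x80 becomes two octets, which is why a 3-byte character comes back as 6 bytes.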

Download broken.tar
application/octet-stream 10k


Thu Jul 16 12:20:43 2009 The RT System itself - Status changed from 'new' to 'open'

Fri Aug 21 18:55:41 2009 perl [...] rainboxx.de - Correspondence added

Will there be a fix someday for this?

Sat Dec 12 05:58:53 2009 IKEGAMI [...] cpan.org - Correspondence added

Workaround:

----- BEGIN CODE -----
use strict;
use warnings;
use open ':std', ':locale';

use LWP::UserAgent qw( );
use HTML::Strip qw( );
use HTML::Entities qw( decode_entities );

my $url = $ARGV[0];

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url);
die $response->status_line() if !$response->is_success();
my $decoded_html = $response->decoded_content();

my $hs = HTML::Strip->new( decode_entities => 0 );
utf8::encode( my $utf8_html = $decoded_html );
my $utf8_text = $hs->parse( $utf8_html );
utf8::decode( my $decoded_text = $utf8_text );
$decoded_text = decode_entities($decoded_text);

print $decoded_text;
----- END CODE -----

Sat Jun 12 21:33:33 2010 mendoza [...] pvv.ntnu.no - Correspondence added

On Thu Jul 16 2009 12:20:41, plyn wrote:

> On Tue Jan 27 12:13:58 2009, gnudist wrote:
> > Fixed test case. Looks like "—" thing is the reason of this bug!
>
> I am still seeing "broken" UTF-8. Or, more specifically Double Encoded
> UTF-8.
>
> In the attached example, there are two UTF-8 3 byte characters, and they
> both turn into 6 byte characters on return.
>
> Original: E2 80 99 (RIGHT SINGLE QUOTATION MARK)
> Returns as: C3 A2 C2 80 C2 99
>
> Original: E2 80 9D (RIGHT DOUBLE QUOTATION MARK)
> Returns as: C3 A2 C2 80 C2 9D

Easily confirmed:

$ perl -wle 'use utf8; use HTML::Strip; my $str = "←↓→"; print "utf8_flag: " . utf8::is_utf8($str); my $str2 = HTML::Strip->new()->parse($str); print "utf8_flag: " . utf8::is_utf8($str2);'
utf8_flag: 1
utf8_flag:

Work around for real code:

use Encode;
use utf8;
use HTML::Strip;

my $str = "←↓→";
my $utf8_was_on = Encode::is_utf8($str);
my $str2 = HTML::Strip->new()->parse($str);
$utf8_was_on && ($HTML::Strip::VERSION <= 1.06) && Encode::_utf8_on ($str2);

Wed Jan 12 23:06:46 2011 ashley [...] netspot.com.au - Correspondence added

From:

ashley [...] netspot.com.au

None of the workarounds work in my case. See my attached test script. If you comment out the "use encoding 'utf8'" line, the encode_utf8() will get the correct string (s²). However with the "use encoding 'utf8'" line there, I can't get the correct string! Even trying all of the above workarounds. Even using HTML::Entities to decode the entities has the same problem!

Subject:

testhtmlstrip.pl

use encoding 'utf8';
use Encode;
use HTML::Strip;

my $htmlstrip = HTML::Strip->new();
my $match = {};

$text = 's²';
$text = $htmlstrip->parse($text);

print "not encoded: " . $text . "\n";
print "encoded: " . encode_utf8($text) . "\n";
print "STRIPPED TEXT: " . $text . "\n";

Fri Apr 08 18:03:42 2011 osfameron [...] cpan.org - Correspondence added

Subject:

possible workaround - HTML::Strip breaks UTF-8

I discussed this in detail with Zefram and ilmari. Here's a possible workaround, which seems to work at least in my case: https://gist.github.com/910818

Mon Apr 11 05:01:08 2011 osfameron [...] cpan.org - Correspondence added

RT-Send-CC:

ashley [...] netspot.com.au, pat [...] aers.ca

On Fri Apr 08 18:03:42 2011, OSFAMERON wrote:

> I discussed this in detail with Zefram and ilmari. Here's a possible
> workaround, which seems to work at least in my case:
>
> https://gist.github.com/910818

and here's a github repo with that workaround

Tue Nov 18 11:24:09 2014 KILINRAX [...] cpan.org - Correspondence added

RT-Send-CC:

eugenek [...] 45-98.org

A test case has been added for the input specified and the module rewritten to use libicu. UTF-8 input should now be handled properly.

Wed Nov 19 05:40:02 2014 KILINRAX [...] cpan.org - Correspondence added

On Tue Nov 18 11:24:09 2014, KILINRAX wrote:

> A test case has been added for the input specified and the module
> rewritten to use libicu.
>
> UTF-8 input should now be handled properly.

Wed Nov 19 05:40:04 2014 KILINRAX [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Nov 19 05:40:05 2014 KILINRAX [...] cpan.org - Fixed in 2.00 added

Sat Apr 02 16:41:15 2016 pali [...] cpan.org - Cc PALI added

Sat Apr 02 16:48:55 2016 pali [...] cpan.org - Correspondence added

All of the workarounds in this bug tracker are incorrect. The real problem is that HTML::Strip (prior to 2.00) returns a UTF-8 encoded string but forgets to set the utf8 flag, so perl thinks it is latin1. Any later encoding, decoding, downgrading, etc. will just destroy or change characters.

So the correct workaround for this bug is to set the utf8 flag on the output if the input was utf8:

use Encode;
use HTML::Strip;
use HTML::Entities qw(decode_entities);

my $input = ...

my $output = HTML::Strip->new(decode_entities => 0)->parse($input);
if ( $HTML::Strip::VERSION < 2.00 and Encode::is_utf8($input) ) {
    Encode::_utf8_on($output);
}
$output = decode_entities($output);

The bug was fixed in HTML::Strip 2.00, and the code above handles that case.
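The effect of the missing flag can be illustrated without HTML::Strip installed. In this sketch, Encode::encode_utf8() is only a stand-in (an assumption made for illustration) for the return value described above: the same octets as the input, with the utf8 flag off.

```perl
use strict;
use warnings;
use Encode ();

# A character string containing U+2019; perl stores it with the utf8 flag on.
my $input = "\x{2019}";

# ASSUMPTION: stand-in for the pre-2.00 return value, i.e. the same octets
# (E2 80 99) with the utf8 flag off. HTML::Strip itself is not called here.
my $output = Encode::encode_utf8($input);

# With the flag off, perl sees three Latin-1 characters.
my $len_before = length($output);

# Setting the flag makes perl read the same octets as one character again.
Encode::_utf8_on($output) if Encode::is_utf8($input);
my $len_after = length($output);

print "before: $len_before, after: $len_after, equal: ",
    ( $output eq $input ? "yes" : "no" ), "\n";
```

Encode::_utf8_on() is documented as an internal function and does not validate the octets, so it is only appropriate when the buffer is already known to hold well-formed UTF-8, as in the case described in this ticket.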

Sun Apr 03 12:35:47 2016 dmuey [...] cpan.org - Correspondence added

Just FTR, you can do this without the overhead of Encode (or one of its internal functions): just use utf8::is_utf8() && utf8::decode() respectively. Although there seems to be a slight bug in that code example: it’s essentially saying "turn this into a character string if it’s a character string". You probably want "turn this into a character string if it’s a byte string". HTH!

On Sat Apr 02 16:48:55 2016, PALI wrote:

> In this bug tracker are all workarounds incorrect. Real problem is
> that HTML:Strip (prior 2.00) returns utf8 encoded string, but forget
> to set utf8 flag. So perl thinks it is latin1. Any later
> encoding/decoding/downgrading/etc.. will just destroy or change
> characters.
>
> So correct workaround for this bug is to set utf8 flag for output if
> input was utf8:
>
> use Encode;
> use HTML::Strip;
> use HTML::Entities qw(decode_entities);
>
> my $input = ...
>
> my $output = HTML::Strip->new(decode_entities => 0)->parse($input);
> if ( $HTML::Strip::VERSION < 2.00 and Encode::is_utf8($input) ) {
> Encode::_utf8_on($output);
> }
> $output = decode_entities($output);
>
> Bug was fixed in HTML::Strip 2.00 and above code handle it.

Sun Apr 03 12:51:35 2016 pali [...] cpan.org - Correspondence added

On Sun Apr 03 12:35:47 2016, DMUEY wrote:

> Just FTR, you can do this w/ out the over head of Encode (or one of
> its internal functions): Just use utf8::is_utf8() && utf8::decode()
> respectively.

Nope, utf8::decode() performs a decode, which will damage the output. You need to call the Perl equivalent of the SvUTF8_on() function.

> Although there seems to be a slight bug in that code example, it’s
> essentially saying turn this into a character string if it’s a
> character string.

No bug, this is exactly what is needed. If the input is utf8, the output is also utf8. HTML::Strip prior to 2.00 just forgot to set the utf8 flag, so you need to set it manually.

Mon Apr 04 10:31:52 2016 dmuey [...] cpan.org - Correspondence added

On Sun Apr 03 12:51:35 2016, PALI wrote:

> On Sun Apr 03 12:35:47 2016, DMUEY wrote:
> > Just FTR, you can do this w/ out the over head of Encode (or one of
> > its internal functions): Just use utf8::is_utf8() && utf8::decode()
> > respectively.
>
> Nope, utf8::decode() will do decode which will damage output. You need
> to call perl equivalent of SvUTF8_on() function.

ah ok, so maybe utf8::upgrade()? anywho, glad its solved :)
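On the utf8::upgrade() question: a small sketch, using core Perl only, of what utf8::upgrade() and utf8::decode() each do to a string of UTF-8 octets whose flag is off. The octets are for U+2019, taken from the earlier report, and the variable names are illustrative.

```perl
use strict;
use warnings;

# UTF-8 octets for U+2019 in a string whose utf8 flag is off.
my $octets = "\xE2\x80\x99";

# utf8::upgrade changes only the internal representation. The string still
# holds the three characters U+00E2, U+0080, U+0099, now with the flag on.
my $upgraded = $octets;
utf8::upgrade($upgraded);

# utf8::decode reinterprets the octets as UTF-8, giving one character.
my $decoded = $octets;
utf8::decode($decoded);

printf "upgraded: length %d, flag %d\n",
    length($upgraded), ( utf8::is_utf8($upgraded) ? 1 : 0 );
printf "decoded:  length %d, U+%04X\n", length($decoded), ord($decoded);
```

So utf8::upgrade() turns the flag on but keeps the Latin-1 interpretation of each octet, which is not the same as marking the existing buffer as UTF-8.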