
This queue is for tickets about the HTML-Strip CPAN distribution.

Report information
The Basics
Id: 42834
Status: resolved
Priority: 0/
Queue: HTML-Strip

People
Owner: Nobody in particular
Requestors: eugenek [...] 45-98.org
Cc: pali [...] cpan.org
AdminCc:

Bug Information
Severity: Important
Broken in: 1.06
Fixed in: 2.00



Subject: HTML::Strip breaks UTF-8
HTML::Strip breaks UTF-8. See the attached file: 1.html contains correct UTF-8 text in Russian. Just run 1.pl and it outputs broken UTF-8. Same result with perl v5.8.5 on RHEL 4 and v5.8.8 on Etch.
Subject: html_strip_bug_utf8.zip
Download html_strip_bug_utf8.zip
application/zip 1.8k


From: eugenek [...] 45-98.org
Fixed test case. Looks like "—" thing is the reason of this bug!
Download more_test.zip
application/zip 1.2k


From: pat [...] aers.ca
On Tue Jan 27 12:13:58 2009, gnudist wrote:
> Fixed test case. Looks like "—" thing is the reason of this bug!

I am still seeing "broken" UTF-8. Or, more specifically, double-encoded UTF-8.

In the attached example, there are two 3-byte UTF-8 characters, and they both turn into 6-byte sequences on return.

Original: E2 80 99 (RIGHT SINGLE QUOTATION MARK)
Returns as: C3 A2 C2 80 C2 99

Original: E2 80 9D (RIGHT DOUBLE QUOTATION MARK)
Returns as: C3 A2 C2 80 C2 9D
Download broken.tar
application/octet-stream 10k
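
For reference, the 6-byte pattern reported above is what comes out when UTF-8 octets are treated as Latin-1 characters and encoded as UTF-8 a second time. A minimal sketch using only the core Encode module; it does not call HTML::Strip, and that the module's output ends up going through this path is an assumption:

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 octets of U+2019 RIGHT SINGLE QUOTATION MARK, held as a byte string.
# With no UTF-8 flag set, Perl treats each octet as one Latin-1 character.
my $bytes = "\xE2\x80\x99";

# Encoding those three "characters" as UTF-8 again turns each into two octets.
my $double = encode('UTF-8', $bytes);

# Hex dump of the result, one pair of digits per octet.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $double;
print "$hex\n";    # C3 A2 C2 80 C2 99
```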


Will there be a fix someday for this?
Workaround:

----- BEGIN CODE -----
use strict;
use warnings;
use open ':std', ':locale';

use LWP::UserAgent qw( );
use HTML::Strip qw( );
use HTML::Entities qw( decode_entities );

my $url = $ARGV[0];

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url);
die $response->status_line() if !$response->is_success();
my $decoded_html = $response->decoded_content();

my $hs = HTML::Strip->new( decode_entities => 0 );
utf8::encode( my $utf8_html = $decoded_html );
my $utf8_text = $hs->parse( $utf8_html );
utf8::decode( my $decoded_text = $utf8_text );
$decoded_text = decode_entities($decoded_text);

print $decoded_text;
----- END CODE -----
On Thu 16 July 2009 12:20:41, plyn wrote:
> On Tue Jan 27 12:13:58 2009, gnudist wrote:
> > Fixed test case. Looks like "—" thing is the reason of this bug!
>
> I am still seeing "broken" UTF-8. Or, more specifically Double Encoded
> UTF-8.
>
> In the attached example, there are two UTF-8 3 byte characters, and they
> both turn into 6 byte characters on return.
>
> Original: E2 80 99 (RIGHT SINGLE QUOTATION MARK)
> Returns as: C3 A2 C2 80 C2 99
>
> Original: E2 80 9D (RIGHT DOUBLE QUOTATION MARK)
> Returns as: C3 A2 C2 80 C2 9D
Easily confirmed:

$ perl -wle 'use utf8; use HTML::Strip; my $str = "←↓→"; print "utf8_flag: " . utf8::is_utf8($str); my $str2 = HTML::Strip->new()->parse($str); print "utf8_flag: " . utf8::is_utf8($str2);'
utf8_flag: 1
utf8_flag:

Work around for real code:

use Encode;
use utf8;
use HTML::Strip;

my $str = "←↓→";
my $utf8_was_on = Encode::is_utf8($str);
my $str2 = HTML::Strip->new()->parse($str);
$utf8_was_on && ($HTML::Strip::VERSION <= 1.06) && Encode::_utf8_on ($str2);
From: ashley [...] netspot.com.au
None of the workarounds work in my case. See my attached test script. If you comment out the "use encoding 'utf8'" line, the encode_utf8() will get the correct string (s²). However with the "use encoding 'utf8'" line there, I can't get the correct string! Even trying all of the above workarounds. Even using HTML::Entities to decode the entities has the same problem!
Subject: testhtmlstrip.pl
use encoding 'utf8';
use Encode;
use HTML::Strip;

my $htmlstrip = HTML::Strip->new();
my $match = {};
$text = 's&sup2;';
$text = $htmlstrip->parse($text);
print "not encoded: " . $text . "\n";
print "encoded: " . encode_utf8($text) . "\n";
print "STRIPPED TEXT: " . $text . "\n";
Subject: possible workaround - HTML::Strip breaks UTF-8
I discussed this in detail with Zefram and ilmari. Here's a possible workaround, which seems to work at least in my case: https://gist.github.com/910818
RT-Send-CC: ashley [...] netspot.com.au, pat [...] aers.ca
On Fri Apr 08 18:03:42 2011, OSFAMERON wrote:
> I discussed this in detail with Zefram and ilmari. Here's a possible
> workaround, which seems to work at least in my case:
>
> https://gist.github.com/910818
and here's a github repo with that workaround
RT-Send-CC: eugenek [...] 45-98.org
A test case has been added for the input specified and the module rewritten to use libicu. UTF-8 input should now be handled properly.
On Tue Nov 18 11:24:09 2014, KILINRAX wrote:
> A test case has been added for the input specified and the module
> rewritten to use libicu.
>
> UTF-8 input should now be handled properly.
All workarounds in this bug tracker are incorrect. The real problem is that HTML::Strip (prior to 2.00) returns a UTF-8 encoded string but forgets to set the UTF-8 flag, so perl thinks it is Latin-1. Any later encoding, decoding, downgrading, etc. will just destroy or change characters.

So the correct workaround for this bug is to set the UTF-8 flag on the output if the input was UTF-8:

use Encode;
use HTML::Strip;
use HTML::Entities qw(decode_entities);

my $input = ...

my $output = HTML::Strip->new(decode_entities => 0)->parse($input);
if ( $HTML::Strip::VERSION < 2.00 and Encode::is_utf8($input) ) {
    Encode::_utf8_on($output);
}
$output = decode_entities($output);

The bug was fixed in HTML::Strip 2.00, and the code above handles that case.
Just FTR, you can do this w/ out the over head of Encode (or one of its internal functions): Just use utf8::is_utf8() && utf8::decode() respectively.

Although there seems to be a slight bug in that code example, it’s essentially saying "turn this into a character string if it’s a character string". You probably want "turn this into a character string if it’s a byte string".

HTH!

On Sat Apr 02 16:48:55 2016, PALI wrote:
> In this bug tracker are all workarounds incorrect. Real problem is
> that HTML:Strip (prior 2.00) returns utf8 encoded string, but forget
> to set utf8 flag. So perl thinks it is latin1. Any later
> encoding/decoding/downgrading/etc.. will just destroy or change
> characters.
>
> So correct workaround for this bug is to set utf8 flag for output if
> input was utf8:
>
> use Encode;
> use HTML::Strip;
> use HTML::Entities qw(decode_entities);
>
> my $input = ...
>
> my $output = HTML::Strip->new(decode_entities => 0)->parse($input);
> if ( $HTML::Strip::VERSION < 2.00 and Encode::is_utf8($input) ) {
>     Encode::_utf8_on($output);
> }
> $output = decode_entities($output);
>
> Bug was fixed in HTML::Strip 2.00 and above code handle it.
On Sun Apr 03 12:35:47 2016, DMUEY wrote:
> Just FTR, you can do this w/ out the over head of Encode (or one of
> its internal functions): Just use utf8::is_utf8() && utf8::decode()
> respectively.

Nope, utf8::decode() will decode the string, which will damage the output. You need to call the Perl equivalent of the SvUTF8_on() function.

> Although there seems to be a slight bug in that code example, it’s
> essentially saying turn this into a character string if it’s a
> character string.

No bug, this is exactly what needs to be done. If the input is UTF-8, the output is also UTF-8. HTML::Strip prior to 2.00 just forgot to set the UTF-8 flag, so you need to set it manually.
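
To illustrate the missing-flag explanation, here is a rough sketch with core modules only. `flag_dropping_parse` is a hypothetical stand-in that imitates the behaviour described for HTML::Strip before 2.00 (UTF-8 octets returned with the flag off); it is not the module's actual code.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for the pre-2.00 behaviour described in this ticket:
# return the UTF-8 octets of the input with the UTF-8 flag switched off.
sub flag_dropping_parse {
    my ($text) = @_;
    my $bytes = $text;
    utf8::encode($bytes);    # character string -> UTF-8 octets, flag off
    return $bytes;
}

my $input  = "\x{2190}\x{2193}\x{2192}";    # the three arrows used earlier
my $output = flag_dropping_parse($input);

# Without the flag, Perl counts every octet as a separate character.
my $len_before = length $output;

# Set the flag only when the input was a flagged character string.
Encode::_utf8_on($output) if Encode::is_utf8($input);

# Same octets, now interpreted as UTF-8 characters again.
my $len_after = length $output;

print "before: $len_before, after: $len_after, round trip: ",
    ( $output eq $input ? "ok" : "mismatch" ), "\n";
```

With the three arrows as input, the length goes from 9 (octets) back to 3 (characters) once the flag is set, and the result compares equal to the input.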
On Sun Apr 03 12:51:35 2016, PALI wrote:
> On Sun Apr 03 12:35:47 2016, DMUEY wrote:
> > Just FTR, you can do this w/ out the over head of Encode (or one of
> > its internal functions): Just use utf8::is_utf8() && utf8::decode()
> > respectively.
>
> Nope, utf8::decode() will do decode which will damage output. You need
> to call perl equivalent of SvUTF8_on() function.

ah ok, so maybe utf8::upgrade()? anywho, glad it's solved :)