
This queue is for tickets about the HTML-Tidy CPAN distribution.

Report information
The Basics
Id: 17451
Status: open
Priority: 0/
Queue: HTML-Tidy

People
Owner: Nobody in particular
Requestors: bhirt [...] mobygames.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



CC: Brian Hirt <bhirt [...] mobygames.com>
Subject: utf8 behavior with HTML::Tidy and clean()
Date: Thu, 2 Feb 2006 19:47:31 -0700
To: andy [...] petdance.com
From: Brian Hirt <bhirt [...] mobygames.com>
Andy,

I've been using HTML::Tidy and it's working well. I did run across something that seems like strange behavior with UTF8 strings. When you pass a UTF8 string into clean(), clean() returns a non-utf8 string. The data is there in octets, but you have to manually decode the octets into a UTF8 string. So basically my code ends up looking like this:

    my $tidy = HTML::Tidy->new( { config_file => $tidyConf } );
    my $tidyClean = $tidy->clean( $tidyCheck );
    $tidyClean = Encode::decode_utf8($tidyClean,Encode::FB_CROAK);

I didn't know if this is how the API is supposed to work, but it seems reasonable that if a UTF8 string is supplied to clean, one should be returned. Requiring Encode would require perl 5.8. Currently your module supports perl 5.6.

Tidy.pm could easily be changed to have an extra line added (around 258):

    $cleaned = Encode::decode_utf8($cleaned,Encode::FB_CROAK) if Encode::is_utf8($text);

Any thoughts or ideas?

--brian
Subject: Re: [rt.cpan.org #17451] utf8 behavior with HTML::Tidy and clean()
Date: Thu, 2 Feb 2006 19:02:13 -0800
To: bug-HTML-Tidy [...] rt.cpan.org
From: Andy Lester <andy [...] petdance.com>
> Tidy.pm could easily be changed to have an extra line added (around
> 258)
>
> $cleaned = Encode::decode_utf8($cleaned,Encode::FB_CROAK) if
> Encode::is_utf8($text);
Sounds reasonable. I'm no utf8 guy, but I do need to revisit HTML::Tidy some time soon.

xoa

--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance
From: moseley [...] hank.org
Andy,

I've run into this problem, too, and I think this needs to be addressed. One problem is that the tidy config can be used to re-encode the HTML, so we probably need some way to specify tidy's output encoding so that the result can be decoded correctly. We can't depend on the input encoding being the same as the output encoding.
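To illustrate the point about output encoding, here is a minimal sketch of decoding clean()'s octets with an explicitly named encoding rather than assuming UTF-8. The helper name decode_tidy_output and the sample strings are hypothetical and not part of HTML::Tidy; the caller is assumed to know which encoding the tidy config emits.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of HTML::Tidy): turn the octets that
# clean() returns into a Perl character string, given the name of the
# encoding tidy was configured to emit.
sub decode_tidy_output {
    my ( $octets, $output_encoding ) = @_;
    return undef unless defined $octets;

    # FB_CROAK dies on malformed input instead of silently substituting;
    # LEAVE_SRC keeps Encode from modifying $octets while decoding.
    return Encode::decode( $output_encoding, $octets,
        Encode::FB_CROAK | Encode::LEAVE_SRC );
}

# Stand-ins for what clean() might hand back under two different configs.
my $utf8_octets   = "<p>caf\xC3\xA9</p>\n";    # UTF-8 output
my $latin1_octets = "<p>caf\xE9</p>\n";        # ISO-8859-1 output

my $from_utf8   = decode_tidy_output( $utf8_octets,   'UTF-8' );
my $from_latin1 = decode_tidy_output( $latin1_octets, 'ISO-8859-1' );
```

Both calls should yield the same character string, which is the property a fix would need regardless of how tidy re-encodes the document.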
The behavior is the same if you use ISO-8859-1, for example. I put in ISO and get back UTF8 (internally the utf8 flag is changed).
Another encoding problem: it seems that Tidy wants ISO. If you put in UTF8, the result is corrupt (unreadable high characters).
From: paul [...] city-fan.org
On Thu Feb 02 22:00:08 2006, bhirt@mobygames.com wrote:
> I've been using HTML::Tidy and it's working well. I did run across
> something that seems like strange behavior with UTF8 string. When
> you pass a UTF8 string into clean(), clean() returns a non-utf8
> string, the data is there in octets, but you have to manually decode
> the octets into a UTF8 string. So basically my code ends up looking
> like this:
>
> my $tidy = HTML::Tidy->new( { config_file => $tidyConf } );
> my $tidyClean = $tidy->clean( $tidyCheck );
> $tidyClean = Encode::decode_utf8($tidyClean,Encode::FB_CROAK);
>
>
> I didn't know if this is how the API is supposed to work, but it
> seems reasonable that if a UTF8 string is supplied to clean, one
> should be returned. Requiring Encode would require perl 5.8.
> Currently your module supports perl 5.6.
>
> Tidy.pm could easily be changed to have an extra line added (around 258)
>
> $cleaned = Encode::decode_utf8($cleaned,Encode::FB_CROAK) if
> Encode::is_utf8($text);
HTML::Tidy 1.50 (and probably some of the older releases) actually require perl 5.8.1 as they use utf8::is_utf8. Switching to Encode::is_utf8 would help for users of 5.8.0.

Having built HTML::Tidy for a bunch of different Red Hat perl versions, I found that the problem described in this ticket manifested itself as test failures in t/unicode.t for perl versions < 5.8.5. The attached patch uses Brian's approach (albeit patching the test rather than the HTML::Tidy code itself) to resolve the issue, and gets the package building for everything back to RHEL3 with perl 5.8.0, although I resorted to skipping the unicode test in 5.8.0.
Subject: HTML-Tidy-1.50-unicode.patch
--- HTML-Tidy-1.50/lib/HTML/Tidy.pm 2010-02-16 18:00:11.000000000 +0000
+++ HTML-Tidy-1.50/lib/HTML/Tidy.pm 2010-02-22 14:32:34.044482810 +0000
@@ -4,6 +4,7 @@
 use strict;
 use warnings;
 use Carp ();
+use Encode;
 use HTML::Tidy::Message;
@@ -219,7 +220,7 @@
     }
     my $html = join( '', @_ );
-    utf8::encode($html) unless utf8::is_utf8($html);
+    utf8::encode($html) unless Encode::is_utf8($html);
     my ($errorblock,$newline) = _tidy_messages( $html, $self->{config_file}, $self->{tidy_options}
@@ -304,7 +305,7 @@
     }
     my $text = join( '', @_ );
-    utf8::encode($text) unless utf8::is_utf8($text);
+    utf8::encode($text) unless Encode::is_utf8($text);
     if ( defined $text ) {
         $text .= "\n";
     }
--- HTML-Tidy-1.50/t/unicode.t 2010-02-16 15:48:40.000000000 +0000
+++ HTML-Tidy-1.50/t/unicode.t 2010-02-22 15:04:37.023464001 +0000
@@ -4,9 +4,14 @@
 use warnings;
 use strict;
-use Test::More tests => 7;
+use Encode;
+use Test::More;
+# utf8:is_utf8 only available from perl 5.8.1
 BEGIN {
+    plan skip_all => "This test requires perl >= 5.8.1"
+        if $] < 5.008001;
+    plan tests => 7;
     use_ok( 'HTML::Tidy' );
 }
@@ -29,6 +34,8 @@
 ok(utf8::is_utf8($reference), 'reference is utf8');
 my $clean = $tidy->clean( $html );
+# Need to manually reconvert to utf8 with perl < 5.8.5 (CPAN RT#17451)
+$clean = Encode::decode_utf8($clean, Encode::FB_CROAK) if $^V lt 5.8.5;
 ok(utf8::is_utf8($clean), 'cleaned output is also unicode');
 $clean =~ s/"HTML Tidy.+w3\.org"/"Tidy"/;
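
The patch guards the skip with a numeric comparison on $] but guards the decode with a string comparison on $^V. As a rough sketch only, the same decode guard could be written with $], so that both checks use one style. The helper name needs_manual_decode and the sample octets are hypothetical; the 5.8.5 threshold is taken from the comment above and has not been independently verified.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true when the given perl version (numeric form,
# as found in $]) is older than 5.8.5, the threshold reported above.
sub needs_manual_decode {
    my ($perl_version) = @_;
    return $perl_version < 5.008005 ? 1 : 0;
}

# Stand-in for the octets that clean() returned on the affected perls.
my $clean = "caf\xC3\xA9";

# Decode only on old perls, and only if the string is not already
# flagged as a character string.
$clean = Encode::decode_utf8( $clean, Encode::FB_CROAK )
    if needs_manual_decode($]) && !Encode::is_utf8($clean);
```

On any perl from 5.8.5 onward the guard is false and $clean is left untouched.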