On Thu Feb 02 22:00:08 2006, bhirt@mobygames.com wrote:
> I've been using HTML::Tidy and it's working well. I did run across
> something that seems like strange behavior with UTF8 strings. When
> you pass a UTF8 string into clean(), clean() returns a non-utf8
> string, the data is there in octets, but you have to manually decode
> the octets into a UTF8 string. So basically my code ends up looking
> like this:
>
> my $tidy = HTML::Tidy->new( { config_file => $tidyConf } );
> my $tidyClean = $tidy->clean( $tidyCheck );
> $tidyClean = Encode::decode_utf8($tidyClean,Encode::FB_CROAK);
>
>
> I didn't know if this is how the API is supposed to work, but it
> seems reasonable that if a UTF8 string is supplied to clean, one
> should be returned. Requiring Encode would require perl 5.8.
> Currently your module supports perl 5.6.
>
> Tidy.pm could easily be changed to have an extra line added (around 258)
>
> $cleaned = Encode::decode_utf8($cleaned,Encode::FB_CROAK) if
> Encode::is_utf8($text);
HTML::Tidy 1.50 (and probably some of the older releases) actually
requires perl 5.8.1, as it uses utf8::is_utf8. Switching to
Encode::is_utf8 would help users of 5.8.0.
Having built HTML::Tidy for a bunch of different Red Hat perl versions,
I found that the problem described in this ticket manifested itself as
test failures in t/unicode.t for perl versions < 5.8.5. The attached
patch uses Brian's approach (albeit patching the test rather than the
HTML::Tidy code itself) to resolve the issue and gets the package
building for everything back to RHEL3 with perl 5.8.0, although I
resorted to skipping the unicode test in 5.8.0.
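For reference, here is a minimal sketch of Brian's workaround in isolation, so it can be tried without HTML::Tidy installed. The helper names clean_octets and clean_preserving_utf8 are hypothetical and not part of HTML::Tidy; clean_octets merely stands in for a clean() that hands back UTF-8 octets, which is an assumption about the behaviour described in this ticket.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for $tidy->clean(): accepts text and returns
# UTF-8 encoded octets (the behaviour reported in this ticket).
sub clean_octets {
    my ($text) = @_;
    return Encode::encode_utf8($text);
}

# Hypothetical wrapper: decode the result back into a character string
# only when the caller supplied one.
sub clean_preserving_utf8 {
    my ($text) = @_;
    my $was_utf8 = Encode::is_utf8($text);
    my $cleaned  = clean_octets($text);
    $cleaned = Encode::decode_utf8( $cleaned, Encode::FB_CROAK ) if $was_utf8;
    return $cleaned;
}

# U+263A forces perl to store this string with the UTF8 flag on.
my $input  = "caf\x{e9} \x{263A}";
my $output = clean_preserving_utf8($input);

print Encode::is_utf8($output) ? "character string\n" : "octets\n";
```

Byte strings passed in come back as octets, so callers that never dealt in character strings should see no change.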
--- HTML-Tidy-1.50/lib/HTML/Tidy.pm 2010-02-16 18:00:11.000000000 +0000
+++ HTML-Tidy-1.50/lib/HTML/Tidy.pm 2010-02-22 14:32:34.044482810 +0000
@@ -4,6 +4,7 @@
use strict;
use warnings;
use Carp ();
+use Encode;
use HTML::Tidy::Message;
@@ -219,7 +220,7 @@
}
my $html = join( '', @_ );
- utf8::encode($html) unless utf8::is_utf8($html);
+ utf8::encode($html) unless Encode::is_utf8($html);
my ($errorblock,$newline) = _tidy_messages( $html,
$self->{config_file},
$self->{tidy_options}
@@ -304,7 +305,7 @@
}
my $text = join( '', @_ );
- utf8::encode($text) unless utf8::is_utf8($text);
+ utf8::encode($text) unless Encode::is_utf8($text);
if ( defined $text ) {
$text .= "\n";
}
--- HTML-Tidy-1.50/t/unicode.t 2010-02-16 15:48:40.000000000 +0000
+++ HTML-Tidy-1.50/t/unicode.t 2010-02-22 15:04:37.023464001 +0000
@@ -4,9 +4,14 @@
use warnings;
use strict;
-use Test::More tests => 7;
+use Encode;
+use Test::More;
+# utf8::is_utf8 only available from perl 5.8.1
BEGIN {
+ plan skip_all => "This test requires perl >= 5.8.1"
+ if $] < 5.008001;
+ plan tests => 7;
use_ok( 'HTML::Tidy' );
}
@@ -29,6 +34,8 @@
ok(utf8::is_utf8($reference), 'reference is utf8');
my $clean = $tidy->clean( $html );
+# Need to manually reconvert to utf8 with perl < 5.8.5 (CPAN RT#17451)
+$clean = Encode::decode_utf8($clean, Encode::FB_CROAK) if $^V lt 5.8.5;
ok(utf8::is_utf8($clean), 'cleaned output is also unicode');
$clean =~ s/"HTML Tidy.+w3\.org"/"Tidy"/;