This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 83758
Status: rejected
Priority: 0
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: kamelkev [...] mailermailer.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 4.2
Fixed in: (no value)



Subject: HTML-Tree improperly tagging strings as UTF8
Hi,

I give the module an ASCII string via the parse method. I then call "as_HTML" and receive a string that is flagged as UTF8 even though it contains no non-ASCII characters.

This is very counterintuitive. I would expect the output string to be encoded identically to the input string, especially when the output content is identical to the input content.

Thanks,
Kevin Kamel
MailerMailer LLC
Subject: test.pl
#!/usr/bin/perl -w

use strict;
use Data::Dumper;
use HTML::TreeBuilder;

my $badstring = '<html><head></head><body><span>Text: &#x641;</span></body></html>';

my $parser = HTML::TreeBuilder->new();
$parser->store_comments(1);
$parser->parse($badstring);

my $string = $parser->as_HTML(undef," ",{});
print $string . "\n";

if (utf8::is_utf8($string)) {
    print "I AM BROKEN!\n";
}
You are giving too much weight to the utf8 flag. It is an internal implementation detail of how Perl 5 stores strings.

Your input included a non-ASCII character as an entity reference (&#x641;). During parsing, that reference was converted to the actual character, and to hold that character Perl had to store the string internally as UTF-8. When as_HTML encodes the character back into an entity, the resulting string still has the utf8 flag set, but that makes no difference to the meaning of the string.
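
To illustrate the point about the flag, here is a minimal sketch using only core Perl (no HTML::Tree involved). The sample string and the variable names are made up for illustration. It takes an ASCII-only string, forces the internal UTF-8 representation with utf8::upgrade, and compares the result with the original. If a caller really does need the flag cleared, utf8::downgrade should do that for a string whose characters all fit in one byte, which is assumed to be the case here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: pure ASCII text, with the non-ASCII character
# written as an entity reference rather than as a literal character.
my $plain    = '<span>Text: &#x641;</span>';
my $upgraded = $plain;

# Switch the copy to Perl's internal UTF-8 storage. This sets the utf8
# flag but does not change which characters the string contains.
utf8::upgrade($upgraded);

my $flag_plain    = utf8::is_utf8($plain)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded) ? 1 : 0;

# String comparison and length work on characters, not on the flag.
my $same_text   = ( $plain eq $upgraded )                 ? 1 : 0;
my $same_length = ( length($plain) == length($upgraded) ) ? 1 : 0;

# Clearing the flag again, for code that insists on an unflagged string.
my $downgraded = $upgraded;
utf8::downgrade($downgraded);
my $flag_downgraded = utf8::is_utf8($downgraded) ? 1 : 0;

print "flag on original:   $flag_plain\n";
print "flag after upgrade: $flag_upgraded\n";
print "same text:          $same_text\n";
print "same length:        $same_length\n";
print "flag after downgrade: $flag_downgraded\n";
```

The two strings differ only in how Perl stores them, so code that compares, prints, or measures them should behave the same either way. The flag starts to matter only when raw bytes are written out without an explicit encoding layer.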