On 06/21/2011 12:53 PM, xmltwig@gmail.com via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=68976>
>
> On 06/21/2011 11:56 AM, Stefan Hornburg via RT wrote:
>> Tue Jun 21 05:56:02 2011: Request 68976 was acted upon.
>> Transaction: Ticket created by HORNBURG
>> Queue: XML-Twig
>> Subject: safe_parsefile_html method crashes on UTF8 text
>> Broken in: (no value)
>> Severity: (no value)
>> Owner: Nobody
>> Requestors: racke@linuxia.de
>> Status: new
>> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=68976>
>>
>>
>> My HTML document contains the following text:
>>
>> Copyright © 2011
>>
>> Parsing this document with safe_parsefile_html results in the following
>> error:
>>
>> Parsing of undecoded UTF-8 will give garbage when decoding entities at
>> /home/racke/perl5/perlbrew/perls/perl-5.12.3/lib/site_perl/5.12.3
>>
>> Adding
>>
>> $tree->utf8_mode(1);
>>
>> as workaround to _html2xml alleviates the problem.
>
> I can't seem to be able to reproduce this bug.
>
> perl -MXML::Twig -e'use strict; use warnings; use utf8; my $t=
> XML::Twig->new->safe_parse_html( "<html><body><p>Copyright © 2011") or
> die "$@"; $t->print'
That should be fine, as the string is already UTF-8, which is not
guaranteed when you are using files.
>
> outputs the document just fine. If I put the html in a file and use
> safe_parsefile_html, no problem either.
Well, that is different here :-/.
>
> Which version of HTML::TreeBuilder are you using?
>
perl -MHTML::TreeBuilder -le 'print $HTML::TreeBuilder::VERSION'
4.2
Regards
Racke
--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team