
This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 68976
Status: open
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: racke [...] linuxia.de
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: safe_parsefile_html method crashes on UTF8 text
My HTML document contains the following text:

    Copyright © 2011

Parsing this document with safe_parsefile_html results in the following error:

    Parsing of undecoded UTF-8 will give garbage when decoding entities at /home/racke/perl5/perlbrew/perls/perl-5.12.3/lib/site_perl/5.12.3

Adding

    $tree->utf8_mode(1);

as workaround to _html2xml alleviates the problem.

Regards
Racke
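The workaround above patches XML::Twig's internal _html2xml, which is not reproduced here. As a rough illustration only, and assuming HTML::TreeBuilder is installed, the following sketch shows utf8_mode(1) (an HTML::Parser option that HTML::TreeBuilder inherits) on a standalone tree fed raw UTF-8 bytes. The sample string and variable names are made up for the example and are not taken from the reporter's document.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Raw, undecoded UTF-8 bytes: "\xC2\xA9" encodes U+00A9 (the copyright sign).
# This mimics a file read without any decoding layer (an assumption for this sketch).
my $html = "<html><body><p>Copyright \xC2\xA9 2011 &amp; later</p></body></html>";

my $tree = HTML::TreeBuilder->new;

# Tell the underlying HTML::Parser that the input is raw UTF-8 bytes, so that
# expanded entities stay compatible with the surrounding undecoded text.
$tree->utf8_mode(1);

$tree->parse($html);
$tree->eof;

# Extract the text content, then free the tree.
my $text = $tree->as_text;
$tree->delete;

print "$text\n";
```

Whether this is the right fix inside XML::Twig depends on how the file is opened there; the sketch only shows what the option does when the parser receives bytes.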
Subject: Re: [rt.cpan.org #68976] safe_parsefile_html method crashes on UTF8 text
Date: Tue, 21 Jun 2011 12:51:51 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
On 06/21/2011 11:56 AM, Stefan Hornburg via RT wrote:
> Tue Jun 21 05:56:02 2011: Request 68976 was acted upon.
> Transaction: Ticket created by HORNBURG
> Queue: XML-Twig
> Subject: safe_parsefile_html method crashes on UTF8 text
> Broken in: (no value)
> Severity: (no value)
> Owner: Nobody
> Requestors: racke@linuxia.de
> Status: new
> Ticket<URL: https://rt.cpan.org/Ticket/Display.html?id=68976>
>
>
> My HTML document contains the following text:
>
> Copyright © 2011
>
> Parsing this document with safe_parsefile_html results in the following
> error:
>
> Parsing of undecoded UTF-8 will give garbage when decoding entities at
> /home/racke/perl5/perlbrew/perls/perl-5.12.3/lib/site_perl/5.12.3
>
> Adding
>
> $tree->utf8_mode(1);
>
> as workaround to _html2xml alleviates the problem.
I can't seem to be able to reproduce this bug.

    perl -MXML::Twig -e'use strict; use warnings; use utf8; my $t= XML::Twig->new->safe_parse_html( "<html><body><p>Copyright © 2011") or die "$@"; $t->print'

outputs the document just fine. If I put the html in a file and use safe_parsefile_html, no problem either.

Which version of HTML::TreeBuilder are you using?

--
mirod
Subject: Re: [rt.cpan.org #68976] safe_parsefile_html method crashes on UTF8 text
Date: Tue, 21 Jun 2011 12:59:26 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: "Stefan Hornburg (Racke)" <racke [...] linuxia.de>
On 06/21/2011 12:53 PM, xmltwig@gmail.com via RT wrote:
> I can't seem to be able to reproduce this bug.
>
> perl -MXML::Twig -e'use strict; use warnings; use utf8; my $t=
> XML::Twig->new->safe_parse_html( "<html><body><p>Copyright © 2011") or
> die "$@"; $t->print'
That should be fine as the string is already UTF-8, which is uncertain when you are using files.
> outputs the document just fine. If I put the html in a file and use
> safe_parsefile_html, no problem either.
Well, that is different here :-/.
> Which version of HTML::TreeBuilder are you using?
    perl -MHTML::TreeBuilder -le 'print $HTML::TreeBuilder::VERSION'
    4.2

Regards
Racke

--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team
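The distinction drawn in this reply, between a literal string under `use utf8` and data coming from a file, can be sketched with the core Encode module. This is only an illustration of decoded versus undecoded text; it assumes the file was read without a decoding layer, which the ticket does not confirm, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What a file read without a decoding layer would yield (assumed here):
# the copyright sign arrives as the two bytes 0xC2 0xA9.
my $bytes = "Copyright \xC2\xA9 2011";

# Decoding turns those two bytes into the single character U+00A9,
# which is what a literal in source code under "use utf8" already is.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;                 # counts bytes
my $char_len = length $chars;                 # counts characters
my $codepoint = ord substr($chars, 10, 1);    # character after "Copyright "

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    $byte_len, $char_len, $codepoint;
```

Opening a file with the `<:encoding(UTF-8)` layer gives decoded characters in the same way, so a parser handed such data sees the same kind of string as mirod's one-liner does.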