
This queue is for tickets about the HTML-HTML5-Parser CPAN distribution.

Report information
The Basics
Id: 79019
Status: resolved
Priority: 0/
Queue: HTML-HTML5-Parser

People
Owner: perl [...] toby.ink
Requestors: karavelov [...] mail.bg
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.206
Fixed in: (no value)



Subject: Failure mode of TagSoupParser
The parser dies when trying to parse broken xhtml with namespaced attributes. This is around line 2529. Putting the condition in 'eval' fixes the problem for me.
On 2012-08-16T15:47:33+01:00, KARAVELOV wrote:
> The parser dies when trying to parse broken xhtml with namespaced
> attributes. This is around line 2529. Putting the condition in 'eval'
> fixes the problem for me.
Do you have an example document that triggers the failure? Can you attach it to this bug report?
Subject: RE: [rt.cpan.org #79019] Failure mode of TagSoupParser
Date: Sat, 18 Aug 2012 13:54:29 +0300
To: bug-HTML-HTML5-Parser [...] rt.cpan.org
From: karavelov [...] mail.bg
----- Quote from Toby Inkster via RT (bug-HTML-HTML5-Parser@rt.cpan.org), on 18.08.2012 at 09:54 -----
>>On 2012-08-16T15:47:33+01:00, KARAVELOV wrote:
>>The parser dies when trying to parse broken xhtml with namespaced
>>attributes. This is around line 2529. Putting the condition in 'eval'
>>fixes the problem for me.
>Do you have an example document that triggers the failure? Can you attach
>it to this bug report?
Here is my test case:

    perl -MURI -MHTML::HTML5::Parser -E '
      my $uri = URI->new("http://www.blitz.bg/news/article/151210");
      my $parser = HTML::HTML5::Parser->new;
      my $doc = $parser->parse_html_file($uri);'

And here is the error in TagSoupParser:

    NAMESPACE ERROR: Attribute without a prefix cannot be in a namespace at /usr/share/perl5/HTML/HTML5/Parser/TagSoupParser.pm line 2524

All the articles at www.blitz.bg are severely broken. The error is on the second line, "html xmlns:fb=....".

Attached is a minimal test case document.

--
Luben Karavelov

Message body is not shown because sender requested not to inline it.

Confirmed. I'll try to sort out a fix for this in the next few days. Your suggestion of wrapping the offending line in an eval is noted, but if possible I'd like to address the underlying cause.
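For illustration, the eval workaround discussed above might look roughly like the sketch below. The actual code around line 2524 of TagSoupParser.pm is not shown in this ticket, so `set_attribute_ns` is a hypothetical stand-in that only mimics the reported error message, `set_attribute_ns_safely` is an assumed wrapper name, and `http://example.com/ns` is a placeholder namespace URI. It shows the idea of trapping the exception for a single attribute, not the fix that was eventually released.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the DOM call that fails inside TagSoupParser.pm.
# It mimics the reported error: a namespace URI paired with an unprefixed name.
sub set_attribute_ns {
    my ($element, $ns_uri, $qname, $value) = @_;
    die "NAMESPACE ERROR: Attribute without a prefix cannot be in a namespace\n"
        if defined $ns_uri && length $ns_uri && $qname !~ /:/;
    $element->{$qname} = $value;
    return 1;
}

# Workaround: trap the exception so that one bad attribute is skipped
# instead of aborting the whole parse. Returns 1 on success, 0 if trapped.
sub set_attribute_ns_safely {
    my ($element, $ns_uri, $qname, $value) = @_;
    my $ok = eval { set_attribute_ns($element, $ns_uri, $qname, $value); 1 };
    warn "skipping attribute '$qname': $@" unless $ok;
    return $ok ? 1 : 0;
}

my %element;    # a plain hash stands in for a DOM element here
set_attribute_ns_safely(\%element, 'http://www.w3.org/1999/xlink', 'xlink:href', '#a');
set_attribute_ns_safely(\%element, 'http://example.com/ns', 'fb', 'x');    # trapped
print join(',', sort keys %element), "\n";
```

In this sketch only the prefixed attribute is stored; the unprefixed one produces a warning and is dropped. That silent loss of data is one reason to prefer addressing the underlying cause over wrapping the call in eval.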
On 2012-08-18T15:56:55+01:00, TOBYINK wrote:
> Confirmed. I'll try to sort out a fix for this in the next few days.
>
> Your suggestion of wrapping the offending line in an eval is noted, but
> if possible I'd like to address the underlying cause.
I've just uploaded a development release (0.207_01) to CPAN. It seems to work both for the minimal test document and for the blitz.bg page.

https://metacpan.org/release/TOBYINK/HTML-HTML5-Parser-0.207_01

If you have the time, please give it a try and let me know if it works for you. Assuming all is well, a stable 0.208 will be out in a few days.