
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 76021
Status: open
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: 'spro^^*%*^6ut# [...] &$%*c
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bad parsing of <form><body>
Yes, I know this is malformed HTML, but what HTML::Tree does is different from Safari, Firefox and HTML::HTML5::Parser:

$ perl -MHTML::TreeBuilder -le '$h = new HTML::TreeBuilder; $h->parse("<form><body><input name=foo value=bar>"); print $h->as_HTML'
<html><head></head><body><form></form><input name="foo" value="bar" /></body></html>

Notice how the input ends up outside the form. This is because, when the extraneous <body> tag is encountered, the current parsing/insertion position is set to the body element.

According to the HTML 5 specification (if you can call it that yet), the current insertion position (‘stack of open elements’) does not change when an extraneous <body> tag is encountered. It’s merely the attributes in the tag that get copied to the existing body element.

The attached patch fixes it, at least for this case (<body> when pos is already inside body). I made the fix conditional, just to keep the existing behaviour the same for other cases that I haven’t thought about.
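The difference between the two behaviours can be sketched with a toy model. This is only an illustration and not HTML::TreeBuilder's code: new_node, at_or_inside, build and outline are hypothetical helpers, nodes are plain hashes, and the $keep_pos flag stands in for the conditional in the attached patch.

```perl
use strict;
use warnings;

# Toy model only: NOT HTML::TreeBuilder's real code. A node is a hash with
# a tag, a parent link and a list of children.
sub new_node {
    my ($tag, $parent) = @_;
    my $node = { tag => $tag, parent => $parent, children => [] };
    push @{ $parent->{children} }, $node if $parent;
    return $node;
}

# True if $node has the given tag or has an ancestor with that tag.
sub at_or_inside {
    my ($node, $tag) = @_;
    for (my $n = $node; $n; $n = $n->{parent}) {
        return 1 if $n->{tag} eq $tag;
    }
    return 0;
}

# Feed a list of start tags to the toy builder. A true $keep_pos selects
# the HTML5-style handling of a repeated <body>: leave the position alone
# when it is already at or inside body.
sub build {
    my ($keep_pos, @tags) = @_;
    my $html = new_node('html');
    my $body = new_node('body', $html);
    my $pos  = $body;    # current insertion position
    for my $tag (@tags) {
        if ($tag eq 'body') {
            # Without $keep_pos, a repeated <body> always resets the position.
            $pos = $body unless $keep_pos && at_or_inside($pos, 'body');
            next;
        }
        my $node = new_node($tag, $pos);
        $pos = $node unless $tag eq 'input';    # <input> has no content
    }
    return $body;
}

# Render the nesting as tag(child,child,...).
sub outline {
    my ($node) = @_;
    my @kids = map { outline($_) } @{ $node->{children} };
    return @kids ? "$node->{tag}(" . join(',', @kids) . ")" : $node->{tag};
}

print outline(build(0, qw(form body input))), "\n";  # prints body(form,input)
print outline(build(1, qw(form body input))), "\n";  # prints body(form(input))
```

The first line corresponds to the output reported above, with the input as a sibling of the form; the second is the nesting the patch aims for.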
Subject: open_U1t48IiP.txt
Only in HTML-Tree-4.2-SxenFd: .DS_Store
diff -rup HTML-Tree-4.2-SxenFd-orig/lib/HTML/TreeBuilder.pm HTML-Tree-4.2-SxenFd/lib/HTML/TreeBuilder.pm
--- HTML-Tree-4.2-SxenFd-orig/lib/HTML/TreeBuilder.pm 2011-04-06 01:37:54.000000000 -0700
+++ HTML-Tree-4.2-SxenFd/lib/HTML/TreeBuilder.pm 2012-03-24 14:29:11.000000000 -0700
@@ -706,7 +706,8 @@ sub warning {
                 for ( keys %$attr ) {
                     $body->attr( $_, $attr->{$_} );
                 }
-                return $self->{'_pos'} = $body;    # bypass tweaking.
+                $self->{'_pos'} = $body unless $pos->is_inside('body');
+                return $self->{'_pos'};    # bypass tweaking.
 
                 #----------------------------------------------------------------------
             }
diff -rup HTML-Tree-4.2-SxenFd-orig/t/body.t HTML-Tree-4.2-SxenFd/t/body.t
--- HTML-Tree-4.2-SxenFd-orig/t/body.t 2011-04-06 01:37:54.000000000 -0700
+++ HTML-Tree-4.2-SxenFd/t/body.t 2012-03-24 14:28:53.000000000 -0700
@@ -3,7 +3,7 @@
 use warnings;
 use strict;
 
-use Test::More tests => 11;
+use Test::More tests => 12;
 
 BEGIN {
     use_ok('HTML::TreeBuilder');
@@ -89,3 +89,10 @@ RT_18571: {
         "<html><head></head><body><b>\$self->escape</b></body></html>"
     ) ;    # 3.22 compatability
 }
+
+{
+    my $root = HTML::TreeBuilder->new;
+    $root->parse('<form><body><input>');
+    ok $root->find('input')->is_inside('form'),
+        '<form><body> leaves <form> as the current parsing position';
+}
On Sat Mar 24 17:34:53 2012, SPROUT wrote:
> The attached patch fixes it, at least for this case
Actually, it should probably return $body, rather than $self->{_pos}. The return value of start is not used by HTML::TreeBuilder itself. I suppose it must be there so that subclasses can do something to the element after calling $elem = $self->SUPER::start(@_), in which case $body is the appropriate return value here.
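To illustrate why the return value matters to a subclass, here is a minimal sketch. Toy::Builder and Toy::Tagger are hypothetical stand-ins, not part of HTML-Tree, and elements are plain hashes; the only point is that a subclass which post-processes the result of SUPER::start touches the body element rather than whatever the current position happens to be.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tree builder; not HTML::TreeBuilder.
package Toy::Builder;

sub new {
    my ($class) = @_;
    my $body = { tag => 'body', attr => {} };
    return bless { _body => $body, _pos => $body }, $class;
}

sub start {
    my ($self, $tag, $attr) = @_;
    if ($tag eq 'body') {
        # Repeated <body>: copy attributes, keep the position unchanged,
        # and return the element the tag refers to (not _pos).
        my $body = $self->{_body};
        $body->{attr}{$_} = $attr->{$_} for keys %$attr;
        return $body;
    }
    my $elem = { tag => $tag, attr => { %$attr } };
    $self->{_pos} = $elem;
    return $elem;
}

# Hypothetical subclass that marks every element returned by start.
package Toy::Tagger;
our @ISA = ('Toy::Builder');

sub start {
    my $self = shift;
    my $elem = $self->SUPER::start(@_);
    $elem->{attr}{seen} = 1;
    return $elem;
}

package main;

my $builder = Toy::Tagger->new;
my $form    = $builder->start('form', {});
my $body    = $builder->start('body', { class => 'x' });

# The subclass received body, while the position is still the form.
print "$body->{tag} $builder->{_pos}{tag}\n";    # prints "body form"
```

Had start returned the current position instead, the subclass in this sketch would have marked the form a second time and never seen the body element.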