
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 119186
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: jon.rubin [...] grantstreet.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML::TreeBuilder parses text-only HTML improperly without trailing whitespace
Date: Thu, 8 Dec 2016 14:29:53 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Jon Rubin <jon.rubin [...] grantstreet.com>
When attempting to parse HTML consisting of only text, and no trailing whitespace, HTML::TreeBuilder returns incorrect results:

# No whitespace
1. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text"); dd $b->guts;'
()

# Trailing whitespace
2. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text "); dd $b->guts;'
"text"

# Leading whitespace
3. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse(" text"); dd $b->guts;'
()

# Middle whitespace
4. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text more"); dd $b->guts;'
"text"

# Middle and Trailing whitespace
5. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text text "); dd $b->guts;'

Cases 1, 3, and 4 show omissions from the returned text, but adding trailing whitespace to them corrects the problem.

Unfortunately my XS-fu is not up to snuff enough to provide a patch.

Distribution: HTML-Tree-5.03
Perl Version: v5.22.2
OS: Linux/Centos6, more specifically:
]$ uname -a
Linux pexdev002-dev3.grantstreet.com 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance!
Jon

--
Jon Rubin
Grant Street Group
Ph: (412) 391-5555, Ext. 1323
It's probably just buffered, as HTML::Parser won't expect the input to stop there. Try calling $b->eof before calling guts.
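A minimal sketch of that suggestion, assuming the same text-only input as case 1 of the report. The variable names are illustrative; parse, eof, and guts are the HTML::TreeBuilder methods already used in this ticket.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Feed the text-only input through parse, as in the report.
my $tree = HTML::TreeBuilder->new;
$tree->parse("text");

# Signal end of input so that any text HTML::Parser is still
# holding in its buffer gets delivered to the tree.
$tree->eof;

# In list context, guts returns the topmost non-implicit nodes.
my @nodes = $tree->guts;
print "node: $_\n" for @nodes;
```

Without the eof call, trailing text can still be sitting in the parser's buffer when guts is called, which matches the empty results in the report.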
Subject: Re: [rt.cpan.org #119186] HTML::TreeBuilder parses text-only HTML improperly without trailing whitespace
Date: Mon, 12 Dec 2016 13:11:23 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Jon Rubin <jon.rubin [...] grantstreet.com>
Ah, that fixes my problems. Is there a reason HTML::TreeBuilder lets me call guts at all when the tree is in an incomplete state? Is there a different accessor I should be calling instead of guts for that?

Thanks,
Jon

On Mon, Dec 12, 2016 at 4:36 AM, Jeff Fearn via RT <bug-HTML-Tree@rt.cpan.org> wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=119186 >
>
> It's probably just buffered as HTML::Parser won't expect the input to stop
> there, try calling $b->eof before calling guts.
>
-- Jon Rubin Grant Street Group Ph: (412) 391-5555, Ext. 1323
Probably the correct method is new_from_content, which will call eof. Not sure if there is a way to detect this, as it's HTML::Parser's buffer that hasn't been flushed, not HTML::*'s.
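A short sketch of the new_from_content route, assuming the same text-only input as before. The variable names $root and @top are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# new_from_content builds the object, parses each argument,
# and then calls eof, so no separate eof call is needed.
my $root = HTML::TreeBuilder->new_from_content("text");

# The parse is already complete here, so guts sees the buffered text too.
my @top = $root->guts;
print "node: $_\n" for @top;
```

This keeps the incomplete-tree state out of reach of the calling code, since the object is never handed back before eof has run.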