Subject: HTML::TreeBuilder parses text-only HTML improperly without trailing whitespace
Date: Thu, 8 Dec 2016 14:29:53 -0500
To: bug-HTML-Tree [...] rt.cpan.org
From: Jon Rubin <jon.rubin [...] grantstreet.com>
When parsing HTML that consists only of text with no trailing
whitespace, HTML::TreeBuilder returns incorrect results:
# No whitespace
1. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b =
HTML::TreeBuilder->new; $b->parse("text"); dd $b->guts;'
()
# Trailing whitespace
2. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b =
HTML::TreeBuilder->new; $b->parse("text "); dd $b->guts;'
"text"
# Leading whitespace
3. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b =
HTML::TreeBuilder->new; $b->parse(" text"); dd $b->guts;'
()
# Middle whitespace
4. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b =
HTML::TreeBuilder->new; $b->parse("text more"); dd $b->guts;'
"text"
# Middle and Trailing whitespace
5. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b =
HTML::TreeBuilder->new; $b->parse("text text "); dd $b->guts;'
Cases 1, 3, and 4 show omissions from the returned text, but adding
trailing whitespace to them corrects the problem.
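One possibility worth ruling out, offered as an assumption rather than a confirmed diagnosis: the HTML::Parser documentation says that eof() flushes any remaining buffered text, and none of the one-liners above call it, so the missing text may simply still be buffered. Below is a minimal sketch to test that. parsed_guts is a hypothetical helper name, not part of HTML::Tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of HTML::Tree): parse $html as a single
# chunk and return what guts() reports, as plain strings. When $call_eof
# is true, eof() is called after parse() to signal the end of the document.
sub parsed_guts {
    my ($html, $call_eof) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof if $call_eof;
    # Text nodes come back as plain strings; stringify any element nodes
    # before the tree is released.
    my @guts = map { ref($_) ? $_->as_HTML : $_ } $tree->guts;
    $tree->delete;
    return @guts;
}

# Run the five inputs from the cases above, with and without eof().
for my $html ("text", "text ", " text", "text more", "text text ") {
    for my $call_eof (0, 1) {
        my @guts = parsed_guts($html, $call_eof);
        printf "%-14s eof=%d => (%s)\n", qq{"$html"}, $call_eof,
            join(", ", map { sprintf '"%s"', $_ } @guts);
    }
}
```

If the eof=1 rows show the full text, the missing text was only buffered. If they still show omissions, the problem lies elsewhere.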
Unfortunately, my XS-fu is not up to snuff, so I can't provide a patch.
Distribution: HTML-Tree-5.03
Perl Version: v5.22.2
OS: Linux/CentOS 6, more specifically:
]$ uname -a
Linux pexdev002-dev3.grantstreet.com 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed
Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Thanks in advance!
Jon
--
Jon Rubin
Grant Street Group
Ph: (412) 391-5555, Ext. 1323