CC: jfm+filecopy [...] lexoid.com
Subject: Bug in built-in HTML parser -- loses text
Date: Sat, 24 Dec 2011 13:47:44 -0600
To: bug-HTML-Zoom [...] rt.cpan.org
From: Jim Miner <m-rt.cpan.org-98jw3v [...] lexoid.com>
Bug report for HTML-Zoom-0.009006 / perl v5.8.8 / (OSX 10.5 & RHEL 4)
HTML::Zoom's built-in HTML parser loses input text in two cases:
- The entire input is lost if the input contains no tag.
- Leading text (before the first tag) is lost if the first tag is
modified, e.g., by remove_attribute, as the script below does.
This is a problem when, e.g., replacing content with an HTML fragment
that contains no markup or that begins with plain text.
Included below:
- script demonstrating the bug.
- output of the script.
- patch.
-------------- script --------------
#!/usr/bin/perl
use strictures 1;
use HTML::Zoom;

my @data = (
    'text',
    'text<tag a="x">',
);

print "--- pass-through ---\n";
foreach my $in ( @data ) {
    my $out = HTML::Zoom->from_html($in)->to_html;
    print "in: $in\n", "out: $out\n", "\n";
}

print "--- remove_attribute('a') ---\n";
foreach my $in ( @data ) {
    my $z = HTML::Zoom->from_html($in);
    $z = $z->select('tag')->remove_attribute('a');
    my $out = $z->to_html;
    print "in: $in\n", "out: $out\n", "\n";
}
-------------- script output --------------
--- pass-through ---
in: text
out:

in: text<tag a="x">
out: text<tag a="x">

--- remove_attribute('a') ---
in: text
out:

in: text<tag a="x">
out: <tag>
-------------- patch --------------
*** HTML-Zoom-0.009006/lib/HTML/Zoom/Parser/BuiltIn.pm 2011-03-27 09:23:14.000000000 -0500
--- HTML-Zoom-0.009006/lib/HTML/Zoom/Parser/BuiltIn-PATCHED.pm 2011-12-16 00:26:18.000000000 -0600
***************
*** 18,23 ****
--- 18,27 ----
sub _hacky_tag_parser {
my ($text, $handler) = @_;
+ $text =~ m{^([^<]*)}g;
+ if ( length $1 ) { # leading PCDATA
+ $handler->({ type => 'TEXT', raw => $1 });
+ }
while (
$text =~ m{
(
***************
*** 109,111 ****
--- 113,122 ----
sub html_unescape { _simple_unescape($_[1]) }
1;
+
+ __END__
+
+ Modification 2011-12-15 by Jim Miner
+ Don't throw away leading PCDATA in $text, in _hacky_tag_parser().
+ This is important so we can use from_html and replace_content to
+ insert fragments with or without markup into templates.
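
-------------- standalone sketch (not part of the patch) --------------
For anyone who wants to see the idea without installing HTML::Zoom, the
following is a minimal sketch of the same approach: emit any text before
the first '<' as a TEXT event, then let the tag loop continue from that
position. The names toy_tag_parser and round_trip are hypothetical and
the tokenizer is a toy stand-in, not the real _hacky_tag_parser; it only
assumes the same event shape ({ type => ..., raw => ... }) as the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the parser (hypothetical, not HTML::Zoom code).
# It splits input into TEXT and TAG events and hands each to $handler.
sub toy_tag_parser {
    my ($text, $handler) = @_;

    # Leading PCDATA: everything before the first '<'.
    # /g in scalar context records pos(), so the loop below resumes there;
    # /c keeps pos() untouched if this match fails.
    if ( $text =~ m{\A([^<]+)}gc ) {
        $handler->({ type => 'TEXT', raw => $1 });
    }

    # Each iteration consumes one "<...>" plus any text that follows it.
    # An unterminated '<' ends the loop; this toy does not try to recover.
    while ( $text =~ m{\G(<[^>]*>)([^<]*)}gc ) {
        $handler->({ type => 'TAG',  raw => $1 });
        $handler->({ type => 'TEXT', raw => $2 }) if length $2;
    }
    return;
}

# Concatenate the raw text of every event, like a pass-through to_html.
sub round_trip {
    my ($in) = @_;
    my $out = '';
    toy_tag_parser($in, sub { $out .= $_[0]{raw} });
    return $out;
}

for my $in ( 'text', 'text<tag a="x">' ) {
    my $out = round_trip($in);
    print "in: $in\n", "out: $out\n", "\n";
}
```

With the leading-text step in place, both test inputs from the script
above come back unchanged from this toy round trip. Removing the first
"if" block makes the tag-free input come back empty, which mirrors the
first failure in the report.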