Bug #73470 for HTML-Zoom: Bug in built-in HTML parser -- loses text

Sat Dec 24 14:47:56 2011 m-rt.cpan.org-98jw3v [...] lexoid.com - Ticket created

CC:	jfm+filecopy [...] lexoid.com
Subject:	Bug in built-in HTML parser -- loses text
Date:	Sat, 24 Dec 2011 13:47:44 -0600
To:	bug-HTML-Zoom [...] rt.cpan.org
From:	Jim Miner <m-rt.cpan.org-98jw3v [...] lexoid.com>

Bug report for HTML-Zoom-0.009006 / perl v5.8.8 / (OSX 10.5 & RHEL 4)

The built-in HTML parser causes input text to be lost by HTML::Zoom
under some conditions.

- The entire input is lost if the input contains no tag.
- Leading text (before the first tag) is lost if the first tag is
  modified, e.g., by replace_attribute.

This is a problem when, e.g., replacing content with an HTML fragment.

Below find:
- script demonstrating the bug.
- output of the script.
- patch.

-------------- script --------------
#!/usr/bin/perl

use strictures 1;
use HTML::Zoom;

my @data = (
    'text',
    'text<tag a="x">',
);

print "--- pass-through ---\n";
foreach my $in ( @data ) {
    my $out = HTML::Zoom->from_html($in)->to_html;
    print "in: $in\n", "out: $out\n", "\n";
}

print "--- remove_attribute('a') ---\n";
foreach my $in ( @data ) {
    my $z = HTML::Zoom->from_html($in);
    $z = $z->select('tag')->remove_attribute('a');
    my $out = $z->to_html;
    print "in: $in\n", "out: $out\n", "\n";
}

-------------- script output --------------
--- pass-through ---
in: text
out:

in: text<tag a="x">
out: text<tag a="x">

--- remove_attribute('a') ---
in: text
out:

in: text<tag a="x">
out: <tag>

-------------- patch --------------
*** HTML-Zoom-0.009006/lib/HTML/Zoom/Parser/BuiltIn.pm	2011-03-27 09:23:14.000000000 -0500
--- HTML-Zoom-0.009006/lib/HTML/Zoom/Parser/BuiltIn-PATCHED.pm	2011-12-16 00:26:18.000000000 -0600
***************
*** 18,23 ****
--- 18,27 ----
  
  sub _hacky_tag_parser {
    my ($text, $handler) = @_;
+   $text =~ m{^([^<]*)}g;
+   if ( length $1 ) { # leading PCDATA
+       $handler->({ type => 'TEXT', raw => $1 });
+   }
    while (
      $text =~ m{
        (
***************
*** 109,111 ****
--- 113,122 ----
  sub html_unescape { _simple_unescape($_[1]) }
  
  1;
+ 
+ __END__
+ 
+ Modification 2011-12-15 by Jim Miner
+ Don't throw away leading PCDATA in $text, in _hacky_tag_parser().
+ This is important so we can use from_html and replace_content to
+ insert fragments with or without markup into templates.
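
For illustration only (not part of the submitted patch): a minimal standalone sketch of the leading-text capture that the patch adds, using core Perl only. The helper name split_leading_text is hypothetical and does not exist in HTML::Zoom. Unlike the patch, the sketch does not use /g to leave the match position for a following loop; it simply returns the leading text and the remainder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Zoom): split off the text that
# precedes the first '<', mirroring the capture the patch performs before
# the tag-matching loop runs.
sub split_leading_text {
    my ($text) = @_;
    # [^<]* can match the empty string, so this match always succeeds.
    my ($leading) = $text =~ m{^([^<]*)};
    return ( $leading, substr( $text, length $leading ) );
}

for my $in ( 'text', 'text<tag a="x">', '<tag a="x">' ) {
    my ( $leading, $rest ) = split_leading_text($in);
    print "in: $in\n", "  leading: '$leading'\n", "  rest:    '$rest'\n";
}
```

An input with no tag at all yields the whole string as leading text, which is the case the unpatched parser drops entirely.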

Sun Feb 24 18:06:35 2013 cpan [...] papercreatures.com - Taken

Thu Feb 28 07:09:17 2013 cpan [...] papercreatures.com - Status changed from 'new' to 'resolved'