Bug #86633 for XML-Twig: safe_parse_html fails to parse valid HTML with entities (in some cases)

Tue Jul 02 08:47:40 2013 melmothx [...] gmail.com - Ticket created

Subject:	safe_parse_html fails to parse valid HTML with entities (in some cases)
Date:	Tue, 02 Jul 2013 14:46:23 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

Hello there! It looks like that entities (at least the very common '&') is mangled if it's followed by a letter. The test script below illustrates the problem, which contains perfectly valid HTML snippets. While testing, I found that adding to the method "_html2xml" this option: $tree->no_expand_entities(1); seems to fix the problem, but I'm not sure at all it will not trigger other problems or undesired behaviour. It's also possible the bug resides in HTML::TreeBuilder, but this I leave to you to decide. Best wishes Version used: XML::Twig is up to date. (3.44) HTML::TreeBuilder is up to date. (5.03) #!/usr/bin/env perl use strict; use warnings; use XML::Twig; use Test::More; plan tests => 2; my $parser = new XML::Twig (); my $value =<< 'EOF'; <h1>Here&there</h1> EOF my $html = $parser->safe_parse_html($value); print $@ if $@; ok($html); $value =<< 'EOF'; <h1>Here & there</h1> EOF $html = $parser->safe_parse_html($value); print $@ if $@; ok($html); __END__ -- Marco

Tue Jul 02 09:21:31 2013 xmltwig [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date:	Tue, 02 Jul 2013 13:21:11 +0000
To:	bug-XML-Twig [...] rt.cpan.org
From:	mirod <xmltwig [...] gmail.com>

It does look like an HTML::Element bug, because if you use HTML::Tidy as the html to xml converter the bug disappears. my $parser = new XML::Twig ( use_tidy => 1); and voilà! I will look into this though. I have never been really happy with HTML::Element's xml conversion, but since HTML::Tidy is a bit more of pain to install (or was at least), HTML::TreeBuilder is still the default option. Plus changing this would likely cause back-compatibility problems. -- michel On 07/02/2013 12:47 PM, Marco Pessotto via RT wrote: Show quoted text

> Tue Jul 02 08:47:40 2013: Request 86633 was acted upon. > Transaction: Ticket created by melmothx@gmail.com > Queue: XML-Twig > Subject: safe_parse_html fails to parse valid HTML with entities (in some cases) > Broken in: (no value) > Severity: (no value) > Owner: Nobody > Requestors: melmothx@gmail.com > Status: new > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=86633 > > > > > Hello there! > > It looks like that entities (at least the very common '&') is > mangled if it's followed by a letter. The test script below illustrates > the problem, which contains perfectly valid HTML snippets. > > While testing, I found that adding to the method "_html2xml" this option: > > $tree->no_expand_entities(1); > > seems to fix the problem, but I'm not sure at all it will not trigger > other problems or undesired behaviour. > > It's also possible the bug resides in HTML::TreeBuilder, but this I > leave to you to decide. > > Best wishes > > > Version used: > XML::Twig is up to date. (3.44) > HTML::TreeBuilder is up to date. (5.03) > > > #!/usr/bin/env perl > > use strict; > use warnings; > use XML::Twig; > use Test::More; > plan tests => 2; > > my $parser = new XML::Twig (); > my $value =<< 'EOF'; > <h1>Here&there</h1> > EOF > > > my $html = $parser->safe_parse_html($value); > print $@ if $@; > ok($html); > > $value =<< 'EOF'; > <h1>Here & there</h1> > EOF > > $html = $parser->safe_parse_html($value); > print $@ if $@; > ok($html); > > __END__ > >

Tue Jul 02 09:21:31 2013 The RT System itself - Status changed from 'new' to 'open'

Tue Jul 02 09:59:17 2013 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date:	Tue, 02 Jul 2013 15:57:47 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

"xmltwig@gmail.com via RT" <bug-XML-Twig@rt.cpan.org> writes: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=86633 > > > It does look like an HTML::Element bug, because if you use HTML::Tidy as > the html to xml converter the bug disappears. > > my $parser = new XML::Twig ( use_tidy => 1); > > and voilà! > > I will look into this though. I have never been really happy with > HTML::Element's xml conversion, but since HTML::Tidy is a bit more of > pain to install (or was at least), HTML::TreeBuilder is still the > default option. Plus changing this would likely cause back-compatibility > problems.

This indeed seems to fix the issue. Thanks a lot for the amazing fast reply! Best wishes -- Marco

Wed Jul 03 14:55:07 2013 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date:	Wed, 03 Jul 2013 20:53:35 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

"xmltwig@gmail.com via RT" <bug-XML-Twig@rt.cpan.org> writes: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=86633 > > > It does look like an HTML::Element bug, because if you use HTML::Tidy as > the html to xml converter the bug disappears. > > my $parser = new XML::Twig ( use_tidy => 1); > > and voilà! > > I will look into this though. I have never been really happy with > HTML::Element's xml conversion, but since HTML::Tidy is a bit more of > pain to install (or was at least), HTML::TreeBuilder is still the > default option. Plus changing this would likely cause back-compatibility > problems.

Actually, tidy has other issues and seems to eat (at least) the style attributes: I've updated the test script to look so: #!/usr/bin/env perl use strict; use warnings; use XML::Twig; use Test::More; plan tests => 6; my $tidy_parser = new XML::Twig ( use_tidy => 1); my $default_parser = new XML::Twig; my $value =<< 'EOF'; <h1>Here&there</h1> EOF my $html = $tidy_parser->safe_parse_html($value); print $@ if $@; ok($html, "tidy ok"); $html = $default_parser->safe_parse_html($value); print $@ if $@; ok($html, "default ok"); $value =<< 'EOF'; <h1 style="display:none">Here & there</h1> EOF $html = $default_parser->safe_parse_html($value); print $@ if $@; ok($html); $html = $tidy_parser->safe_parse_html($value); print $@ if $@; ok($html); $html = $tidy_parser->safe_parse_html($value); my @elts = $html->root()->get_xpath("//body"); is($elts[0]->first_child->{att}->{style}, "display:none", "style found with tidy converter"); $html = $default_parser->safe_parse_html($value); @elts = $html->root()->get_xpath("//body"); is($elts[0]->first_child->{att}->{style}, "display:none", "style found with default converter"); __END__ Best wishes -- Marco

Thu Jul 04 04:45:31 2013 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date:	Thu, 04 Jul 2013 10:43:59 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

"xmltwig@gmail.com via RT" <bug-XML-Twig@rt.cpan.org> writes: More info. This seems to happen only with the latest versions of Twig. Using the version provided by debian stable and running the provided test script I get: 1..6 ok 1 - tidy ok ok 2 - default ok ok 3 ok 4 not ok 5 - style found with tidy converter # Failed test 'style found with tidy converter' # at t/twig.t line 39. # got: undef # expected: 'display:none' ok 6 - style found with default converter Then upgrading HTML::Tree from 5.02 to 5.03 the test still works. Upgrading Twig from 3.39 (debian version) to 3.44 fails. I've tried to diff the two version, but I guess you're WAY more qualified than me to spot the bug. Best wishes. -- Marco

Mon Jul 22 12:18:35 2013 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date:	Mon, 22 Jul 2013 18:18:15 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

Further investigations led to the routine _xml_escape which is called by the local fork of HTML::Element::as_XML. _xml_escape doesn't escape entities "already escaped". But the XML parser is set to expand the entities, so we work with just plain text here. sub _xml_escape { my( $html)= @_; $html =~ s{&(?! # An ampersand that isn't followed by... ( \#[0-9]+; | # A hash mark, digits and semicolon, or \#x[0-9a-fA-F]+; | # A hash mark, "x", hex digits and semicolon, or [\w]+ # A valid unicode entity name and semicolon ) ) } {&}gx; # Needs to be escaped to amp In first place, the regexp seems wrong on the 4th line, because the semicolon seems missing. But if the parser returns us the string "&", how can we know if the source was "&amp;" or just "&" or "&"? Also, there is no guarantee that the unicode entity name is valid. Then if we have the string "by Marco&company" returned by the parser, which originated from the following legal string: "Marco&company", we get an invalid entity unescaped on which the parser will crash badly. To me the _xml_escape should looks so (IMVHO): sub _xml_escape { my( $html)= @_; # entities are already expanded in the treebuilder, so just escape them # simple character escapes # warn "escaping $html"; $html =~ s/&/&/g; $html =~ s/</</g; $html =~ s/>/>/g; $html =~ s/"/"/g; $html =~ s/'/'/g; # warn "returning $html"; return $html; } Of course, this could break something else I'm not aware of (unsure about the CDATA sections). Test script: use strict; use warnings; use XML::Twig; use Test::More; use HTML::TreeBuilder; plan tests => 2; # emulate the tree builder with the same options my $tree= HTML::TreeBuilder->new; $tree->ignore_ignorable_whitespace( 0); $tree->ignore_unknown( 0); $tree->no_space_compacting( 1); $tree->store_comments( 1); $tree->store_pis(1); $tree->parse("<h1>Marco&company</h1>"); $tree->eof; my $tb = $tree->as_XML; is ($tb, "<html><head></head><body><h1>Marco&company</h1></body></html>\n"); diag "Expecting: $tb;"; my $parser = XML::Twig->new(); my $html = $parser->safe_parse_html("<h1>Marco&company</h1>"); diag $@ if $@; is($html->sprint . "\n", $tb, "treebuilder and twig yield the same (with trailing linefeed)"); __END__ I hope this help. Best wishes -- Marco Pessotto

Tue Apr 01 05:56:59 2014 MIROD [...] cpan.org - Status changed from 'open' to 'resolved'

Tue Apr 01 05:57:00 2014 MIROD [...] cpan.org - Broken in 3.48 added