Bug #55835 for HTML-Tree: as_XML can create invalid escapes

Tue Mar 23 06:49:40 2010 tomaz.solc [...] tablix.org - Ticket created

Subject:

as_XML can create invalid escapes

Hi as_XML sometimes forgets to escape the & character. The _xml_escape function seems to have a special case that causes that. Is there a reason for this "feature"? If & was escaped in the input, it should also be escaped in the output. Thanks Tomaž Example: use HTML::TreeBuilder; my $tree = HTML::TreeBuilder->new_from_content("&foo;"); print $tree->as_XML(); # Wrong: # <html><head></head><body>&foo;</body></html> print $tree->as_HTML(); # Correct: # <html><head></head><body>&foo;</body></html>

Thu Apr 08 04:56:29 2010 tomaz.solc [...] tablix.org - Correspondence added

I'm attaching a patch that I've been using to fix this issue. It also makes more human-friendly escapes in as_XML output.

Subject:

tomaz-fix-xml-escape

Download tomaz-fix-xml-escape
application/octet-stream 5.5k

Message body not shown because it is not plain text.

Thu Apr 08 04:56:30 2010 tomaz.solc [...] tablix.org - Status changed from 'new' to 'open'

Sat Apr 24 01:07:12 2010 Jeff.Fearn [...] gmail.com - Correspondence added

Hi Tomaž, thanks for the patch, unfortunately it was escaping content that had already been escaped, which is wrong. I've applied a patch that escapes un-escaped content correctly and which does not affect escaped content. http://github.com/jfearn/HTML-Tree

Sat Apr 24 01:07:13 2010 Jeff.Fearn [...] gmail.com - Status changed from 'open' to 'patched'

Sat Apr 24 01:07:13 2010 Jeff.Fearn [...] gmail.com - Given to jfearn

Sat Apr 24 13:41:26 2010 tomaz.solc [...] tablix.org - Correspondence added

I'm afraid the latest version from the git repository is still broken. In the test case I gave above as_XML() still returns invalid XML from a valid input. What I'm trying to tell here is that the current approach to XML entity references in as_XML() is severely broken. You cannot guess with a regex whether an & character in the "text" element must be escaped on output (i.e. is part of a proper "text" element) or not (is part of an unresolved entity reference). This is always doomed to fail - even if you manage to fix the regex in some way so that my test case works, I will always be able to find another case where it doesn't. A proper fix would: a) Properly detect XML entity references while parsing the input and store them as separate tree elements - all sane XML libraries do this. OR b) Say outright that you don't support XML entities beyond the trivial character entities. This is what my proposed patch does and since this is a library for dealing with HTML/XHTML documents it's probably a usable compromise.

Sat Apr 24 13:41:28 2010 The RT System itself - Status changed from 'patched' to 'open'

Sat Apr 24 21:57:23 2010 Jeff.Fearn [...] gmail.com - Correspondence added

On Sun Apr 25 03:41:26 2010, AVIAN wrote: Show quoted text

> I'm afraid the latest version from the git repository is still broken. > > In the test case I gave above as_XML() still returns invalid XML from a > valid input.

The XML is valid, it is limited by the way HTML::Parser handles entities. HTML::Parser will decode the '&' to '&' when it gets parsed, so the tree HTML::TreeBuilder is seeing has the text '&foo;'. The HTML output encodes output so the ampersand gets converted to &, however &foo; is perfectly valid XML, so it gets output that way. To stop HTML::Parser decoding the entities, and allow as_XML to see the full input, you need to stop HTML::Parser from decoding entities. I added ignore_entities option to allow this. Perhaps there is a better name? perl -MHTML::TreeBuilder -E 'my $tree = HTML::TreeBuilder->new(ignore_entities => 1); $tree->parse("<p>&foo; &bar; ' &l</p>");print($tree->as_XML(), "\n");' <html><head></head><body><p>&foo; &bar; ' &l</p></body></html> The last ampersand is escaped since it doesn't have a trailing semicolon and would be invalid. Show quoted text

> What I'm trying to tell here is that the current approach to XML entity > references in as_XML() is severely broken. You cannot guess with a regex > whether an & character in the "text" element must be escaped on output > (i.e. is part of a proper "text" element) or not (is part of an > unresolved entity reference). This is always doomed to fail - even if > you manage to fix the regex in some way so that my test case works, I > will always be able to find another case where it doesn't. > > A proper fix would: > > a) Properly detect XML entity references while parsing the input and > store them as separate tree elements - all sane XML libraries do this.

This isn't an XML library, it's an HTML library. Converting HTML to XML is always going to be inelegant. I think the option of allowing the user to turn of entity decoding is as tidy as it's going to get. Show quoted text

> OR > > b) Say outright that you don't support XML entities beyond the trivial > character entities. This is what my proposed patch does and since this > is a library for dealing with HTML/XHTML documents it's probably a > usable compromise.

Your patch re-encodes perfectly valid entities and turns them in to string representations of the actual entity, which is deeply flawed and would break a lot of current use of the module. e.g. if my HTML has '1&2' then I want that rendered as '1&2' not '1&2', since your patch converts '1&2' to '1&#38;2' it will be rendered incorrectly. Furthermore if I parsed a file, output it to another file, then parsed the second file sometime later, the entities would be escaped again. e.g. first run 1&2 gets converted to 1&#38;2, second run 1&#38;2 gets converted to 1&amp;#38;2 This is why I added the triple encoding check to t/escape.t, once encoded correctly it should not be changed.

Sat Apr 24 21:57:25 2010 Jeff.Fearn [...] gmail.com - Status changed from 'open' to 'patched'

Sun Apr 25 05:55:23 2010 tomaz.solc [...] tablix.org - Correspondence added

Show quoted text

> The XML is valid, it is limited by the way HTML::Parser handles > entities. HTML::Parser will decode the '&' to '&' when it gets > parsed, so the tree HTML::TreeBuilder is seeing has the text '&foo;'. > The HTML output encodes output so the ampersand gets converted to &, > however &foo; is perfectly valid XML, so it gets output that way.

The problem is that as far as HTML::TreeBuilder is concerned, "&foo;" and "&foo;" are equivalent. They should not be. Show quoted text

> Your patch re-encodes perfectly valid entities and turns them in to > string representations of the actual entity, which is deeply flawed and > would break a lot of current use of the module. > > e.g. if my HTML has '1&2' then I want that rendered as '1&2' not > '1&2', since your patch converts '1&2' to '1&#38;2' it will > be rendered incorrectly.

Have you actually tried my patch? Because it does not work that way. Try it: use HTML::TreeBuilder; $x1 = HTML::TreeBuilder->new_from_content("1&2"); $y1 = $x1->as_XML(); print $y1; $x2 = HTML::TreeBuilder->new_from_content("$y1"); $y2 = $x2->as_XML(); print $y2; <html><head></head><body>1&2</body></html> <html><head></head><body>1&2</body></html> Show quoted text

> Furthermore if I parsed a file, output it to another file, then parsed > the second file sometime later, the entities would be escaped again.

Yes! That's exactly the point I'm trying to make here! With my patch, that DOES NOT happen. With my patch "&foo;" remains "&foo;" after parsing and outputing. Compare that to your version, which converts "&foo;" into "&foo;", which has a completely different meaning.

Sun Apr 25 05:55:25 2010 The RT System itself - Status changed from 'patched' to 'open'

Tue May 04 06:51:45 2010 Jeff.Fearn [...] gmail.com - Correspondence added

On Sun Apr 25 19:55:23 2010, AVIAN wrote: Show quoted text

> > The XML is valid, it is limited by the way HTML::Parser handles > > entities. HTML::Parser will decode the '&' to '&' when it gets > > parsed, so the tree HTML::TreeBuilder is seeing has the text '&foo;'. > > The HTML output encodes output so the ampersand gets converted to &, > > however &foo; is perfectly valid XML, so it gets output that way.

> > The problem is that as far as HTML::TreeBuilder is concerned, > "&foo;" and "&foo;" are equivalent. They should not be.

This is true if you pass them through HTML::Parser and decode the entities, in HTML they ARE the same. '&foo;' in HTML has no meaning and is interpreted as an un-encoded string, where as the & is decoded to '&', leaving you with the exact same string. When it gets to being encoded this bug exhibits because the regex says "hey '&foo;' is valid XML, I'd better not touch that!" So it is wrong when you have mixed encoded and un-encoded content in HTML source. Conversely, your patch fails if you have real XML internal entities in the content you pass to as_XML. e.g. if I'm using XML::TreeBuilder and call as_XML with 'email me at <&email;>' then what I expect back is 'email me at <&email;>' where as your patch will give me 'email me at <&email;>' which is not correct. I'm looking at putting in a switch to change the behaviour, but to default to your behaviour, which seems reasonable to me, I'd need to get XML::TreeBuilder patched. I've tried to contact the maintainer to get added as a co-maintainer, but I haven't heard back from him yet. If I can't reach him then I'll proably default to my approach since it doesn't break the current version of XML::TreeBuilder. I'll let you know how my search goes :)

Tue Sep 21 22:38:00 2010 Jeff.Fearn [...] gmail.com - Correspondence added

Subject:

4.0 released

Hi HTML::Tree ve4rsion 4.0 has been released which includes a fix for this issue. Cheers, Jeff.

Tue Sep 21 22:38:01 2010 Jeff.Fearn [...] gmail.com - Status changed from 'open' to 'resolved'