On Sun Apr 25 03:41:26 2010, AVIAN wrote:
Show quoted text> I'm afraid the latest version from the git repository is still broken.
>
> In the test case I gave above as_XML() still returns invalid XML from a
> valid input.
The XML is valid, it is limited by the way HTML::Parser handles
entities. HTML::Parser will decode the '&' to '&' when it gets
parsed, so the tree HTML::TreeBuilder is seeing has the text '&foo;'.
The HTML output encodes output so the ampersand gets converted to &,
however &foo; is perfectly valid XML, so it gets output that way.
To stop HTML::Parser decoding the entities, and allow as_XML to see the
full input, you need to stop HTML::Parser from decoding entities. I
added ignore_entities option to allow this. Perhaps there is a better name?
perl -MHTML::TreeBuilder -E 'my $tree =
HTML::TreeBuilder->new(ignore_entities => 1); $tree->parse("<p>&foo;
&bar; ' &l</p>");print($tree->as_XML(), "\n");'
<html><head></head><body><p>&foo; &bar; ' &l</p></body></html>
The last ampersand is escaped since it doesn't have a trailing semicolon
and would be invalid.
Show quoted text> What I'm trying to tell here is that the current approach to XML entity
> references in as_XML() is severely broken. You cannot guess with a regex
> whether an & character in the "text" element must be escaped on output
> (i.e. is part of a proper "text" element) or not (is part of an
> unresolved entity reference). This is always doomed to fail - even if
> you manage to fix the regex in some way so that my test case works, I
> will always be able to find another case where it doesn't.
>
> A proper fix would:
>
> a) Properly detect XML entity references while parsing the input and
> store them as separate tree elements - all sane XML libraries do this.
This isn't an XML library, it's an HTML library. Converting HTML to XML
is always going to be inelegant. I think the option of allowing the user
to turn of entity decoding is as tidy as it's going to get.
Show quoted text> OR
>
> b) Say outright that you don't support XML entities beyond the trivial
> character entities. This is what my proposed patch does and since this
> is a library for dealing with HTML/XHTML documents it's probably a
> usable compromise.
Your patch re-encodes perfectly valid entities and turns them in to
string representations of the actual entity, which is deeply flawed and
would break a lot of current use of the module.
e.g. if my HTML has '1&2' then I want that rendered as '1&2' not
'1&2', since your patch converts '1&2' to '1&#38;2' it will
be rendered incorrectly.
Furthermore if I parsed a file, output it to another file, then parsed
the second file sometime later, the entities would be escaped again.
e.g. first run 1&2 gets converted to 1&#38;2, second run
1&#38;2 gets converted to 1&amp;#38;2
This is why I added the triple encoding check to t/escape.t, once
encoded correctly it should not be changed.