On 09/16/2011 11:47 AM, Stefan Hornburg via RT wrote:
> Fri Sep 16 05:47:37 2011: Request 71009 was acted upon.
> Transaction: Ticket created by HORNBURG
> Queue: XML-Twig
> Subject: Preserve doctype declaration from HTML documents
> Broken in: (no value)
> Severity: (no value)
> Owner: Nobody
> Requestors: racke@linuxia.de
> Status: new
> Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=71009 >
>
>
> I'm parsing HTML documents with XML::Twig starting with the
> following DOCTYPE declaration:
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
>
> But this declaration is missing from the output of $twig->sprint.
>
> Please consider changing the _html2xml method to carry over the
> declaration, e.g. replace
>
> my $xml= $tree->as_XML;
>
> with:
>
> my $xml;
>
> if (exists $tree->{_decl}) {
>     $xml = $tree->{_decl}->as_XML . $tree->as_XML;
> }
> else {
>     $xml = $tree->as_XML;
> }
>
> This is important as HTML documents without this declaration are really
> screwed up when viewed with IE.
>
> Regards
> Racke
It is a little more complicated than this, as the DOCTYPE declaration is
not necessarily well-formed. That might be why HTML::TreeBuilder doesn't
output it.
For example, if the doctype has a PUBLIC identifier but no system
identifier (the DTD URL), as in
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
which is quite common in HTML, the XML parser will die.
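
A minimal sketch of the problem, assuming XML::Twig is installed. The helper name parses_ok and the sample markup are made up for illustration; safe_parse is the XML::Twig method that wraps parse in an eval and returns false on failure.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: 1 if XML::Twig can parse the string, 0 if the parser dies.
sub parses_ok {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    return $twig->safe_parse($xml) ? 1 : 0;
}

my $body = '<html><head><title>t</title></head><body><p>hi</p></body></html>';

# PUBLIC identifier only: accepted in HTML, not well-formed as XML.
my $public_only =
    qq{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">\n$body};

# PUBLIC identifier followed by the system identifier: well-formed XML.
my $public_and_system =
    qq{<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n}
  . qq{ "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n$body};

printf "PUBLIC only:        %s\n", parses_ok($public_only)       ? 'parsed' : 'failed';
printf "PUBLIC plus system: %s\n", parses_ok($public_and_system) ? 'parsed' : 'failed';
```

The XML grammar requires a system literal after the public identifier, which is why the first document is rejected before any element is seen.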
It does seem to make sense to do this in XML::Twig, although you will
need to set the output_html_doctype option in XML::Twig->new to get the
behaviour you want. The main reason is that it is a bit of a pain to do,
so I might as well do it so no one else has to ;--(
The development version on github (https://github.com/mirod/xmltwig) and
at http://xmltwig.org/xmltwig/ implements that option, although it is
not thoroughly tested yet.
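
Roughly, using the option would look like the sketch below. It assumes a version of XML::Twig that has output_html_doctype and that HTML::TreeBuilder is installed (parse_html needs it); the helper name html_to_xml and the sample document are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: convert an HTML string to XML text, asking XML::Twig
# to carry the DOCTYPE declaration over into the output.
sub html_to_xml {
    my ($html) = @_;
    my $twig = XML::Twig->new( output_html_doctype => 1 );
    $twig->parse_html($html);    # conversion goes through HTML::TreeBuilder
    return $twig->sprint;
}

my $html = <<'HTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Test</title></head><body><p>Hello</p></body></html>
HTML

print html_to_xml($html), "\n";
```

Without the option the same call should give you the document with no DOCTYPE line, which is the behaviour reported in the ticket.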
Another option is to use HTML::Tidy instead of HTML::TreeBuilder to do
the HTML to XML conversion. Use the 'use_tidy' option when you create
the XML::Twig object. I have found HTML::Tidy a bit tricky to install,
but it often does a better job than HTML::TreeBuilder.
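
A short sketch of that route, assuming HTML::Tidy is installed. The helper name tidy_to_twig and the sample markup are made up for illustration, and whether the DOCTYPE survives depends on what tidy emits, so check the output for your documents.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse HTML via HTML::Tidy rather than HTML::TreeBuilder.
# Returns the twig, or undef when HTML::Tidy is not available.
sub tidy_to_twig {
    my ($html) = @_;
    return undef unless eval { require HTML::Tidy; 1 };
    my $twig = XML::Twig->new( use_tidy => 1 );
    $twig->parse_html($html);
    return $twig;
}

my $html = '<html><head><title>Test</title></head>'
         . '<body><p>Hello<br>world</body></html>';

if ( my $twig = tidy_to_twig($html) ) {
    print $twig->sprint, "\n";
}
else {
    print "HTML::Tidy is not installed\n";
}
```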
Let me know if this works for you.
--
mirod