This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 71009
Status: resolved
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: racke [...] linuxia.de
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: 3.39



Subject: Preserve doctype declaration from HTML documents
I'm parsing HTML documents with XML::Twig starting with the following DOCTYPE declaration:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

But this declaration is missing from the output of $twig->sprint.

Please consider changing the _html2xml method to carry over the declaration, e.g. replace

    my $xml= $tree->as_XML;

with:

    my $xml;

    if (exists $tree->{_decl}) {
        $xml = $tree->{_decl}->as_XML . $tree->as_XML;
    }
    else {
        $xml = $tree->as_XML;
    }

This is important as HTML documents without this declaration are really screwed up when viewed with IE.

Regards
Racke
Subject: Re: [rt.cpan.org #71009] Preserve doctype declaration from HTML documents
Date: Fri, 16 Sep 2011 15:58:37 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>
On 09/16/2011 11:47 AM, Stefan Hornburg via RT wrote:
> Please consider changing the _html2xml method to carry over the
> declaration [...]
It is a little more complicated than this, as the DOCTYPE declaration is not necessarily well-formed. That might be why HTML::TreeBuilder doesn't output it.

For example, if the doctype doesn't have the SYSTEM part, as in

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

which is quite common, this will cause the parser to die.

It seems to make sense to do this in XML::Twig though, although you will need to set the output_html_doctype option in XML::Twig->new to get the behaviour you want. The main reason is that it is a bit of a pain to do, so I might as well do it so no one else has to ;--(

The development version on github (https://github.com/mirod/xmltwig) and at http://xmltwig.org/xmltwig/ implements that option, although it is not thoroughly tested yet.

Another option is to use HTML::Tidy instead of HTML::TreeBuilder to do the HTML to XML conversion. Use the 'use_tidy' option when you create the XML::Twig object. I have found HTML::Tidy a bit tricky to install, but it often does a better job than HTML::TreeBuilder.

Let me know if this works for you.

--
mirod
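For readers landing on this ticket, here is a minimal sketch of how the two options mentioned above might be used. It assumes an XML::Twig version that implements output_html_doctype (the ticket lists 3.39 as the fixed version) and that HTML::TreeBuilder is installed. The sample document and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative input only: an XHTML page with a complete DOCTYPE
# (both the public and the system identifier are present).
my $html = <<'HTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Test</title></head><body><p>Hello</p></body></html>
HTML

# Ask XML::Twig to carry the HTML DOCTYPE over into its output.
# The conversion itself is done by HTML::TreeBuilder.
my $twig = XML::Twig->new( output_html_doctype => 1 );
$twig->parse_html($html);
print $twig->sprint;

# Alternative mentioned in the reply: let HTML::Tidy do the conversion
# instead (requires HTML::Tidy to be installed).
# my $tidy_twig = XML::Twig->new( use_tidy => 1 );
# $tidy_twig->parse_html($html);
```

Whether the declaration appears in the output presumably still depends on the DOCTYPE in the input being parsable, as the reply above explains.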
Subject: Re: [rt.cpan.org #71009] Preserve doctype declaration from HTML documents
Date: Fri, 16 Sep 2011 16:06:55 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: "Stefan Hornburg (Racke)" <racke [...] linuxia.de>
On 09/16/2011 04:00 PM, xmltwig@gmail.com via RT wrote:
> [...]
> The development version on github (https://github.com/mirod/xmltwig) and
> at http://xmltwig.org/xmltwig/ implements that option, although it is
> not thoroughly tested yet.
> [...]
> let me know if this works for you.
Great! Thanks a lot for the quick answer, I'll try it out.

Regards
Racke

--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team
Subject: Re: [rt.cpan.org #71009] Preserve doctype declaration from HTML documents
Date: Sun, 18 Sep 2011 15:31:47 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: "Stefan Hornburg (Racke)" <racke [...] linuxia.de>
On 09/16/2011 04:00 PM, xmltwig@gmail.com via RT wrote:
> [...] you will
> need to set the output_html_doctype option in XML::Twig->new to get the
> behaviour you want.
> [...]
> let me know if this works for you.
output_html_doctype option works fine for me, thanks a lot!

Regards
Racke

--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team