Bug #71839 for XML-Twig: Wishlist: more control over parse_html (using HTML::TreeBuilder)

Fri Oct 21 11:18:46 2011 ambrus [...] math.bme.hu - Ticket created

Subject:	Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Fri, 21 Oct 2011 17:18:07 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	Zsbán Ambrus <ambrus [...] math.bme.hu>

Hello,

I'd like to ask you to add a way in XML::Twig for the user to have more control over HTML parsing with HTML::TreeBuilder.

I'd like to parse some HTML to an XML::Twig, but want to unset the ignore_unknown option of HTML::TreeBuilder. There seems to be no easy way to do this, because I never get to touch the TreeBuilder object if I use the parse_html method of Twig, and the parse_html method does some arcane fixups on the TreeBuilder object, so I can't just emulate the whole procedure in user code.

Thus, it would be useful if XML::Twig had some interface for using HTML::TreeBuilder that's more general than the parse_html method.

I'm not sure what this interface should look like, but here's an idea. First, I'd call an object method of a Twig that constructs a HTML::TreeBuilder object preset with the default options for later use of this Twig. I would then call the parse_content or parse_file method on the TreeBuilder, and possibly make any other changes before or after the parse. Then I'd call a second method of the Twig which loads the tree from the TreeBuilder into the Twig.

I'd like to ask for your opinion on whether such an interface would make sense. If yes, I might try to write a patch for it (but I don't promise anything).

I am using XML::Twig version 3.39, whose configuration information I attach at the bottom. (Btw, why does that printout not include the version number of Twig itself?)
Ambrus

-----------

Configuration:

perl: 5.014002
OS: linux - x86_64-linux

required
XML::Parser              : 2.41
expat                    : <no version information found>

Strongly Recommended
Scalar::Util             : 1.23 (for improved memory management)
Encode                   : 2.42_01 (for encoding conversions)

Modules providing additional features
XML::XPathEngine         : <not available> (to use XML::Twig::XPath)
XML::XPath               : 1.13 (to use XML::Twig::XPath if Tree::XPathEngine not available)
LWP                      : 6.02 (for the parseurl method)
HTML::TreeBuilder        : 4.2 (to use parse_html and parsefile_html)
HTML::Entities::Numbered : <not available> (to allow parsing of HTML containing named entities)
HTML::Tidy               : <not available> (to use parse_html and parsefile_html with the use_tidy option)
HTML::Entities           : 3.69 (for the html_encode filter)
Tie::IxHash              : <not available> (for the keep_atts_order option)
Text::Wrap               : 2009.0305 (to use the "wrapped" option for pretty_print)

Modules used only by the auto tests
Test                     : 1.25_02
Test::Pod                : 1.45
XML::Simple              : <not available>
XML::Handler::YAWriter   : <not available>
XML::SAX::Writer         : <not available>
XML::Filter::BufferText  : <not available>
IO::Scalar               : <not available>

Please add this information to bug reports (you can run t/zz_dump_config.t to get it)

if you are upgrading the module from a previous version, make sure you read the Changes file for bug fixes, new features and the occasional COMPATIBILITY WARNING

1..1
ok 1
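The two-step flow proposed in the message above can be approximated today in user code. The following is only a minimal sketch under stated assumptions: the helper names new_html_builder and twig_from_html_builder are hypothetical and not part of XML::Twig, and the conversion goes straight through as_XML, skipping the fixups that parse_html applies, so it is only likely to work on simple, well-behaved input.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::Twig;

# Hypothetical helper (not an XML::Twig method): step 1 of the proposed flow.
# Returns a fresh TreeBuilder that the caller may reconfigure before parsing.
sub new_html_builder {
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_unknown(0);    # keep elements TreeBuilder does not know
    return $tree;
}

# Hypothetical helper: step 2, load the parsed HTML tree into a twig.
# Serializes with as_XML and parses the result; none of the parse_html
# fixups are applied here.
sub twig_from_html_builder {
    my ($twig, $tree) = @_;
    my $xml = $tree->as_XML;
    $tree->delete;               # free the HTML::Element tree
    $twig->parse($xml);
    return $twig;
}

my $tree = new_html_builder();
$tree->parse_content('<html><body><p>Hello <custom>world</custom></p></body></html>');

my $twig = twig_from_html_builder(XML::Twig->new, $tree);
print $twig->root->first_descendant('custom')->text, "\n";
```

Keeping the TreeBuilder in the caller's hands between the two steps is the point of the proposal: any option can be changed before or after the parse without XML::Twig having to expose each one.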

Fri Oct 21 11:27:36 2011 xmltwig [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #71839] Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Fri, 21 Oct 2011 17:27:19 +0200
To:	bug-XML-Twig [...] rt.cpan.org
From:	mirod <xmltwig [...] gmail.com>

Hi,

That's a good idea. I have to think about the interface. Probably allow passing a HTML::TreeBuilder object to the constructor. The object could be constructed by an XML::Twig class method, then modified by the user and passed to XML::Twig::new.

Let me think about it some more, or have a try at it if you want.

And adding the version of XML::Twig to t/zz_config.t is a good idea, I'll add it.

Thanks a lot.

--
mirod

Fri Oct 21 11:27:37 2011 The RT System itself - Status changed from 'new' to 'open'

Wed Nov 16 10:13:36 2011 ambrus [...] math.bme.hu - Correspondence added

Subject:	Re: [rt.cpan.org #71839] Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Wed, 16 Nov 2011 16:12:55 +0100
To:	bug-XML-Twig [...] rt.cpan.org
From:	Zsbán Ambrus <ambrus [...] math.bme.hu>

As I'd like to write a patch for this, let me ask a few questions.

On Fri, Oct 21, 2011 at 5:27 PM, xmltwig@gmail.com via RT <bug-XML-Twig@rt.cpan.org> wrote:

> Probably allow
> passing a HTML::TreeBuilder object to the constructor. The object could
> be constructed by an XML::Twig class method, then modified by the user
> and passed to XML::Twig::new.

I'm not sure I like that idea. In the future, you could want some Twig constructor options that affect parsing with the HTML parser, and that need to be used at the time you construct the HTML parser. Though I'm more thinking of parse options that affect XML and HTML parsing the same way, it turns out that in fact you already have a constructor option TidyOptions doing this. Thus it would make more sense if you created the TreeBuilder object after constructing the Twig, with an extra method call.

Other questions.

In XML-Twig-3.39/Twig_pm.slow line 898 (in the _html2xml function), is there a point to calling $tree->delete a second time?

What's the status of parsing with XML::Tidy? I haven't tried whether it works, and it seems documented only halfway. Are there at least some tests for it so if I install XML::Tidy I can be sure I don't break the logic accidentally?

Ambrus

Thu Nov 17 09:01:34 2011 xmltwig [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #71839] Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Thu, 17 Nov 2011 15:00:53 +0100
To:	bug-XML-Twig [...] rt.cpan.org
From:	mirod <xmltwig [...] gmail.com>

On 11/16/2011 04:13 PM, ambrus@math.bme.hu via RT wrote:

> As I'd like to write a patch for this, let me ask a few questions.
>
>> Probably allow
>> passing a HTML::TreeBuilder object to the constructor. The object could
>> be constructed by an XML::Twig class method, then modified by the user
>> and passed to XML::Twig::new.
>
> I'm not sure I like that idea. In the future, you could want some
> Twig constructor options that affect parsing with the HTML parser, and
> that need to be used at the time you construct the HTML parser.
> Though I'm more thinking of parse options that affect XML and HTML
> parsing the same way, it turns out that in fact you already have a
> constructor option TidyOptions doing this. Thus it would make more
> sense if you created the TreeBuilder object after constructing the
> Twig, with an extra method call.

OK, go for it. As it is, the way the HTML conversion is designed works for me, but since you have different requirements, go ahead, as long as the existing way still works and the tests still pass.

> Other questions. In XML-Twig-3.39/Twig_pm.slow line 898 (in the
> _html2xml function), is there a point to calling $tree->delete a
> second time?

Apparently not. I removed it.

> What's the status of parsing with XML::Tidy? I haven't tried whether
> it works, and it seems documented only halfway. Are there at least
> some tests for it so if I install XML::Tidy I can be sure I don't
> break the logic accidentally?

It's HTML::Tidy, not XML::Tidy. The tests are in t/test_3_36.t and in t/test_memory.t.

I tend to use HTML::Tidy to convert to XHTML these days, as in a few cases the output is more faithful to the original HTML than with HTML::TreeBuilder, so it is used, if not properly documented. I'll go back to the docs and see if I can improve them.

Thanks

--
mirod
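For readers who have not tried the HTML::Tidy route mentioned above, a minimal sketch of how it is selected follows. It assumes HTML::Tidy is installed alongside XML::Twig and relies only on the use_tidy option named in the configuration printout earlier in this ticket; the sample HTML is made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Requires HTML::Tidy in addition to XML::Twig.
# use_tidy asks parse_html to convert the HTML with HTML::Tidy
# instead of HTML::TreeBuilder.
my $twig = XML::Twig->new( use_tidy => 1 );

# Deliberately sloppy HTML: the <p> elements are never closed.
$twig->parse_html('<html><head><title>t</title></head><body><p>one<p>two</body></html>');

# Print the text of every paragraph found after conversion.
my @paragraphs = $twig->root->descendants('p');
print $_->text, "\n" for @paragraphs;
```

Which converter gives the more faithful result depends on the input, so it may be worth running the same document through both and comparing.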

Thu Nov 17 16:22:16 2011 ambrus [...] math.bme.hu - Correspondence added

Subject:	Re: [rt.cpan.org #71839] Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Thu, 17 Nov 2011 22:21:34 +0100
To:	bug-XML-Twig [...] rt.cpan.org
From:	Zsbán Ambrus <ambrus [...] math.bme.hu>

I was wondering if I could use XML::LibXML as a third HTML parser for Twig. I got this.

$ perl -we 'use XML::LibXML; my $pa = XML::LibXML->new(recover => 1); my $tr = $pa->load_html(string => "<.m/>"); my $xm = $tr->toString(0); print $xm, "\n";'
HTML parser error : Tag .m invalid
<.m/>
^
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><.m/></body></html>
$

Turns out, XML::LibXML outputs an invalid element name, so I can't parse the output as XML. ARGH!

I've got deja vu: I just fixed a similar error with HTML::Tree and invalid attribute names ("https://rt.cpan.org/Public/Bug/Display.html?id=71805"). But in this case, the bug is likely inside C code, so I can't fix it so easily.

I wonder, how can one work this around?

Ambrus
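One possible stopgap, sketched here purely as an illustration and not as something proposed in the ticket, is to scan the serialized output for tag names an XML parser would reject before handing it on. Both function names are hypothetical, the name check is an ASCII-only approximation of the XML 1.0 Name production, and the regex-based scan is naive (it does not understand comments, CDATA sections or "<" inside attribute values).

```perl
use strict;
use warnings;

# ASCII-only approximation of the XML 1.0 Name production.
# Real XML names also allow many non-ASCII characters; this sketch ignores them.
sub is_ascii_xml_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z_:][A-Za-z0-9_:.\-]*\z/ ? 1 : 0;
}

# Hypothetical pre-check: return the distinct start-tag names in a
# serialized document that fail the name test above.
# End tags, <!...> declarations and <?...?> instructions are skipped.
sub invalid_tag_names {
    my ($markup) = @_;
    my %seen;
    return grep { !is_ascii_xml_name($_) && !$seen{$_}++ }
           $markup =~ /<([^\s<>\/!?][^\s<>\/]*)/g;
}

# The serialized body from the report above.
print join(', ', invalid_tag_names('<html><body><.m/></body></html>')), "\n";
# prints ".m"
```

A caller could then decide whether to reject the document, or to rename or drop the offending elements, before passing the string to an XML parser.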

Mon Nov 21 07:28:00 2011 xmltwig [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #71839] Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Mon, 21 Nov 2011 13:27:14 +0100
To:	bug-XML-Twig [...] rt.cpan.org
From:	mirod <xmltwig [...] gmail.com>

Sorry it took me so long to answer, there seem to be even more problems than I thought with HTML::TreeBuilder, and I am trying to fix them.

When I tested HTML to XML conversion, a few years ago, I found that, at least at the time, XML::LibXML did not deal very well with real-world HTML, so I did not look further into it.

I am aware of the problems with HTML::TreeBuilder conversion to XHTML. Generally speaking, as_XML does not seem to be the most developed feature of the module. And dying when finding something invalid in the HTML is quite the opposite of what I am looking for in an HTML parsing module.

There is also the problem that HTML::Parser has not been updated for HTML5.

--
mirod

Mon Nov 21 08:05:12 2011 ambrus [...] math.bme.hu - Correspondence added

Subject:	Re: [rt.cpan.org #71839] Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date:	Mon, 21 Nov 2011 14:04:31 +0100
To:	bug-XML-Twig [...] rt.cpan.org
From:	Zsbán Ambrus <ambrus [...] math.bme.hu>

On Mon, Nov 21, 2011 at 1:28 PM, xmltwig@gmail.com via RT <bug-XML-Twig@rt.cpan.org> wrote:

> When I tested HTML to XML conversion, a few years ago, I found that, at
> least at the time, XML::LibXML did not deal very well with real-world
> HTML, so I did not look further into it.

I have also looked a bit more at XML::LibXML. I've decided I'll probably ignore it for now, and will use either HTML::Tree or HTML::Tidy for HTML parsing. I've sent them a bug report about the one bug I mentioned to you, though: "http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=649189". (I won't completely forget about libxml2 though. If I ever want to do XML mangling without perl involved, I'll certainly check out how good their documentation is.)

> Sorry it took me so long to answer, there seem to be even more problems
> than I thought with HTML::TreeBuilder, and I am trying to fix them.

> And dying when finding something invalid in the
> HTML is quite the opposite of what I am looking for in an HTML parsing
> module.

Apply the patch from "https://rt.cpan.org/Public/Bug/Display.html?id=71805" first then.

> There is also the problem that HTML::Parser has not been updated for HTML5.

True, but XML::LibXML and HTML::Tidy have not been updated either. One tricky part of HTML5 parsing is that HTML::Parser would have to be aware of the encoding of the document to know what data-* attribute names are valid.

Ambrus

