Subject: Wishlist: more control over parse_html (using HTML::TreeBuilder)
Date: Fri, 21 Oct 2011 17:18:07 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Zsbán Ambrus <ambrus [...] math.bme.hu>
Hello,
I'd like to ask you to add a way in XML::Twig for the user to have
more control over HTML parsing with HTML::TreeBuilder.
I'd like to parse some HTML to an XML::Twig, but want to unset the
ignore_unknown option of HTML::TreeBuilder. There seems to be no easy
way to do this, because I never get to touch the TreeBuilder object if
I use the parse_html method of Twig, and the parse_html method does
some arcane fixups on the TreeBuilder object, so I can't just emulate
the whole procedure in user code.
Thus, it would be useful if XML::Twig had some interface for using
HTML::TreeBuilder that's more general than the parse_html method.
I'm not sure what this interface should look like, but here's an idea.
First, I'd call an object method of a Twig that constructs an
HTML::TreeBuilder object preset with the default options for later use
with this Twig. I would then call the parse_content or parse_file
method on the TreeBuilder, and possibly make any other changes before
or after the parse. Then I'd call a second method of the Twig which
loads the tree from the TreeBuilder into the Twig.
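
A minimal sketch of the two-step flow described above, using only calls
that exist today. The two subs, new_treebuilder_for_twig and
load_treebuilder_into_twig, are hypothetical stand-ins for the proposed
Twig methods and are not part of XML::Twig. The loader here simply feeds
the output of as_XML to parse, so it does not reproduce the fixups that
parse_html applies. The sample HTML and its <custom> tag are made up for
illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use XML::Twig;

# Hypothetical stand-in for the first proposed method: hand back a
# TreeBuilder the caller is free to configure before parsing.
sub new_treebuilder_for_twig {
    my ($twig) = @_;    # unused here; a real method could preset options from the Twig
    return HTML::TreeBuilder->new;
}

# Hypothetical stand-in for the second proposed method: serialize the
# HTML tree as XML and parse that into the Twig (assumption: no fixups).
sub load_treebuilder_into_twig {
    my ($twig, $tree) = @_;
    $twig->parse($tree->as_XML);
    $tree->delete;      # release the HTML::Element tree
    return $twig;
}

my $twig = XML::Twig->new;
my $tree = new_treebuilder_for_twig($twig);

# The step parse_html gives no access to: keep unknown tags.
$tree->ignore_unknown(0);
$tree->parse_content(
    '<html><body><p>Hello <custom>world</custom></p></body></html>');

load_treebuilder_into_twig($twig, $tree);

# Check whether the unknown element survived into the Twig.
my $custom = $twig->first_elt('custom');
my $text   = $custom ? $custom->text : '(custom element not kept)';
print "$text\n";
```

Splitting the work in two keeps every TreeBuilder option reachable by
the caller, while the Twig would still own the conversion step.
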
I'd like to ask for your opinion on whether such an interface would
make sense. If yes, I might try to write a patch for it (but I don't
promise anything).
I am using XML::Twig version 3.39, whose configuration information I
attach at the bottom. (Btw, why does that printout not include the
version number of Twig itself?)
Ambrus
-----------
Configuration:
perl: 5.014002
OS: linux - x86_64-linux
required
  XML::Parser               : 2.41
  expat                     : <no version information found>
Strongly Recommended
  Scalar::Util              : 1.23 (for improved memory management)
  Encode                    : 2.42_01 (for encoding conversions)
Modules providing additional features
  XML::XPathEngine          : <not available> (to use XML::Twig::XPath)
  XML::XPath                : 1.13 (to use XML::Twig::XPath if Tree::XPathEngine not available)
  LWP                       : 6.02 (for the parseurl method)
  HTML::TreeBuilder         : 4.2 (to use parse_html and parsefile_html)
  HTML::Entities::Numbered  : <not available> (to allow parsing of HTML containing named entities)
  HTML::Tidy                : <not available> (to use parse_html and parsefile_html with the use_tidy option)
  HTML::Entities            : 3.69 (for the html_encode filter)
  Tie::IxHash               : <not available> (for the keep_atts_order option)
  Text::Wrap                : 2009.0305 (to use the "wrapped" option for pretty_print)
Modules used only by the auto tests
  Test                      : 1.25_02
  Test::Pod                 : 1.45
  XML::Simple               : <not available>
  XML::Handler::YAWriter    : <not available>
  XML::SAX::Writer          : <not available>
  XML::Filter::BufferText   : <not available>
  IO::Scalar                : <not available>
Please add this information to bug reports (you can run
t/zz_dump_config.t to get it)
if you are upgrading the module from a previous version, make sure you read the
Changes file for bug fixes, new features and the occasional
COMPATIBILITY WARNING
1..1
ok 1