Date: | Thu, 21 Oct 2004 20:08:59 +0200 |
From: | Dominique Quatravaux <dom [...] idealx.com> |
To: | bug-www-mechanize [...] rt.cpan.org, www-mechanize-development [...] lists.sf.net |
CC: | Benoit Picaud <benoit [...] IDEALX.com> |
Subject: | [Enhancement] new ->content() features, HTML filter policy |
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hello WWW-Mechanists,
Here comes another patch proposal against the current CVS (+ all four
patches that I have currently pending). What it does:
~ * encourage developpers to overload ->update_html($html) instead
~ of ->_parse_html(). Since update_html() takes an argument, and
~ as outlined in the pod, this allows one to straightforwardly
~ provide an arbitrary HTML filter to the mech:
~ sub My::Subclass::update_html {
~ my ($self, $html) = @_;
~ $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg;
~ $self->WWW::Mechanize::update_html( $html );
~ }
~ Of course the old ->_parse_html() is still called for
compatibility (I've seen more than one instance of folks who need that
to continue to work), even though in the new base class' version it
does nothing.
~ * Resubmit part of the extensions proposed by Nicolas Thiery in
~ July, and OKed by Mark Stosberg - but there were no tests by
~ then :-) and Mark raised issues with the API, that I took into
~ consideration this time around. We now have
~ $mech->content("text") (optional dependency to HTML::TreeBuilder
~ - I plan to use that too for some of my future work, if this is
~ not acceptable even as an optional dependency please let me
~ know), and $mech->content(base_href => $url). Please credit
~ Nicolas Thiery for both.
Any comments on my past, present, future work on the Mech? Please? :-)
- --
Dominique QUATRAVAUX Ingénieur senior
01 44 42 00 08 IDEALX
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFBd/s7MJAKAU3mjcsRAvPlAJ0ckY/7mmb7lPna2MAXIhykyqeUUgCgtmbB
4lC+SezxluxK+qS3qE2G4UY=
=2fzE
-----END PGP SIGNATURE-----
diff -Nr -U3 www-mechanize.BASE/lib/WWW/Mechanize.pm www-mechanize/lib/WWW/Mechanize.pm
--- www-mechanize.BASE/lib/WWW/Mechanize.pm Thu Oct 21 13:40:21 2004
+++ www-mechanize/lib/WWW/Mechanize.pm Thu Oct 21 19:43:14 2004
@@ -927,10 +927,6 @@
Returns the base URI for the current response
-=head2 $mech->content()
-
-Returns the content for the response
-
=head2 $mech->forms()
When called in a list context, returns a list of the forms found in
@@ -962,7 +958,6 @@
sub status { my $self = shift; return $self->{status}; }
sub ct { my $self = shift; return $self->{ct}; }
sub base { my $self = shift; return $self->{base}; }
-sub content { my $self = shift; return $self->{content}; }
sub current_form { my $self = shift; return $self->{form}; }
sub is_html { my $self = shift; return defined $self->{ct} && ($self->{ct} eq "text/html"); }
@@ -998,6 +993,62 @@
=head1 CONTENT-HANDLING METHODS
+=head2 $mech->content(...)
+
+Returns the content that the mech uses internally for the last page
+fetched. Ordinarily this is the same as $mech->response()->content(),
+but this may differ for HTML documents if L</update_html> is
+overloaded (in which case the value passed to the base-class
+implementation of same will be returned), and/or extra arguments are
+passed to I<content()>:
+
+=over 2
+
+=item I<< $mech->content("text") >>
+
+Returns a text-only version of the page, with all HTML markup
+stripped. This feature requires I<HTML::TreeBuilder> to be installed,
+or a fatal error will be thrown.
+
+=item I<< $mech->content("base_href") >>
+
+=item I<< $mech->content("base_href", $base_href) >>
+
+Returns the HTML document, modified to contain a C<< <base
+href="$base_href"> >> mark-up in the header. $base_href is C<<
+$mech->base() >> if not specified. This is handy to pass the HTML to
+e.g. L<HTML::Display>.
+
+=back
+
+Passing parameters to content() if the current document is not HTML
+has no effect (i.e. the return value is the same as
+$self->response()->content()).
+
+=cut
+
+sub content {
+ my $self = shift;
+ my $content = $self->{content};
+ return $content unless $self->is_html;
+
+ if (defined(my $transform = shift)) {
+ if ($transform eq "base_href") {
+ my $base = shift || $self->base;
+ $content=~s/<head>/<head>\n<base href="$base">/;
+ } elsif ($transform eq "text") {
+ require HTML::TreeBuilder;
+ my $tree = HTML::TreeBuilder->new();
+ $tree->parse($self->content());
+ $tree->eof();
+ $tree->elementify(); # just for safety
+ $content = $tree->as_text();
+ }
+ }
+
+ return $content;
+}
+
=head2 $mech->find_link()
This method finds a link in the currently fetched page. It returns a
@@ -1337,7 +1388,7 @@
=head2 $mech->request( $request [, $arg [, $size]])
Overloaded version of C<request()> in L<LWP::UserAgent>. Performs
-the actual request. Normally, if you're using WWW::Mechanize, it'd
+the actual request. Normally, if you're using WWW::Mechanize, it's
because you don't want to deal with this level of stuff anyway.
Note that C<$request> will be modified.
@@ -1366,7 +1417,6 @@
$self->{status} = $res->code;
$self->{base} = $res->base;
$self->{ct} = $res->content_type || "";
- $self->{content} = $res->content;
if ( $res->is_success ) {
$self->{uri} = $self->{redirected_uri};
@@ -1377,8 +1427,12 @@
}
}
- $self->_reset_page;
- $self->_parse_html if $self->is_html;
+ $self->_reset_page;
+ if ($self->is_html) {
+ $self->update_html($res->content);
+ } else {
+ $self->{content} = $res->content;
+ }
return $res;
} # request
@@ -1386,7 +1440,7 @@
=head2 $mech->update_html( $html )
Allows you to replace the HTML that the mech has found. Updates the
-forms and links.
+forms and links parse-trees that the mech uses internally.
Say you have a page that you know has malformed output, and you want to
update it so the links come out correctly:
@@ -1395,6 +1449,29 @@
$html =~ s[</option>.?.?.?</td>][</option></select></td>]isg;
$mech->update_html( $html );
+This method is also used internally by the mech itself to update its
+own HTML content when loading a page. This means that if you would
+like to I<systematically> perform the above HTML substitution, you
+would overload I<update_html> in a subclass thusly:
+
+ package MyMech;
+ use base 'WWW::Mechanize';
+
+ sub update_html {
+ my ($self, $html) = @_;
+ $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg;
+ $self->WWW::Mechanize::update_html( $html );
+ }
+
+If you do this, then the mech will use the tidied-up HTML instead of
+the original both when parsing for its own needs, and for returning to
+you through L</content>.
+
+Overloading this method is also the recommended way of implementing
+extra validation steps (e.g. link checkers) for every HTML page
+received. L</warn> and L</die> would then come in handy to signal
+validation errors.
+
=cut
sub update_html {
@@ -1404,22 +1481,8 @@
$self->_reset_page;
$self->{ct} = 'text/html';
$self->{content} = $html;
- $self->_parse_html;
- return;
-}
-
-=head2 $mech->_parse_html()
-
-An internal method that initializes forms and links given a HTML document.
-If you need to override this in your subclass, or call it multiple times,
-go ahead.
-
-=cut
-
-sub _parse_html {
- my $self = shift;
- $self->{forms} = [ HTML::Form->parse($self->content, $self->base) ];
+ $self->{forms} = [ HTML::Form->parse($html, $self->base) ];
if (@{ $self->{forms} }) {
for my $form (@{ $self->{forms} }) {
for my $input ($form->inputs) {
@@ -1431,6 +1494,11 @@
}
$self->{form} = $self->{forms}->[0];
$self->_extract_links();
+
+ $self->_parse_html(); #For compatibility with folks that used to
+ # overload that method.
+
+ return;
}
=head2 $mech->_modify_request( $req )
@@ -1545,6 +1613,17 @@
return $arg =~ /^\d+$/ ? $self->form_number($arg) : $self->form_name($arg);
}
+
+=head2 $mech->_parse_html()
+
+An internal method that initializes forms and links given a HTML
+document. Overriding this in your subclass is B<DEPRECATED>, better
+override L</update_html> instead in your new code.
+
+=cut
+
+sub _parse_html { }
+
=head1 INTERNAL-ONLY METHODS
diff -Nr -U3 www-mechanize.BASE/t/content.t www-mechanize/t/content.t
--- www-mechanize.BASE/t/content.t Thu Jan 1 01:00:00 1970
+++ www-mechanize/t/content.t Thu Oct 21 19:44:25 2004
@@ -0,0 +1,65 @@
+use warnings;
+use strict;
+use lib 't/local';
+use LocalServer;
+use Test::More tests => 5;
+
+=pod
+
+=head1 NAME
+
+content.t
+
+=head1 SYNOPSIS
+
+Tests the transforming forms of $mech->content().
+
+=cut
+
+BEGIN { delete @ENV{ qw( http_proxy HTTP_PROXY ) }; }
+BEGIN {
+ use_ok( 'WWW::Mechanize' );
+}
+
+my $html = <<'HTML';
+<html>
+<head>
+<title>Howdy?</title>
+</head>
+<body>
+Fine, thx!
+</body>
+</html>
+HTML
+
+
+my $mech = WWW::Mechanize->new();
+# Well actually there is no base (and therefore it does not belong to us
+# :-), so let's kludge a bit.
+$mech->{base} = "http://example.com/";
+$mech->update_html($html);
+
+=head2 $mech->content("text")
+
+=cut
+
+SKIP: {
+ eval "use HTML::TreeBuilder";
+ skip "HTML::TreeBuilder not installed", 2 if $@;
+
+ my $text = $mech->content("text");
+ like( $text, qr/Fine/);
+ unlike( $text, qr/html/i);
+}
+
+=head2 $mech->content("base_href" [, $base ])
+
+=cut
+
+my $content = $mech->content("base_href", "foo");
+like($content, qr/base href="foo"/);
+
+
+$content = $mech->content("base_href");
+like($content, qr[base href="http://example.com/"]);
+
diff -Nr -U3 www-mechanize.BASE/t/local/overload.t www-mechanize/t/local/overload.t
--- www-mechanize.BASE/t/local/overload.t Thu Jan 1 01:00:00 1970
+++ www-mechanize/t/local/overload.t Thu Oct 21 19:03:25 2004
@@ -0,0 +1,82 @@
+use warnings;
+use strict;
+use lib 't/local';
+use LocalServer;
+use Test::More tests => 10;
+
+=pod
+
+=head1 NAME
+
+overload.t
+
+=head1 SYNOPSIS
+
+This tests for various ways, advertised in L<WWW::Mechanize>, to
+create a subclass of the mech to alter it's behavior in a useful
+manner. (Of course free-style overloading is discouraged, as it breaks
+encapsulation big time.)
+
+=head2 Overloading update_html()
+
+This is the recommended way to tidy up the received HTML in a generic
+way, and/or to install supplemental "surface tests" on the HTML
+(e.g. link checker).
+
+=cut
+
+BEGIN { delete @ENV{ qw( http_proxy HTTP_PROXY ) }; }
+BEGIN {
+ use_ok( 'WWW::Mechanize' );
+}
+
+my $server = LocalServer->spawn(html => <<'BROKEN_HTML');
+<html>
+<head><title>Broken document</head>
+<form>
+<table>
+<tr><select name="foo">
+<option value="bar">Bar</option></td></tr>
+</form>
+</html>
+BROKEN_HTML
+isa_ok( $server, 'LocalServer' );
+
+do {
+ package MyMech;
+ use base 'WWW::Mechanize';
+
+ sub update_html {
+ my ($self, $html) = @_;
+ $html =~ s[Broken][Fixed]isg;
+ $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg;
+
+ $self->WWW::Mechanize::update_html( $html );
+ }
+};
+
+my $carpmsg;
+local $^W = 1;
+local *Carp::carp = sub {$carpmsg = shift};
+
+my $mech = WWW::Mechanize->new();
+do {
+ $mech->get ($server->url);
+ like($carpmsg, qr/bad.*select/i, "Standard mech chokes on bogus HTML");
+};
+
+# If at first you don't succeed, try with a shorter bungee...
+undef $carpmsg;
+$mech = MyMech->new();
+isa_ok( $mech, 'WWW::Mechanize', 'Derived object' );
+
+my $response = $mech->get( $server->url );
+isa_ok( $response, 'HTTP::Response', 'Response I got back' );
+ok( $response->is_success, 'Got URL' ) or die "Can't even fetch local url";
+ok( $mech->is_html, "Local page is HTML" );
+ok(! $carpmsg, "No warnings this time");
+
+my @forms = $mech->forms;
+is( scalar @forms, 1, "One form" );
+
+like($mech->content(), qr[/select], "alteration visible in ->content() too");