Bug #8087 for WWW-Mechanize: [Enhancement] new ->content() features, HTML filter policy

Thu Oct 21 14:07:09 2004 dom [...] idealx.com - Ticket created

Date:	Thu, 21 Oct 2004 20:08:59 +0200
From:	Dominique Quatravaux <dom [...] idealx.com>
To:	bug-www-mechanize [...] rt.cpan.org, www-mechanize-development [...] lists.sf.net
CC:	Benoit Picaud <benoit [...] IDEALX.com>
Subject:	[Enhancement] new ->content() features, HTML filter policy

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hello WWW-Mechanists, Here comes another patch proposal against the current CVS (+ all four patches that I have currently pending). What it does: ~ * encourage developpers to overload ->update_html($html) instead ~ of ->_parse_html(). Since update_html() takes an argument, and ~ as outlined in the pod, this allows one to straightforwardly ~ provide an arbitrary HTML filter to the mech: ~ sub My::Subclass::update_html { ~ my ($self, $html) = @_; ~ $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg; ~ $self->WWW::Mechanize::update_html( $html ); ~ } ~ Of course the old ->_parse_html() is still called for compatibility (I've seen more than one instance of folks who need that to continue to work), even though in the new base class' version it does nothing. ~ * Resubmit part of the extensions proposed by Nicolas Thiery in ~ July, and OKed by Mark Stosberg - but there were no tests by ~ then :-) and Mark raised issues with the API, that I took into ~ consideration this time around. We now have ~ $mech->content("text") (optional dependency to HTML::TreeBuilder ~ - I plan to use that too for some of my future work, if this is ~ not acceptable even as an optional dependency please let me ~ know), and $mech->content(base_href => $url). Please credit ~ Nicolas Thiery for both. Any comments on my past, present, future work on the Mech? Please? :-) - -- Dominique QUATRAVAUX Ingénieur senior 01 44 42 00 08 IDEALX -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org iD8DBQFBd/s7MJAKAU3mjcsRAvPlAJ0ckY/7mmb7lPna2MAXIhykyqeUUgCgtmbB 4lC+SezxluxK+qS3qE2G4UY= =2fzE -----END PGP SIGNATURE-----

diff -Nr -U3 www-mechanize.BASE/lib/WWW/Mechanize.pm www-mechanize/lib/WWW/Mechanize.pm --- www-mechanize.BASE/lib/WWW/Mechanize.pm Thu Oct 21 13:40:21 2004 +++ www-mechanize/lib/WWW/Mechanize.pm Thu Oct 21 19:43:14 2004 @@ -927,10 +927,6 @@ Returns the base URI for the current response -=head2 $mech->content() - -Returns the content for the response - =head2 $mech->forms() When called in a list context, returns a list of the forms found in @@ -962,7 +958,6 @@ sub status { my $self = shift; return $self->{status}; } sub ct { my $self = shift; return $self->{ct}; } sub base { my $self = shift; return $self->{base}; } -sub content { my $self = shift; return $self->{content}; } sub current_form { my $self = shift; return $self->{form}; } sub is_html { my $self = shift; return defined $self->{ct} && ($self->{ct} eq "text/html"); } @@ -998,6 +993,62 @@ =head1 CONTENT-HANDLING METHODS +=head2 $mech->content(...) + +Returns the content that the mech uses internally for the last page +fetched. Ordinarily this is the same as $mech->response()->content(), +but this may differ for HTML documents if L</update_html> is +overloaded (in which case the value passed to the base-class +implementation of same will be returned), and/or extra arguments are +passed to I<content()>: + +=over 2 + +=item I<< $mech->content("text") >> + +Returns a text-only version of the page, with all HTML markup +stripped. This feature requires I<HTML::TreeBuilder> to be installed, +or a fatal error will be thrown. + +=item I<< $mech->content("base_href") >> + +=item I<< $mech->content("base_href", $base_href) >> + +Returns the HTML document, modified to contain a C<< <base +href="$base_href"> >> mark-up in the header. $base_href is C<< +$mech->base() >> if not specified. This is handy to pass the HTML to +e.g. L<HTML::Display>. + +=back + +Passing parameters to content() if the current document is not HTML +has no effect (i.e. the return value is the same as +$self->response()->content()). + +=cut + +sub content { + my $self = shift; + my $content = $self->{content}; + return $content unless $self->is_html; + + if (defined(my $transform = shift)) { + if ($transform eq "base_href") { + my $base = shift || $self->base; + $content=~s/<head>/<head>\n<base href="$base">/; + } elsif ($transform eq "text") { + require HTML::TreeBuilder; + my $tree = HTML::TreeBuilder->new(); + $tree->parse($self->content()); + $tree->eof(); + $tree->elementify(); # just for safety + $content = $tree->as_text(); + } + } + + return $content; +} + =head2 $mech->find_link() This method finds a link in the currently fetched page. It returns a @@ -1337,7 +1388,7 @@ =head2 $mech->request( $request [, $arg [, $size]]) Overloaded version of C<request()> in L<LWP::UserAgent>. Performs -the actual request. Normally, if you're using WWW::Mechanize, it'd +the actual request. Normally, if you're using WWW::Mechanize, it's because you don't want to deal with this level of stuff anyway. Note that C<$request> will be modified. @@ -1366,7 +1417,6 @@ $self->{status} = $res->code; $self->{base} = $res->base; $self->{ct} = $res->content_type || ""; - $self->{content} = $res->content; if ( $res->is_success ) { $self->{uri} = $self->{redirected_uri}; @@ -1377,8 +1427,12 @@ } } - $self->_reset_page; - $self->_parse_html if $self->is_html; + $self->_reset_page; + if ($self->is_html) { + $self->update_html($res->content); + } else { + $self->{content} = $res->content; + } return $res; } # request @@ -1386,7 +1440,7 @@ =head2 $mech->update_html( $html ) Allows you to replace the HTML that the mech has found. Updates the -forms and links. +forms and links parse-trees that the mech uses internally. Say you have a page that you know has malformed output, and you want to update it so the links come out correctly: @@ -1395,6 +1449,29 @@ $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg; $mech->update_html( $html ); +This method is also used internally by the mech itself to update its +own HTML content when loading a page. This means that if you would +like to I<systematically> perform the above HTML substitution, you +would overload I<update_html> in a subclass thusly: + + package MyMech; + use base 'WWW::Mechanize'; + + sub update_html { + my ($self, $html) = @_; + $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg; + $self->WWW::Mechanize::update_html( $html ); + } + +If you do this, then the mech will use the tidied-up HTML instead of +the original both when parsing for its own needs, and for returning to +you through L</content>. + +Overloading this method is also the recommended way of implementing +extra validation steps (e.g. link checkers) for every HTML page +received. L</warn> and L</die> would then come in handy to signal +validation errors. + =cut sub update_html { @@ -1404,22 +1481,8 @@ $self->_reset_page; $self->{ct} = 'text/html'; $self->{content} = $html; - $self->_parse_html; - return; -} - -=head2 $mech->_parse_html() - -An internal method that initializes forms and links given a HTML document. -If you need to override this in your subclass, or call it multiple times, -go ahead. - -=cut - -sub _parse_html { - my $self = shift; - $self->{forms} = [ HTML::Form->parse($self->content, $self->base) ]; + $self->{forms} = [ HTML::Form->parse($html, $self->base) ]; if (@{ $self->{forms} }) { for my $form (@{ $self->{forms} }) { for my $input ($form->inputs) { @@ -1431,6 +1494,11 @@ } $self->{form} = $self->{forms}->[0]; $self->_extract_links(); + + $self->_parse_html(); #For compatibility with folks that used to + # overload that method. + + return; } =head2 $mech->_modify_request( $req ) @@ -1545,6 +1613,17 @@ return $arg =~ /^\d+$/ ? $self->form_number($arg) : $self->form_name($arg); } + +=head2 $mech->_parse_html() + +An internal method that initializes forms and links given a HTML +document. Overriding this in your subclass is B<DEPRECATED>, better +override L</update_html> instead in your new code. + +=cut + +sub _parse_html { } + =head1 INTERNAL-ONLY METHODS diff -Nr -U3 www-mechanize.BASE/t/content.t www-mechanize/t/content.t --- www-mechanize.BASE/t/content.t Thu Jan 1 01:00:00 1970 +++ www-mechanize/t/content.t Thu Oct 21 19:44:25 2004 @@ -0,0 +1,65 @@ +use warnings; +use strict; +use lib 't/local'; +use LocalServer; +use Test::More tests => 5; + +=pod + +=head1 NAME + +content.t + +=head1 SYNOPSIS + +Tests the transforming forms of $mech->content(). + +=cut + +BEGIN { delete @ENV{ qw( http_proxy HTTP_PROXY ) }; } +BEGIN { + use_ok( 'WWW::Mechanize' ); +} + +my $html = <<'HTML'; +<html> +<head> +<title>Howdy?</title> +</head> +<body> +Fine, thx! +</body> +</html> +HTML + + +my $mech = WWW::Mechanize->new(); +# Well actually there is no base (and therefore it does not belong to us +# :-), so let's kludge a bit. +$mech->{base} = "http://example.com/"; +$mech->update_html($html); + +=head2 $mech->content("text") + +=cut + +SKIP: { + eval "use HTML::TreeBuilder"; + skip "HTML::TreeBuilder not installed", 2 if $@; + + my $text = $mech->content("text"); + like( $text, qr/Fine/); + unlike( $text, qr/html/i); +} + +=head2 $mech->content("base_href" [, $base ]) + +=cut + +my $content = $mech->content("base_href", "foo"); +like($content, qr/base href="foo"/); + + +$content = $mech->content("base_href"); +like($content, qr[base href="http://example.com/"]); + diff -Nr -U3 www-mechanize.BASE/t/local/overload.t www-mechanize/t/local/overload.t --- www-mechanize.BASE/t/local/overload.t Thu Jan 1 01:00:00 1970 +++ www-mechanize/t/local/overload.t Thu Oct 21 19:03:25 2004 @@ -0,0 +1,82 @@ +use warnings; +use strict; +use lib 't/local'; +use LocalServer; +use Test::More tests => 10; + +=pod + +=head1 NAME + +overload.t + +=head1 SYNOPSIS + +This tests for various ways, advertised in L<WWW::Mechanize>, to +create a subclass of the mech to alter it's behavior in a useful +manner. (Of course free-style overloading is discouraged, as it breaks +encapsulation big time.) + +=head2 Overloading update_html() + +This is the recommended way to tidy up the received HTML in a generic +way, and/or to install supplemental "surface tests" on the HTML +(e.g. link checker). + +=cut + +BEGIN { delete @ENV{ qw( http_proxy HTTP_PROXY ) }; } +BEGIN { + use_ok( 'WWW::Mechanize' ); +} + +my $server = LocalServer->spawn(html => <<'BROKEN_HTML'); +<html> +<head><title>Broken document</head> +<form> +<table> +<tr><select name="foo"> +<option value="bar">Bar</option></td></tr> +</form> +</html> +BROKEN_HTML +isa_ok( $server, 'LocalServer' ); + +do { + package MyMech; + use base 'WWW::Mechanize'; + + sub update_html { + my ($self, $html) = @_; + $html =~ s[Broken][Fixed]isg; + $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg; + + $self->WWW::Mechanize::update_html( $html ); + } +}; + +my $carpmsg; +local $^W = 1; +local *Carp::carp = sub {$carpmsg = shift}; + +my $mech = WWW::Mechanize->new(); +do { + $mech->get ($server->url); + like($carpmsg, qr/bad.*select/i, "Standard mech chokes on bogus HTML"); +}; + +# If at first you don't succeed, try with a shorter bungee... +undef $carpmsg; +$mech = MyMech->new(); +isa_ok( $mech, 'WWW::Mechanize', 'Derived object' ); + +my $response = $mech->get( $server->url ); +isa_ok( $response, 'HTTP::Response', 'Response I got back' ); +ok( $response->is_success, 'Got URL' ) or die "Can't even fetch local url"; +ok( $mech->is_html, "Local page is HTML" ); +ok(! $carpmsg, "No warnings this time"); + +my @forms = $mech->forms; +is( scalar @forms, 1, "One form" ); + +like($mech->content(), qr[/select], "alteration visible in ->content() too");

Thu Oct 21 23:16:48 2004 MARKSTOS [...] cpan.org - Severity Wishlist added

Thu Oct 21 23:22:34 2004 MARKSTOS [...] cpan.org - Taken

Thu Oct 21 23:28:40 2004 MARKSTOS [...] cpan.org - Correspondence added

[dom@idealx.com - Thu Oct 21 14:07:09 2004]: Show quoted text

> > Here comes another patch proposal against the current CVS (+ all four > patches that I have currently pending). What it does: > > ~ * encourage developpers to overload ->update_html($html) instead > ~ of ->_parse_html(). Since update_html() takes an argument, and > ~ as outlined in the pod, this allows one to straightforwardly > ~ provide an arbitrary HTML filter to the mech: > > ~ sub My::Subclass::update_html { > ~ my ($self, $html) = @_; > ~ $html =~ s[</option>.?.?.?</td>][</option></select></td>]isg; > ~ $self->WWW::Mechanize::update_html( $html ); > ~ } > > ~ Of course the old ->_parse_html() is still called for > compatibility (I've seen more than one instance of folks who need that > to continue to work), even though in the new base class' version it > does nothing. > > ~ * Resubmit part of the extensions proposed by Nicolas Thiery in > ~ July, and OKed by Mark Stosberg - but there were no tests by > ~ then :-) and Mark raised issues with the API, that I took into > ~ consideration this time around. We now have > ~ $mech->content("text") (optional dependency to > HTML::TreeBuilder > ~ - I plan to use that too for some of my future work, if this is > ~ not acceptable even as an optional dependency please let me > ~ know), and $mech->content(base_href => $url). Please credit > ~ Nicolas Thiery for both.

Dominique, Thanks again for for another patch. This patch looks generally in order. Would you mind tweaking the API for 'content' a little bit? I'm afraid that passing in argument or two won't scale up well over time. I think passing an a hash would be consistent with other parts of the Mech API and give us more flexibility to expand this over time. So how about: format => 'text', base_href => $url, Also, just to make my life easier, update from CVS again so you can produce a smaller patch. I think update_html will be an improvement over _parse_html. Thanks, Mark

Thu Oct 21 23:28:44 2004 MARKSTOS [...] cpan.org - Status changed from 'new' to 'open'

Sat Oct 23 21:09:15 2004 MARKSTOS [...] cpan.org - Correspondence added

[dom@idealx.com - Fri Oct 22 13:21:23 2004]: Show quoted text

> -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Mark Stosberg wrote: > > | I think passing an a hash would be consistent with other parts of > | the Mech API and give us more flexibility to expand this over time. > | > Attached is my revisited patch that does what you say.

Applied and commited. I did update your docs just a bit to leave open the possibility for other uses of content() with different content types in the future: "Passing arguments to content() if the current document is not HTML has no effect now (i.e. the return value is the same as $self->response()->content()). This may change in the future, but will likely be backwards-compatible when it does. " Thanks again. Mark

Sat Oct 23 21:09:17 2004 MARKSTOS [...] cpan.org - Status changed from 'open' to 'resolved'

Bug #8087 for WWW-Mechanize: [Enhancement] new ->content() features, HTML filter policy

Preferred bug tracker