Bug #2989 for WWW-Mechanize: Create an image extractor for Mech.

Sun Jul 20 17:11:15 2003 andy [...] petdance.com - Ticket created

Subject:

Create an image extractor for Mech.

I want to be able to get all the images on a page.

Thu Dec 04 15:15:36 2003 Guest - Correspondence added

From:

markstos [...] cpan.org

[PETDANCE - Sun Jul 20 17:11:15 2003]:

> I want to be able to get all the images on a page.

Some thoughts on this:

- There may be a doc bug related to this. The docs for find_link's tag_regex attribute mention the "img" tag. However, from reading the source, an img tag will never be found.
- find_all_images seems to be easy. It would just redefine %urltags and call _extract_links() with the new definition. I might refactor to hide %urltags behind an overridable method.
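To make the "tag to attribute map" idea concrete, here is a minimal standalone sketch. It does not touch Mech's internals: the names `extract_urls`, `%link_tags` and `%image_tags` are hypothetical, and it assumes HTML::TokeParser is installed. It only shows that swapping the map is enough to switch from links to images.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical tag => attribute maps, for illustration only.
# They are not copied from WWW::Mechanize.
my %link_tags  = ( a => 'href', area => 'href', frame => 'src', iframe => 'src' );
my %image_tags = ( img => 'src' );

# Return, in document order, the value of the URL-bearing attribute
# for every start tag named in the given map.
sub extract_urls {
    my ( $html, $tags ) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my @urls;
    while ( my $token = $parser->get_tag( keys %$tags ) ) {
        my ( $tag, $attrs ) = @$token;
        my $url = $attrs->{ $tags->{$tag} };
        push @urls, $url if defined $url;    # skip tags lacking the attribute
    }
    return @urls;
}

my $html = '<a href="/home">Home</a><img src="/logo.png" alt="logo"><img alt="no src">';

print "links:  @{[ extract_urls( $html, \%link_tags ) ]}\n";
print "images: @{[ extract_urls( $html, \%image_tags ) ]}\n";
```

The same loop serves both cases; only the map passed in changes, which is the property that makes a find_all_images look cheap to add.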

Thu Dec 04 15:27:13 2003 Guest - Correspondence added

From:

markstos [...] cpan.org

[guest - Thu Dec 4 15:15:36 2003]:

> [PETDANCE - Sun Jul 20 17:11:15 2003]:
>
> > I want to be able to get all the images on a page.

I think this /is/ easy. Here's a proof of concept script below. It requires Mech to be patched slightly to expose %urltags. Here's the small patch:

####
--- /usr/local/lib/perl5/site_perl/5.8.0/WWW/Mechanize.pm
+++ /home/mark/tmp/WWW/Mechanize.pm
@@ -1269,7 +1269,8 @@

 =cut

-my %urltags = (
+use vars qw/%urltags/;
+%urltags = (
     a => "href",
     area => "href",
     frame => "src",
###################

Here's the proof of concept script:

###

#!/usr/bin/perl

use lib '/home/mark/tmp/';
use strict;
use WWW::Mechanize;
use Data::Dumper;

%WWW::Mechanize::urltags = (
    img => 'src',
);

my $a = WWW::Mechanize->new();
$a->get('http://rt.cpan.org/');
print Dumper ( $a->links );

__END__

######

I'm not suggesting implementing it quite like this, just demonstrating that the framework is already there to make this very easy.

Fri Jun 04 12:22:31 2004 Guest - Correspondence added

Subject:	Create an image extractor for Mech. (or at least fix img-related docs.)
From:	mark [...] summersault.com

Hello,

I just ran into this issue again today. I think the bug status should be elevated to Normal, or even 'Important'. The documentation demonstrates finding img links:

    $mech->find_link( tag_regex => qr/^(a|img)$/ );

However, per the above discussion, the current code will never find any img tags.

I think perhaps there should be some flag to include images in all of the functions that 'find all links'. Or perhaps it would be cleaner to just have some additional img-specific functions.

Fri Jun 04 12:28:24 2004 Guest - Correspondence added

Subject:	[DOC PATCH] Create an image extractor for Mech.
From:	mark [...] summersault.com

Fri Jun 04 12:32:42 2004 Guest - Correspondence added

Subject:	[DOC PATCH] Create an image extractor for Mech.
From:	mark [...] summersault.com

Sorry if the last msg was blank. I had a browser spaz. Below is a doc patch as a quick fix for the current situation.

Another idea: Do 'tag' and 'tag_regex' need to be limited to the same set of tags and attributes? If I search for links with these keys, I shouldn't be surprised if I get exactly what I ask for. All that's needed for img support then is to add some extra mappings somewhere that define any img tags and the attributes that hold their URLs.

This change to the 'tag' and 'tag_regex' attributes should be backwards compatible, and in fact would bring the code into compliance with the docs.

--- /usr/local/lib/perl5/site_perl/5.8.0/WWW/Mechanize.pm	Tue Apr 13 22:44:24 2004
+++ /home/mark/tmp/Mechanize.pm	Fri Jun 4 11:27:15 2004
@@ -1000,6 +1000,17 @@

     $mech->find_link( tag_regex => qr/^(a|img)$/;

+Currently, the following tags are supported, with Mech looking
+at these particular tag attributes:
+
+    <a href="">
+    <area href="">
+    <frame src="">
+    <iframe src="">
+    <meta content="">
+
+Other tags will be ignored.
+
 =item * C<< n => number >>

 Matches against the I<n>th link.
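For illustration, the "get exactly what I ask for" behaviour of tag_regex could look roughly like the sketch below. This is not Mech's code: `find_by_tag_regex` and `%url_attr_for` are hypothetical names, HTML::TokeParser is assumed to be installed, and the meta tag is left out because its content attribute would need extra parsing before it yields a URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical combined map: the documented link tags (minus meta) plus img.
my %url_attr_for = (
    a      => 'href',
    area   => 'href',
    frame  => 'src',
    iframe => 'src',
    img    => 'src',
);

# Return { tag, url } records for every known tag whose name matches
# the caller's regex, in document order.
sub find_by_tag_regex {
    my ( $html, $tag_regex ) = @_;
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot create parser";
    my @found;
    while ( my $token = $parser->get_tag( keys %url_attr_for ) ) {
        my ( $tag, $attrs ) = @$token;
        next unless $tag =~ $tag_regex;
        my $url = $attrs->{ $url_attr_for{$tag} };
        push @found, { tag => $tag, url => $url } if defined $url;
    }
    return @found;
}

my $html = '<a href="/home">Home</a><img src="/logo.png"><frame src="/nav.html">';

# The regex from the docs now matches both kinds of tag.
for my $hit ( find_by_tag_regex( $html, qr/^(a|img)$/ ) ) {
    print "$hit->{tag} $hit->{url}\n";
}
```

With img present in the map, the documented regex returns the anchor and the image and skips the frame, so callers that never ask for img see no change.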

Fri Dec 24 01:35:11 2004 andy [...] petdance.com - Status changed from 'new' to 'resolved'
