
This queue is for tickets about the HTML-Miner CPAN distribution.

Report information
The Basics
Id: 76236
Status: resolved
Priority: 0/
Queue: HTML-Miner

People
Owner: Nobody in particular
Requestors: andy.newby [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug with some URLs images
Date: Mon, 2 Apr 2012 13:17:35 +0100
To: bug-HTML-Miner [...] rt.cpan.org
From: Andy Newby <andy.newby [...] gmail.com>
Hi Harish,

I'm not sure if you remember me. I'm the one who worked with you to get JS/CSS files extracted in HTML::Miner (quite a while back).

I seem to have come across a problem with the module, which I can't get my head around. For some reason, this URL:

http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1

seems to have problems with some of the images. For example:

    $VAR1 = {
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-03-lg._V134457777_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-02._V134401297_.jpg' => 1,
        'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-bottom._V153941350_.gif' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V167145160_.gif' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-04._V134401302_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/gno/images/orangeBlue/navPackedSprites-UK-15._V202471918_.png' => 1,
        'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-center._V153941350_.gif' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-01._V134401302_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-01._V134401302_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/common-assets/play-btn-off._V151901884_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-04._V134401297_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-main-lg._V134401297_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-02._V134401297_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-04-lg._V134401297_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-main-01._V134401297_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/tequila/dp/eink-text._V138789301_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/x-locale/communities/customerimage/play-shuttle-off._V192200344_.gif' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-05-lg._V134401296_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-03._V134401296_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-06._V134401296_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-03._V134401302_.jpg' => 1,
        'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-tech-01._V134401296_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-02-lg._V134401297_.jpg' => 1,
        'http://g-ecx.images-amazon.com/images/G/02/kindle/www/myk/icon_spinner._V192252841_.gif' => 1
    };

Some of these are wrong (I think it's due to some images using https, and others not). For example:

http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif

Should be:

https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif

Do you have any ideas on a fix for this? I'll keep playing around with it, but seeing as it's your code (and it's a bug :)), I thought I would give you a shout to see if you have any suggestions. I'm simply calling with:

    my $html_miner = HTML::Miner->new (
        CURRENT_URL      => $URL,
        CURRENT_URL_HTML => \$html
    );

    my $images    = $html_miner->get_images();
    my $meta_data = $html_miner->get_meta_elements() ;

TIA

-- 
Andy Newby
andy@ultranerds.co.uk
Hi Andy,

Here is the fix; I will update the module itself in a day or so.

Miner.pm line 840

Change:

    if( $possible_reletive_url =~ /http:\/\// ) {

to

    if( $possible_reletive_url =~ /http(s)?:\/\// ) {

Also, _get_redirect_url is not thread safe. I intend to make that update in the next release as well.
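The effect of the one-line change can be illustrated with a small standalone sketch. It is not HTML::Miner's actual code: `is_absolute_url` and `resolve_link` are hypothetical helpers, the example.com URLs are placeholders, and the naive "page URL plus slash plus link" join only imitates the shape of the bad results reported above. The pattern here is also anchored with `^`, which goes slightly beyond the fix in the ticket; an unanchored match would presumably also treat a relative link that merely contains `http://` somewhere in its query string as absolute.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Miner): true when the link already
# begins with an http:// or https:// scheme.
sub is_absolute_url {
    my ($link) = @_;
    return $link =~ m{^https?://}i ? 1 : 0;
}

# Hypothetical stand-in for the relative-URL handling: absolute links are
# returned untouched, anything else is appended to the page URL.
sub resolve_link {
    my ( $page_url, $link ) = @_;
    return $link if is_absolute_url($link);
    return "$page_url/$link";
}

# Placeholder URLs, assumed for illustration only.
my $page  = 'http://www.example.com/product';
my $image = 'https://images.example.com/arrow.gif';

print resolve_link( $page, $image ), "\n";            # https link kept as-is
print resolve_link( $page, 'img/logo.png' ), "\n";    # relative link joined
```

With a check that only recognises `http://`, the first call would fall through to the join and produce the page URL followed by the full https URL, which matches the pattern of the three wrong entries in the report.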
Subject: Re: [rt.cpan.org #76236] Bug with some URLs images
Date: Mon, 2 Apr 2012 14:28:50 +0100
To: bug-HTML-Miner [...] rt.cpan.org
From: Andy Newby <andy.newby [...] gmail.com>
Hi,

Thanks - that works like a charm. Can't believe I missed that bit!

A couple of other tweaks I made. The first was to pass the HTML in as a reference:

    CURRENT_URL_HTML => \$html

Then in your code, when setting up, I did:

    $require_extract = 1 unless( ${$parameter_hash{ CURRENT_URL_HTML}} ) ;

..and:

    CURRENT_URL_HTML => ${$parameter_hash{ CURRENT_URL_HTML}} ,

This just saves the module having to make a copy of the HTML. Not sure if you want to add that in =)

Also, I found it helpful to try to grab the dimensions of the image when possible. I simply do this by looking for the width="" and height="" params, i.e.:

    if( $complete_image_link =~ m/width=[\'\"](\d+)[\'\"]/is ) { $this_width = $1; }
    if( $complete_image_link =~ m/height=[\'\"](\d+)[\'\"]/is ) { $this_height = $1; }

...and of course then sending the values into the results loop.

Anyway, thanks again for the quick reply - much appreciated.

Andy
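The two tweaks described above can be sketched in isolation as follows. This is an illustration under assumptions, not the module's code: `html_is_missing` and `image_dimensions` are hypothetical names, and the sample `<img>` tag is made up. The sketch requires whitespace before `width`/`height` so that other attribute names ending in those words are less likely to match, which is a small departure from the patterns quoted in the message.

```perl
use strict;
use warnings;

# Hypothetical helper (not HTML::Miner's real constructor logic): the HTML
# arrives as a scalar reference, so only the reference is copied into
# %parameter_hash. Returns 1 when no usable HTML was supplied.
sub html_is_missing {
    my (%parameter_hash) = @_;
    my $html_ref = $parameter_hash{CURRENT_URL_HTML};
    return 1 unless ref $html_ref eq 'SCALAR';
    return 1 unless defined ${$html_ref} && length ${$html_ref};
    return 0;
}

# Hypothetical helper: pull integer width/height values out of the text of
# an <img ...> tag. A dimension that is absent comes back as undef.
sub image_dimensions {
    my ($img_tag) = @_;
    my ( $width, $height );
    if ( $img_tag =~ m/\swidth\s*=\s*['"](\d+)['"]/i )  { $width  = $1; }
    if ( $img_tag =~ m/\sheight\s*=\s*['"](\d+)['"]/i ) { $height = $1; }
    return ( $width, $height );
}

# Made-up sample tag, assumed for illustration only.
my $tag = '<img src="a.gif" width="120" height="80">';
my ( $w, $h ) = image_dimensions($tag);
print "width=$w height=$h\n";

my $html = '<html><body>' . $tag . '</body></html>';
print html_is_missing( CURRENT_URL_HTML => \$html ) ? "need to fetch\n" : "HTML supplied\n";
```

Note that storing the dereferenced value (`${$parameter_hash{ CURRENT_URL_HTML}}`) in the object still copies the string once at that point; the reference only avoids the extra copy made when the arguments are first unpacked.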
-- Andy Newby andy@ultranerds.co.uk


Cleaned up the POD, added thread safety, and fixed bugs in V1.02.