Subject: Bug with some URLs images
Date: Mon, 2 Apr 2012 13:17:35 +0100
To: bug-HTML-Miner [...] rt.cpan.org
From: Andy Newby <andy.newby [...] gmail.com>
Hi Harish,
I'm not sure if you remember me. I'm the one who worked with you to get
JS/CSS file extraction into HTML::Miner (quite a while back).
I seem to have come across a problem with the module that I can't get my
head around. For some reason, this URL:
http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1
seems to have problems with some of its images. For example:
$VAR1 = {
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-03-lg._V134457777_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-02._V134401297_.jpg' => 1,
  'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-bottom._V153941350_.gif' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V167145160_.gif' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-04._V134401302_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/gno/images/orangeBlue/navPackedSprites-UK-15._V202471918_.png' => 1,
  'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-center._V153941350_.gif' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-01._V134401302_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-01._V134401302_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/common-assets/play-btn-off._V151901884_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-04._V134401297_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-main-lg._V134401297_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-02._V134401297_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-04-lg._V134401297_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-main-01._V134401297_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/tequila/dp/eink-text._V138789301_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/x-locale/communities/customerimage/play-shuttle-off._V192200344_.gif' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-05-lg._V134401296_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-03._V134401296_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-features-06._V134401296_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-aag-tn-03._V134401302_.jpg' => 1,
  'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-tech-01._V134401296_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/whitney/dp/uk-kw-slate-02-lg._V134401297_.jpg' => 1,
  'http://g-ecx.images-amazon.com/images/G/02/kindle/www/myk/icon_spinner._V192252841_.gif' => 1
};
Some of these are wrong: the page URL has been prepended to an image URL
that was already absolute (I think it's because some images use https and
others don't). For example:
http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif
Should be:
https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif
Do you have any ideas on a fix for this? I'll keep playing around with it,
but seeing as it's your code (and it's a bug :)), I thought I would give you a
shout to see if you have any suggestions. I'm simply calling it with:
my $html_miner = HTML::Miner->new(
    CURRENT_URL      => $URL,
    CURRENT_URL_HTML => \$html
);
my $images    = $html_miner->get_images();
my $meta_data = $html_miner->get_meta_elements();
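As a stopgap until the module itself is fixed, the returned URLs could perhaps be cleaned up afterwards. The following is only a sketch: fix_image_url is a hypothetical helper (not part of HTML::Miner), and it assumes the breakage always looks like the examples above, i.e. an absolute http:// or https:// URL appended after a slash to the page URL.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Miner.
# If an absolute http(s) URL has been glued onto the end of another
# http(s) URL (separated by "/"), return only the trailing absolute URL.
# Any other input is returned unchanged.
sub fix_image_url {
    my ($url) = @_;
    if ( $url =~ m{^https?://.+?/(https?://.+)$} ) {
        return $1;
    }
    return $url;
}

# One broken and one correct URL taken from the dump above.
my @urls = (
    'http://www.amazon.co.uk/Kindle-Touch-Wi-Fi-Screen-Display/dp/B005890FUI/ref=amb_link_163489267_3?pf_rd_m=A3P5ROKL5A1OLE&pf_rd_s=center-1/https://images-na.ssl-images-amazon.com/images/G/02/kindle/tequila/comp-arrow-top._V153941344_.gif',
    'http://g-ecx.images-amazon.com/images/G/02/kindle/www/myk/icon_spinner._V192252841_.gif',
);

print fix_image_url($_), "\n" for @urls;
```

Note that this would also rewrite URLs that legitimately embed another URL in their path, so it is a workaround for this page rather than a general fix. The proper fix is presumably for the module to leave a src untouched when it already has a scheme.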
TIA
--
Andy Newby
andy@ultranerds.co.uk