
This queue is for tickets about the WWW-Scripter-Plugin-JavaScript CPAN distribution.

Report information
The Basics
Id: 75696
Status: resolved
Priority: 0/
Queue: WWW-Scripter-Plugin-JavaScript

People
Owner: Nobody in particular
Requestors: j [...] hulala.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Bug(s) Report
Date: Sun, 11 Mar 2012 16:14:17 -0400
To: "bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org" <bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org>
From: "J. Hulala" <j [...] hulala.com>
To Whom It May Concern:

This is a report of possible bug(s) in the WWW::Scripter::Plugin::JavaScript module. I would appreciate it if you could confirm receipt and, if you are unable to fix them, email me an alternative solution, if one is available.

(1) When JavaScript includes code such as
>> a = window.history.length; <<
the statement is not evaluated, because window.history is undefined.

(2) When an image is created dynamically with JavaScript, such as
>> a = new Image; a.src = 'http://...'; <<
the image is not included in the images array of $w->images.

(3) I would like to bring to your attention that scripts are not "downloaded" with the referrer information of the document that requested them. In some cases this leads to invalid content being returned, because the content of some scripts is generated dynamically on the host server based in part on the referrer information.

---------

Please reply with your confirmation of receipt. Please do not list my name and email address on any public site or report.

Thank you.

Regards,
John Hulala
On Sun Mar 11 16:14:27 2012, j@hulala.com wrote:
> (1) When JavaScript includes code such as
> >> a = window.history.length; <<
> then such statement is not evaluated, as window.history is undefined.

As I have just mentioned in private e-mail, this is fixed in WWW::Scripter 0.026. I still have to look at the other two issues, but have been a bit busy lately. :-)

Thank you for the report.
On Sun Mar 11 16:14:27 2012, j@hulala.com wrote:
> (2) When image is created dynamically with JavaScript, such as
> >> a = new Image; a.src = 'http://...'; <<
> then such image is not included in the images array of $w->images.

I don’t consider this a bug, as the image is not part of the DOM tree. What is your use case? I might be persuaded otherwise if you could tell me what it is.

> (3) I would like to bring to your attention, that scripts are not "downloaded" with referrer information of the document that such script requested, which leads in some cases to obtaining of invalid content, as the content of some scripts are dynamically generated on host servers also based on referrer information.

I’ve just fixed that in version 0.027.
CC: "sprout [...] cpan.org" <sprout [...] cpan.org>
Subject: RE: [rt.cpan.org #75696] Bug(s) Report
Date: Sun, 18 Mar 2012 17:27:52 -0400
To: "bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org" <bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org>
From: "J. Hulala" <j [...] hulala.com>
>> When image is created dynamically with JavaScript, such as
>> >> a = new Image; a.src = 'http://...'; <<
>> then such image is not included in the images array of $w->images.

> I don’t consider this a bug, as the image is not part of the DOM tree. What is your use case? I might be persuaded otherwise if you could tell me what it is.

There should be an option to automatically process GET requests for all images when the requested document (and its frames/iframes/scripts) contains them.

This should include images created dynamically with JavaScript, as described above, even though such images are not part of the DOM tree.

A very important part of such GET requests is the correct referrer.

The reason for such a feature is that in some cases images are used not for their image content but for tracking purposes, including cookies, URL parameters, etc. If such an image is not requested, or is requested with incorrect cookies or an incorrect referrer, following a link (a virtual click on a link) might lead to incorrect content, because the remote server might return the proper content only if the correct cookies or other tracking were processed before that request.

I spent weeks testing, and I made some temporary changes to your module to be able to come to this conclusion. I hope you understand my point and will at some point release a fix for this as well.

Thank you.
On Sun Mar 18 17:28:02 2012, j@hulala.com wrote:
> Reason for such feature is that in some cases images are used not for image content, but for tracking purpose, including cookies, url parameters, etc. If such images is not requested, or is requested with incorrect cookies or incorrect referrer, a follow of some link (virtual click on link) might lead to incorrect content, as remote server might return proper content only if correct cookies or other tracking were processed prior such request.

Now that is an interesting use case. Currently WWW::Scripter doesn’t fetch any images at all. It acts like a browser with images turned off. This makes most scraping faster, so I don’t want to make fetching images the default.

However, I’m willing to provide it as an option. What would you suggest as the interface? Perhaps a boolean option to the constructor (fetch_images => 1) and a corresponding accessor method?
Subject: RE: [rt.cpan.org #75696] Bug(s) Report
Date: Sun, 18 Mar 2012 17:54:41 -0400
To: "bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org" <bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org>
From: "J. Hulala" <j [...] hulala.com>
> However, I’m willing to provide it as an option. What would you suggest as the interface? Perhaps a boolean option to the constructor (fetch_images => 1) and a corresponding accessor method?

Well, fetch_images => 1 is a nice option, but technically someone might like to get the image content or response headers as well, so in some cases a callback function is a better idea. But for most cases 0/1 is enough.

Thanks!
On Sun Mar 18 17:59:30 2012, j@hulala.com wrote:
> Well, fetch_images => 1 is nice option, but technically someone might like to get image content or response headers as well, so callback function in some cases is better idea. But for most cases 0/1 is enough.

Sorry for the delay. I’ve been a bit busy (as usual, I’m afraid), but I’ve been thinking about this. I *might* get time tomorrow to implement this.

Do you think having fetch_images => 1 and also image_handler => sub { ... } would be a good interface?

Should image_handler imply fetch_images => 1? It might be conceptually easier to understand if image_handler is a no-op without fetch_images, but that would make code that uses it more verbose.

image_handler could be passed references to (0) an HTTP::Response object, (1) the img element, and (2) the WWW::Scripter object.

I’m not sure about the order of the arguments. Actually, it should probably be the reverse of that, for similarity with event2sub.

Passing the WWW::Scripter object is not strictly necessary, as it is possible to get it with $img->ownerDocument->defaultView. But how many people know that?
Subject: RE: [rt.cpan.org #75696] Bug(s) Report
Date: Sun, 1 Apr 2012 13:10:42 +0000
To: "bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org" <bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org>
From: "J. Hulala" <j [...] hulala.com>
Great. Yes, it would be a good interface, at least for now...

Thank you!
On Sun Apr 01 09:11:11 2012, j@hulala.com wrote:
> Great. Yes, it would be good interface, at least for now...
>
> Thank you!

I’ve just uploaded WWW::Scripter version 0.028. I didn’t make these constructor arguments, for various reasons; but that could always change later.

In adding this feature, I found a bug in HTML::DOM, which is fixed in 0.053. WWW::Scripter 0.028 requires that version.
On Mon Apr 02 00:08:28 2012, SPROUT wrote:
> I’ve just uploaded WWW::Scripter version 0.028. I didn’t make these constructor arguments, for various reasons; but that could always change later.
>
> In adding this feature, I found a bug in HTML::DOM, which is fixed in 0.053. WWW::Scripter 0.028 requires that version.

I forgot to mention that I made image_handler a no-op without fetch_images set to true, for simplicity’s sake.
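
For readers following along, the interface discussed in this thread might be used roughly as in the sketch below. It rests on two assumptions that the thread does not fully confirm: that fetch_images and image_handler are accessor methods on the WWW::Scripter object (the thread only says they are not constructor arguments), and that the handler receives the WWW::Scripter object, the img element, and the HTTP::Response object in that order (the "reverse" order floated earlier). Check the WWW::Scripter 0.028 documentation before relying on either.

```perl
use strict;
use warnings;
use WWW::Scripter;

# (src, HTTP status) pairs recorded by the handler below.
my @fetched;

my $w = WWW::Scripter->new;

# Optional: enable JavaScript support.
# Requires WWW::Scripter::Plugin::JavaScript to be installed.
$w->use_plugin('JavaScript');

# Assumption: both settings are accessor methods rather than
# constructor arguments.
$w->fetch_images(1);    # per the thread, the handler is a no-op without this
$w->image_handler(sub {
    # Assumed argument order: scripter, img element, response.
    my ($scripter, $img, $response) = @_;
    push @fetched, [ $img->getAttribute('src'), $response->code ];
});

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    $w->get($url);
    printf "%s %s\n", @$_ for @fetched;
}
```

Leaving fetch_images off while keeping the handler installed should, going by the note above, simply result in the handler never being called.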
Subject: RE: [rt.cpan.org #75696] Bug(s) Report
Date: Mon, 2 Apr 2012 12:13:11 +0000
To: "bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org" <bug-WWW-Scripter-Plugin-JavaScript [...] rt.cpan.org>
From: "J. Hulala" <j [...] hulala.com>
Great news.

Thank you!