This queue is for tickets about the Web-Scraper CPAN distribution.

Report information
The Basics
Id: 85443
Status: open
Priority: 0/
Queue: Web-Scraper

People
Owner: Nobody in particular
Requestors: ipluta [...] wp.pl
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: don't call decoded_content if content is already unicode encoded
When using URI or HTTP::Response object as an argument to scrape(), simply $stuff->content should be used as $html content, in place of unconditional $stuff->decoded_content if $stuff->content is already utf-8 encoded. "wide character" errors may follow, otherwise.

Here's a patch ($VERSION = '0.37'):

diff --git a/lib/Web/Scraper.pm b/lib/Web/Scraper.pm
index aca019c..7ad9b7f 100644
--- a/lib/Web/Scraper.pm
+++ b/lib/Web/Scraper.pm
@@ -64,7 +64,10 @@ sub scrape {
         return $self->scrape($res, $stuff->as_string);
     } elsif (blessed($stuff) && $stuff->isa('HTTP::Response')) {
         if ($stuff->is_success) {
-            $html = $stuff->decoded_content;
+            $html =
+                $stuff->content_charset =~ /utf\-8/i
+                ? $stuff->content
+                : $stuff->decoded_content;
         } else {
             croak "GET " . $stuff->request->uri . " failed: ", $stuff->status_line;
         }
Subject: Re: [rt.cpan.org #85443] don't call decoded_content if content is already unicode encoded
Date: Sun, 19 May 2013 11:42:56 -0700
To: bug-Web-Scraper [...] rt.cpan.org
From: Tatsuhiko Miyagawa <miyagawa [...] gmail.com>
On Sun, May 19, 2013 at 7:59 AM, Ireneusz Pluta via RT <bug-Web-Scraper@rt.cpan.org> wrote:
> When using URI or HTTP::Response object as an argument to scrape(), simply $stuff->content should be used as $html content, in place of unconditional $stuff->decoded_content if $stuff->content is already utf-8 encoded. "wide character" errors may follow, otherwise.
You might not understand what `decoded_content` does, since if the content is utf-8 "encoded", decoding it is obviously the right thing to do.

If you have "Wide character" warnings (not errors, I assume) elsewhere, that sounds like more of an issue that has to be fixed there, not inside Web::Scraper like this.
-- Tatsuhiko Miyagawa
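
To illustrate the distinction drawn in the reply above, here is a minimal sketch of how `content` and `decoded_content` differ on an HTTP::Response whose body is UTF-8. The response is built by hand so no network access is needed; the sample text and variable names are assumptions made for this example and are not taken from Web::Scraper itself.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Sample text (an assumption for this sketch): "składa się" as Perl characters.
my $text  = "sk\x{142}ada si\x{119}";
# The octets a server would actually send for that text.
my $bytes = encode('UTF-8', $text);

my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    $bytes,
);

my $raw     = $res->content;          # undecoded UTF-8 octets
my $decoded = $res->decoded_content;  # character string, decoded per the charset

# The two non-ASCII letters take two octets each in UTF-8.
printf "content:         %d octets\n",     length($raw);      # 12
printf "decoded_content: %d characters\n", length($decoded);  # 10
```

An HTML parser fed `$raw` would treat each octet as a separate character, which is why the decoded string is the appropriate input for scraping even when the body is "already" UTF-8.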
From: ipluta [...] wp.pl
On Sun 19 May 2013, 14:43:35, miyagawa@gmail.com wrote:
> You might not understand what `decoded_content` does since if the
> content is utf-8 "encoded", decoding them is obviously the right thing
> to do.
>
> If you have "Wide character" warnings (not errors I assume) elsewhere
> that sounds like more of an issue that has to be fixed there, not
> inside Web::Scraper like this.
Tatsuhiko,

thanks for your response. It's true that my understanding of Perl's Unicode handling is somewhat behind what it should be :-).

Anyway, could you please take a look at the following session pasted from your bin/scraper interactive utility, scraping a fragment of the Polish Perl Mongers site? Note the "Wide character" warning at the 'y' command:

$ scraper http://warszawa.pm.org/
scraper> process 'p', 'p', 'text';
scraper> y
Wide character in warn at /usr/local/perl/bin/scraper line 70.
---
p: 'Grupa Warszawa.pm składa się z osób zajmujących się zawodowo lub hobby’stycznie językiem Perl, dynamicznymi językami programowania oraz całym mnóstwem zagadnień mniej lub bardziej związanych ze społecznością języka Perl i open source. Jednak, żeby być szczerym, trzeba powiedzieć, iż całego czasu wolnego nie spędzamy rozwiązując zagadki programistyczne, o czym świadczą choćby nasze spotkania!'
scraper> c
#!/usr/local/perl-5.16.3/bin/perl
use strict;
use Web::Scraper;
use URI;

my $uri = URI->new("http://warszawa.pm.org/");
my $scraper = scraper {
    process 'p', 'p', 'text';
};
my $result = $scraper->scrape($uri);
scraper>
Subject: Re: [rt.cpan.org #85443] don't call decoded_content if content is already unicode encoded
Date: Sun, 19 May 2013 13:13:06 -0700
To: bug-Web-Scraper [...] rt.cpan.org
From: Tatsuhiko Miyagawa <miyagawa [...] gmail.com>
That's just a warning raised because the script tries to "warn" decoded Unicode strings to the terminal, and you can totally ignore it.
-- Tatsuhiko Miyagawa
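
For readers who would rather silence the warning than ignore it, the following is a minimal sketch of two common ways to output a decoded string without triggering "Wide character in warn". It assumes a UTF-8 terminal; the sample text and variable names are illustrative and do not come from the scraper script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Decoded characters, like those decoded_content returns (sample text is an assumption).
my $text = "sk\x{142}ada si\x{119}";

# Option 1: encode explicitly at the point of output.
my $octets = encode('UTF-8', $text);
warn "$octets\n";

# Option 2: put an encoding layer on the handle once,
# then pass character strings to it directly.
binmode(STDERR, ':encoding(UTF-8)');
warn "$text\n";
```

Either approach keeps the data as characters inside the program and converts to octets only at the output boundary, which is consistent with the maintainer's point that the fix belongs where the string is printed rather than inside Web::Scraper.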