This queue is for tickets about the Web-Scraper CPAN distribution.

Report information
The Basics
Id: 29799
Status: open
Priority: 0/
Queue: Web-Scraper

People
Owner: Nobody in particular
Requestors: jmason [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.20
Fixed in: (no value)



Subject: <br> tag should create whitespace for TEXT type
hi!

quick report -- probably easiest if I demo it. This scraper:

    use URI;
    use Web::Scraper;
    my $s_show = scraper {
        process "span.tableListing-date", date => 'TEXT';
    };
    my $starturl = "http://www.ticketmaster.ie/venue/198299";
    my $res = $s_show->scrape( URI->new($starturl));
    use Data::Dumper; die "JMD ".Dumper($res);

runs against a Ticketmaster page with this HTML:

    <span class="tableListing-date">Sat 06/10/07<br>20:00</span></td>

it should produce something like

    JMD $VAR1 = {
        'date' => 'Sat 06/10/07 20:00'
    };

(or maybe with a \n.) instead it produces

    JMD $VAR1 = {
        'date' => 'Sat 06/10/0720:00'
    };

note the missing whitespace in place of the <br>.

Web::Scraper is great fun btw, I'm amazed how easy this is ;)
Subject: Re: [rt.cpan.org #29799] <br> tag should create whitespace for TEXT type
Date: Fri, 5 Oct 2007 16:34:59 -0700
To: bug-Web-Scraper [...] rt.cpan.org
From: "Tatsuhiko Miyagawa" <miyagawa [...] gmail.com>
Thanks for the report. I think this is a problem in HTML::Element, because Web::Scraper just calls HTML::Element's as_text method for TEXT. Could you file a report against that module?

On 10/5/07, via RT <bug-Web-Scraper@rt.cpan.org> wrote:
>
> Fri Oct 05 19:18:59 2007: Request 29799 was acted upon.
> Transaction: Ticket created by JMASON
> Queue: Web-Scraper
> Subject: <br> tag should create whitespace for TEXT type
> Broken in: 0.20
> Severity: Normal
> Owner: Nobody
> Requestors: JMASON@cpan.org
> Status: new
> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=29799 >
>
>
> hi!
>
> quick report -- probably easiest if I demo it. This scraper:
>
> use URI;
> use Web::Scraper;
> my $s_show = scraper { process "span.tableListing-date", date =>
> 'TEXT'; };
> my $starturl = "http://www.ticketmaster.ie/venue/198299";
> my $res = $s_show->scrape( URI->new($starturl));
> use Data::Dumper; die "JMD ".Dumper($res);
>
> runs against a Ticketmaster page with this HTML:
>
> <span class="tableListing-date">Sat
> 06/10/07<br>20:00</span></td>
>
> it should produce something like
>
> JMD $VAR1 = {
> 'date' => 'Sat 06/10/07 20:00'
> };
>
> (or maybe with a \n.) instead it produces
>
> JMD $VAR1 = {
> 'date' => 'Sat 06/10/0720:00'
> };
>
>
> note the missing whitespace in place of the <br>.
>
> Web::Scraper is great fun btw, I'm amazed how easy this is ;)
>
-- Tatsuhiko Miyagawa
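
For anyone following along, the as_text behaviour described in this reply can be reproduced without Web::Scraper. This is only a minimal sketch: it assumes HTML::TreeBuilder is installed and reuses the markup from the report rather than fetching the live page. It should print the same run-together string as in the report.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Markup copied from the report (the stray </td> is left out).
my $html = '<span class="tableListing-date">Sat 06/10/07<br>20:00</span>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $span = $tree->look_down(_tag => 'span', class => 'tableListing-date');

# as_text concatenates the text nodes under the element;
# the <br> between them contributes no characters.
my $text = $span->as_text;
print "$text\n";

$tree->delete;
```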
hmm, judging by the response on that bug, as_text may not be an appropriate method to use:

'If I have a block of HTML 3, for example, that reads:

<xmp><br></xmp>

That <br> should not be converted, but a blind regexp engine would convert it. Beyond that, <br> is not the only element that would need this treatment. People expect the same with <hr> as well as <p>, <div>, <blockquote> and other block-level elements.

as_text was never intended to be used as a sanitization method nor a display method - the man page specifically states that it is the concatenation of text elements as the tree is descended. Changing that is a design decision and won't be considered until the major version is bumped up to 4.0, which is down the road quite a ways.'

I don't agree, but I can see his point to a degree. I guess some other way of rendering text blocks is necessary :(
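
One possible workaround in the meantime, offered as a sketch and not as a fix in either module: pass a code reference to process instead of 'TEXT', and replace each <br> with a space on a copy of the element before calling as_text. This assumes the installed Web::Scraper accepts a code reference as the value and hands it the matched HTML::Element. text_with_breaks is a hypothetical helper name, not part of Web::Scraper or HTML::Element, and it only handles <br>, not the block-level elements mentioned in the quoted response.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical helper: as_text, but with each <br> turned into a space.
sub text_with_breaks {
    my ($elem) = @_;
    my $copy = $elem->clone;    # work on a copy, leave the scraped tree alone
    $_->replace_with(' ') for $copy->look_down(_tag => 'br');
    my $text = $copy->as_text;
    $copy->delete;
    return $text;
}

my $s_show = scraper {
    process 'span.tableListing-date', date => \&text_with_breaks;
};

# Markup from the report, passed as a string instead of fetching the URL.
my $html = '<span class="tableListing-date">Sat 06/10/07<br>20:00</span>';
my $res  = $s_show->scrape($html);
print $res->{date}, "\n";
```

Because this works on the parsed tree and not on the raw HTML, a literal <br> inside <xmp> should stay untouched as long as the parser treats it as text, which is the objection raised against a blind regexp.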