Bug #4399 for HTML-Parser: Bug in HTML::PullParser / HTML::TokeParser

Fri Nov 14 12:11:48 2003 Guest - Ticket created

Subject:	Bug in HTML::PullParser / HTML::TokeParser

Hello, I am Arun Persad, using Activestate Perl 5.8 on Win NT.

I think I've found a bug in HTML::PullParser / HTML::TokeParser - text tokens are not tokenized consistently. Some text tokens (which should count as a single token) are being split. You can see this inconsistency by looking at $VAR7 to $VAR12 in the output of the demo script. 'British Newspaper Index' is treated as a single chunk by HTML::Parser, but I don't see why HTML::TokeParser should split it into two tokens - it is a single token in the html source.

Thanks,
Arun

# demo script
use strict;
use HTML::Parser;
use HTML::PullParser;
use HTML::TokeParser;
use LWP::Simple;
use Data::Dumper;

my $Pat = qr/British|Newspaper/;
my $url = 'http://www.bl.uk/collections/wider/eresources/title/eresourcesb.html';
my $content = get($url) or die "Couldn't get $url\n";

use_html_parser();
use_pullparser();
use_tokeparser();

# Test subs - in each case, look for text tokens containing the word 'British'

sub use_html_parser {
    print "Parsing with HTML::Parser ...\n";
    my @text;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h => [ sub {push @text, $_[0] if $_[0] =~ /$Pat/}, "dtext" ]
    );
    $p->parse($content) || die $!;
    print Dumper(@text), "\n";
}

sub use_pullparser {
    print "Parsing with HTML::PullParser ...\n";
    my @text;
    my $p = HTML::PullParser->new(
        doc => \$content,
        text => '@{dtext}',
    ) || die $!;
    while (my $token = $p->get_token) {
        push @text, $token if $token =~ /$Pat/;
    }
    print Dumper(@text), "\n";
}

sub use_tokeparser {
    print "Parsing with HTML::TokeParser ...\n";
    my @text;
    my $p = HTML::TokeParser->new(\$content) or die "$!";
    while(my $tok = $p->get_token) {
        if ($tok->[0] eq 'T' && $tok->[1] =~ /$Pat/) {
            push @text, $tok->[1];
        }
    }
    print Dumper(@text), "\n";
}

Fri Nov 14 12:14:26 2003 Guest - Correspondence added

From:	Arun Persad

Here is a sample run:

Parsing with HTML::Parser ...
$VAR1 = 'Electronic resources in the British Library St Pancras Reading Rooms';
$VAR2 = 'lectronic resources in the British Library';
$VAR3 = ' Contains details of over 325,000 published conference proceedings held by the British Library Document Supply Centre. ';
$VAR4 = ' Contains records of over 475,000 serials collected by the British Library Document Supply Centre since 1960. The records consist of holdings of the following libraries: the British Library, the Science Museum Library (SML), Cambridge University Library (CUL). Where available, holding information is given. ';
$VAR5 = 'British Humanities Index (BHInet)';
$VAR6 = 'British Library Catalogue';
$VAR7 = ' British Library Catalogue to 1995 on CD-ROM. ';
$VAR8 = 'British Library Map Catalogue';
$VAR9 = ' The map catalogue brings together records for most of the British Library\'s materials relating to Maps, including atlases, globes and printed books (many of which do not have \'Maps\' shelfmarks and many Maps held by the Department of Manuscripts). ';
$VAR10 = 'British National Bibliography';
$VAR11 = ' British Newspaper Index ';
$VAR12 = ' British Nursing Index';
$VAR13 = ' British Pharmacopoeia 2001';
$VAR14 = '© The British Library';

Parsing with HTML::PullParser ...
$VAR1 = 'Electronic resources in the British Library St Pancras Reading Rooms';
$VAR2 = 'lectronic resources in the British Library';
$VAR3 = ' Contains details of over 325,000 published conference proceedings held by the British Library Document Supply Centre. ';
$VAR4 = ' British Library Document Supply Centre since 1960. The records consist of holdings of the following libraries: the British Library, the Science Museum Library (SML), Cambridge University Library (CUL). Where available, holding information is given. ';
$VAR5 = 'British Humanities Index (BHInet)';
$VAR6 = 'British Library Catalogue';
$VAR7 = ' British Library Catalogue to 1995 on CD-ROM. 
';
$VAR8 = 'British Library Map Catalogue';
$VAR9 = ' The map catalogue brings together records for most of the British Library\'s materials relating to Maps, including atlases, globes and printed books (many of which do not have \'Maps\' shelfmarks';
$VAR10 = 'British National Bibliography';
$VAR11 = ' British';
$VAR12 = ' Newspaper Index ';
$VAR13 = ' British Nursing Index';
$VAR14 = ' British Pharmacopoeia 2001';
$VAR15 = '© The British Library';

Parsing with HTML::TokeParser ...
$VAR1 = 'Electronic resources in the British Library St Pancras Reading Rooms';
$VAR2 = 'lectronic resources in the British Library';
$VAR3 = ' Contains details of over 325,000 published conference proceedings held by the British Library Document Supply Centre. ';
$VAR4 = ' British Library Document Supply Centre since 1960. The records consist of holdings of the following libraries: the British Library, the Science Museum Library (SML), Cambridge University Library (CUL). Where available, holding information is given. ';
$VAR5 = 'British Humanities Index (BHInet)';
$VAR6 = 'British Library Catalogue';
$VAR7 = ' British Library Catalogue to 1995 on CD-ROM. ';
$VAR8 = 'British Library Map Catalogue';
$VAR9 = ' The map catalogue brings together records for most of the British Library\'s materials relating to Maps, including atlases, globes and printed books (many of which do not have \'Maps\' shelfmarks';
$VAR10 = 'British National Bibliography';
$VAR11 = ' British';
$VAR12 = ' Newspaper Index ';
$VAR13 = ' British Nursing Index';
$VAR14 = ' British Pharmacopoeia 2001';
$VAR15 = '© The British Library';

Fri Nov 14 13:49:44 2003 gisle [...] ActiveState.com - Correspondence added

To:	bug-HTML-Parser [...] rt.cpan.org
CC:	"AdminCc of cpan Ticket #4399": ;
Subject:	Re: [cpan #4399] Bug in HTML::PullParser / HTML::TokeParser
From:	Gisle Aas <gisle [...] ActiveState.com>
Date:	14 Nov 2003 10:21:12 -0800
RT-Send-Cc:

"Guest via RT" <bug-HTML-Parser@rt.cpan.org> writes:

> Hello, I am Arun Persad, using Activestate Perl 5.8 on Win NT.

Hi, I'm Gisle Aas, working for ActiveState, but using Redhat Linux.

> I think I've found a bug in HTML::PullParser / HTML::TokeParser -
> text tokens are not tokenized consistently. Some text tokens (which
> should count as a single token) are being split.

That's just how things work. Text can be split up arbitrarily by the parser. The only guarantee is that a word (a sequence of non-whitespace characters) will not be split up. Either you cope, or you turn on the option

    $parser->unbroken_text(1)

to ask the parser to splice text segments together for you.

Regards,
Gisle
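For readers who want to see the option in action, here is a minimal sketch, not taken from the ticket itself. It feeds HTML::Parser a small document in two chunks and collects the text segments with and without unbroken_text. The helper name collect_text and the two-chunk input are assumptions made for illustration; the exact way the default mode splits the text is deliberately not asserted.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): feed the parser the
# given chunks one at a time and return every text segment it reports.
sub collect_text {
    my ($unbroken, @chunks) = @_;
    my @text;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { push @text, $_[0] }, "dtext" ],
    );
    # Ask the parser to join adjacent text segments into one event.
    $p->unbroken_text(1) if $unbroken;
    $p->parse($_) for @chunks;
    $p->eof;    # signal end of document so pending text is flushed
    return @text;
}

# Illustrative input: the chunk boundary falls inside the text.
my @chunks = ('<p>British News', 'paper Index</p>');

my @default = collect_text(0, @chunks);
my @joined  = collect_text(1, @chunks);

print scalar(@default), " segment(s) by default\n";
print scalar(@joined), " segment(s) with unbroken_text: '$joined[0]'\n";
```

Since HTML::PullParser and HTML::TokeParser are subclasses of HTML::Parser, the same method call should be usable on the $p objects in the demo script above, right after they are constructed and before the first get_token call.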

Thu Apr 01 07:12:18 2004 GAAS [...] cpan.org - Status changed from 'new' to 'resolved'