This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 3166
Status: resolved
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: siegmann [...] tinbergen.nl
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)



Subject: Make get_text accept multiple tokens to read up to
As it is now, get_text accepts only one end tag, i.e., $p->get_text( [$endtag] ). But what if I want to get the text up to an <a>, <img>, or <frame> token, whichever comes first? This is extremely useful for retrieving text from a page together with its surrounding links or images; see WWW::Mechanize. Please consider letting get_text take the same arguments as get_tag, namely ( [$tag, ...] ). I still have a three-line patch lying around, if you are interested.
My mailbox is a mess. Can you post the patch you suggest here?
Subject: patch
From: siegmann [...] tinbergen.nl
[GAAS - Fri Oct 3 08:50:30 2003]:
> My mailbox is a mess. Can you post the patch you suggest here?
Here is the email again, patch attached (arjen.diff).

cheers, arjen

I have thought about the two ways of extending the get_text sub in HTML::TokeParser: (1) let it take a list of arguments, or (2) let it take a reference to an array. I think the array reference (2) could be useful in the future, but it breaks backward compatibility, doesn't it? Option 1 is a one-line patch (excluding the documentation changes, which I have also made), and existing calls to get_text remain valid.

It would be great if you could consider the attached patch, which implements option 1. Please let me know what you think.
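To make the two calling styles concrete, here is a minimal sketch of how a list of end tags (option 1) and an array reference (option 2) could both be reduced to the same plain list. The helper name normalize_endtags is hypothetical and is not part of HTML::TokeParser or of the attached patch; it only illustrates the trade-off discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::TokeParser: flatten either
# calling style into a plain list of end tags.
sub normalize_endtags {
    # Drop undefined arguments, then expand any array reference in place.
    return map { ref($_) eq "ARRAY" ? @$_ : $_ } grep { defined $_ } @_;
}

my @from_list = normalize_endtags("a", "img", "frame");    # option 1: plain list
my @from_ref  = normalize_endtags(["a", "img", "frame"]);  # option 2: array reference
my @single    = normalize_endtags("a");                    # existing one-tag call style

print join(",", @from_list), "\n";
print join(",", @from_ref),  "\n";
print join(",", @single),    "\n";
```

Under this assumption an existing single-tag call keeps working in both styles; whether such a dual interface is desirable is a separate design question.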
--- TokeParser.pm Tue Apr 10 19:44:04 2001
+++ TokeParser_new.pm Sat Mar 15 19:07:38 2003
@@ -88,7 +88,7 @@
         } else {
             $tag = "/$tag";
         }
-        if (!defined($endat) || $endat eq $tag) {
+        if (!defined($endat) || grep { $_ eq $tag } ($endat,@_) ) {
             $self->unget_token($token);
             last;
         }
@@ -200,13 +200,15 @@
 
   ["/$tag", $text]
 
-=item $p->get_text( [$endtag] )
+=item $p->get_text( [$endtag, ...] )
 
 This method returns all text found at the current position. It will
-return a zero length string if the next token is not text. The
-optional $endtag argument specifies that any text occurring before the
-given tag is to be returned. Any entities will be converted to their
-corresponding character.
+return a zero length string if the next token is not text. If
+one or more arguments are given, then we return any text occurring before the first of the specified tags found. For example:
+
+    $p->get_text("p", "br");
+
+will return the text up to either a paragraph of linebreak element. Any entities will be converted to their corresponding character.
 
 The $p->{textify} attribute is a hash that defines how certain tags can
 be treated as text. If the name of a start tag matches a key in this
@@ -225,7 +227,7 @@
 This means that <IMG> and <APPLET> tags are treated as text, and that
 the text to substitute can be found in the ALT attribute.
 
-=item $p->get_trimmed_text( [$endtag] )
+=item $p->get_trimmed_text( [$endtag, ...] )
 
 Same as $p->get_text above, but will collapse any sequences of white
 space to a single space character. Leading and trailing white space is
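The changed condition in the patch can be exercised in isolation. The following is a rough sketch only: text_until and sample_tokens are hypothetical names, and the token arrays (["T", $text], ["S", $tag], ["E", $tag]) are a simplified stand-in for the parser's token stream, not the real HTML::TokeParser API. Only the grep test is taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the loop in get_text. Tokens are consumed
# from the front of the array referenced by $tokens.
sub text_until {
    my $tokens = shift;   # array ref of simplified tokens
    my $endat  = shift;   # first end tag, may be undef (as in the original code)
    my @text;
    while (my $token = shift @$tokens) {
        my ($type, $value) = @$token;
        if ($type eq "T") {
            push @text, $value;
            next;
        }
        # Start tags are matched by name, end tags as "/name".
        my $tag = $type eq "S" ? $value : "/$value";
        # Condition from the patch: with no end tag, stop at any tag;
        # otherwise stop at the first tag matching one of the arguments.
        if (!defined($endat) || grep { $_ eq $tag } ($endat, @_)) {
            unshift @$tokens, $token;   # plays the role of unget_token
            last;
        }
    }
    return join "", @text;
}

# Illustrative token stream for: Hello <b>world</b><br>next line<p>
sub sample_tokens {
    return (["T", "Hello "], ["S", "b"], ["T", "world"], ["E", "b"],
            ["S", "br"], ["T", "next line"], ["S", "p"]);
}

print text_until([sample_tokens()], "p", "br"), "\n";   # stops at <br>
print text_until([sample_tokens()], "p"), "\n";         # stops at <p>
print text_until([sample_tokens()]), "\n";              # stops at the first tag
```

Because the first argument is still read into $endat and the rest stay in @_, a call with a single end tag behaves as before, which is the backward compatibility the email describes.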