Bug #3166 for HTML-Parser: Make get_text accept multiple tokens to read up to

Wed Aug 06 05:36:40 2003 Guest - Ticket created

Subject:	Make get_text accept multiple tokens to read up to

As it is now, get_text accepts only one end tag, i.e., $p->get_text( [$endtag] ). But what if I want to get the text up to either an <a>, <img>, or <frame> token, for example? This is extremely useful when retrieving text from a page together with its surrounding links or images; see WWW::Mechanize.

Please consider giving get_text the same arguments as get_tag, namely ([$tag, ...]). I (still) have a three-line patch lying around; if you are interested, please consider it!

Fri Oct 03 08:50:30 2003 GAAS [...] cpan.org - Correspondence added

My mailbox is a mess. Can you post the patch you suggest here?

Fri Oct 03 14:32:26 2003 Guest - Correspondence added

Subject:	patch
From:	siegmann [...] tinbergen.nl

[GAAS - Fri Oct 3 08:50:30 2003]:

> My mailbox is a mess. Can you post the patch you suggest here?

Here is the email again, patch attached (arjen.diff).

cheers, arjen

I have thought about the two ways of extending the get_text sub in HTML::TokeParser: (1) let it take an array (list) of arguments, or (2) let it take a reference to an array. I think the array reference (2) could be useful in the future, but it breaks backward compatibility, doesn't it? Option 1 is a one-line patch (excluding documentation changes, which I've also made), and existing calls to get_text remain valid.

It would be great if you could consider the attached patch, which implements option 1. Please let me know what you think.

--- TokeParser.pm	Tue Apr 10 19:44:04 2001
+++ TokeParser_new.pm	Sat Mar 15 19:07:38 2003
@@ -88,7 +88,7 @@
             } else {
                 $tag = "/$tag";
             }
-            if (!defined($endat) || $endat eq $tag) {
+            if (!defined($endat) || grep { $_ eq $tag } ($endat,@_) ) {
                 $self->unget_token($token);
                 last;
             }
@@ -200,13 +200,15 @@
 
   ["/$tag", $text]
 
-=item $p->get_text( [$endtag] )
+=item $p->get_text( [$endtag, ...] )
 
 This method returns all text found at the current position. It will
-return a zero length string if the next token is not text.  The
-optional $endtag argument specifies that any text occurring before the
-given tag is to be returned.  Any entities will be converted to their
-corresponding character.
+return a zero length string if the next token is not text. If
+one or more arguments are given, then we return any text occurring before the first of the specified tags found. For example:
+
+   $p->get_text("p", "br");
+
+will return the text up to either a paragraph of linebreak element. Any entities will be converted to their corresponding character.
 
 The $p->{textify} attribute is a hash that defines how certain tags can
 be treated as text.  If the name of a start tag matches a key in this
@@ -225,7 +227,7 @@
 This means that <IMG> and <APPLET> tags are treated as text, and that
 the text to substitute can be found in the ALT attribute.
 
-=item $p->get_trimmed_text( [$endtag] )
+=item $p->get_trimmed_text( [$endtag, ...] )
 
 Same as $p->get_text above, but will collapse any sequences of white
 space to a single space character.  Leading and trailing white space is
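
For readers who want to try the matching rule from the patch without HTML::TokeParser installed, here is a minimal standalone sketch. It is an illustration only, not the module's code: text_up_to is a hypothetical helper, and the token shape (["T", $text], ["S", $tag], ["E", $tag]) is a simplified assumption loosely modelled on what get_token returns. As in the patch, end tags are compared as "/$tag", and with no end tags given the scan stops at the first tag of any kind.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::TokeParser.
# $tokens is an array ref of simplified tokens (assumed shape):
#   ['T', $text]  text,  ['S', $tag]  start tag,  ['E', $tag]  end tag
sub text_up_to {
    my ($tokens, @endtags) = @_;
    my @text;
    for my $token (@$tokens) {
        my ($type, $value) = @$token;
        if ($type eq 'T') {
            push @text, $value;
            next;
        }
        # End tags are matched with a leading slash, as in the patch.
        my $tag = $type eq 'E' ? "/$value" : $value;
        # No end tags given: stop at the first tag.
        # Otherwise: stop at the first tag that matches any of them.
        last if !@endtags || grep { $_ eq $tag } @endtags;
    }
    return join '', @text;
}

my @tokens = (
    ['T', 'Read '],
    ['S', 'b'],
    ['T', 'this'],
    ['E', 'b'],
    ['T', ' first. '],
    ['S', 'img'],
    ['T', 'Not this.'],
);

# Collects the text before <img>, skipping over <b> and </b>.
print text_up_to(\@tokens, 'a', 'img', 'frame'), "\n";

# With no end tags, stops at the first tag (<b>).
print text_up_to(\@tokens), "\n";
```

The first argument plays the role of $endat in the patch and the remaining ones the role of @_, so a call with a single end tag behaves the same as before.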

Fri Oct 10 06:47:03 2003 GAAS [...] cpan.org - Status changed from 'new' to 'resolved'