Bug #21008 for HTML-Strip: Intermittent overkill after <>

Tue Aug 15 21:14:05 2006 bellaire [...] ad.ufl.edu - Ticket created

Subject:	Intermittent overkill after <>
Date:	Tue, 15 Aug 2006 21:13:12 -0400
To:	<bug-HTML-Strip [...] rt.cpan.org>
From:	"Bellaire,Adam P" <bellaire [...] ad.ufl.edu>

Hi there,

I've been using HTML::Strip for some time now, and it's great. However, I've recently found a problem that seems to be caused by the presence of a single <> in the text to be stripped, and what's more, it only happens intermittently.

I'm using HTML::Strip to remove tags from a series of chunks of text, all of the form:

    From: <>
    a. Title: some text
    b. Etc..

These are emails that are submitted through a simple HTML form. There are no real HTML tags in them, just the single lonely <> sequence. When I view the stripped version of the text, some of the emails are intact (sans <>), and others have only the From: line, and nothing else that was to follow the <>.

What's more, from run to run different emails will be affected by this apparent bug. That is, when I view a set of 10 emails, on one run six of them will be truncated, and the others will be fine. On another run, only two will be truncated, and the others fine, and there seems to be no correlation between the content of the text and when this bug will appear.

I've worked around the problem by stripping the character sequence <> using a perl regex and then handing the result to HTML::Strip, and this solves the problem completely. But I'd much rather not have to use the regex, and I thought I should report this bug in case anyone else might be affected.

Thanks again for this terrific module!
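The report does not include the regex used for the workaround. As a rough sketch of what such a pre-processing step might look like, assuming a plain global substitution on the literal two-character sequence <>, something like the following would do. The helper name remove_empty_brackets is hypothetical and is not part of HTML::Strip.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): delete every literal "<>"
# pair so that the stripper never sees an empty tag.
sub remove_empty_brackets {
    my ($text) = @_;
    $text =~ s/<>//g;
    return $text;
}

# Sample input modelled on the message layout described in the report.
my $raw      = "From: <>\na. Title: some text\nb. Etc..\n";
my $prepared = remove_empty_brackets($raw);
print $prepared;

# $prepared would then be handed to HTML::Strip as usual, for example:
#   my $hs    = HTML::Strip->new;
#   my $clean = $hs->parse($prepared);
#   $hs->eof;
```

The substitution only touches the exact sequence <>, so any real tags in the text are still left for HTML::Strip to handle.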

Wed Feb 13 13:05:24 2008 a.r.ferreira [...] gmail.com - Correspondence added

Subject:	[rt.cpan.org #21008] Intermittent overkill after <>
Date:	Wed, 13 Feb 2008 16:04:39 -0200
To:	bug-html-strip <bug-HTML-Strip [...] rt.cpan.org>, "Alex Bowley" <kilinrax [...] cpan.org>
From:	"Adriano Ferreira" <a.r.ferreira [...] gmail.com>

I cannot reproduce this issue. It may be caused by using the same HTML stripper on multiple e-mails without resetting it after each one. Even so, some tags (and possibly embedded quotes) would have to be present in the text for leftover stripper state to cause these problems.

The attached test passes, even though it tries to trigger the reported problem. Please let me know how it goes for you. If this problem still affects you, please provide a minimal sample so that the issue can be reproduced and the cause of the misbehavior tracked down.

As the documentation says, you probably want to use the module like this:

    use HTML::Strip;

    my $hs = HTML::Strip->new;
    for my $msg (@messages) {
        my $clean_msg = $hs->parse($msg);
        $hs->eof;    # reset after each message
    }

    ###########
    # the mentioned test (which is also attached)
    # http://rt.cpan.org/Public/Bug/Display.html?id=21008

    use Test::More 'no_plan';

    BEGIN { use_ok 'HTML::Strip' }

    # edge cases around a bare <>
    {
        my $hs = HTML::Strip->new();

        is( $hs->parse( "a<>b" ), "a b", 'edge case with <> ok' );
        $hs->eof;

        is( $hs->parse( "a<>b c<>d" ), "a b c d", 'edge case with <>s ok' );
        $hs->eof;

        # parsed twice on purpose, with no eof in between
        is( $hs->parse( "From: <>\n\na. Title: some text\n\nb. etc\n" ),
            "From: \n\na. Title: some text\n\nb. etc\n", 'test case' );
        is( $hs->parse( "From: <>\n\na. Title: some text\n\nb. etc\n" ),
            "From: \n\na. Title: some text\n\nb. etc\n", 'test case' );
        $hs->eof;

        is( $hs->parse( q{this is an "example" with 'quoted' parts that should not be stripped} ),
            q{this is an "example" with 'quoted' parts that should not be stripped} );
    }

Message body is not shown because sender requested not to inline it.

Wed Feb 13 13:05:26 2008 The RT System itself - Status changed from 'new' to 'open'

Wed Sep 24 08:27:31 2014 KILINRAX [...] cpan.org - Correspondence added

Test case added in 1.07, which passes: https://metacpan.org/release/KILINRAX/HTML-Strip-1.07

Wed Sep 24 08:27:33 2014 KILINRAX [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Apr 27 09:13:37 2016 KILINRAX [...] cpan.org - Correspondence added

On Wed Sep 24 08:27:31 2014, KILINRAX wrote:

> Test case added in 1.07, which passes:
> https://metacpan.org/release/KILINRAX/HTML-Strip-1.07

Wed Apr 27 09:13:38 2016 KILINRAX [...] cpan.org - Fixed in 1.07 added