Bug #32355 for HTML-Strip: Problem in stripping html comments which includes a ' or "

Wed Jan 16 10:05:10 2008 pha [...] intratools.de - Ticket created

There is a problem when stripping HTML code like this:

<html>  Hello World! </html>

If I remove the ' and " it works. The problem only appears if the ' is inside an HTML comment. Also,

<html>  Hello World! </html>

has the same problem. The "parse" method returns an empty string, but it should return "Hello World!".

Wed Feb 13 12:36:55 2008 a.r.ferreira [...] gmail.com - Correspondence added

Subject:	[rt.cpan.org #32355] [PATCH] Problem in stripping html comments which includes a ' or "
Date:	Wed, 13 Feb 2008 15:35:41 -0200
To:	bug-html-strip <bug-HTML-Strip [...] rt.cpan.org>, "Alex Bowley" <kilinrax [...] cpan.org>
From:	"Adriano Ferreira" <a.r.ferreira [...] gmail.com>

The problem was that quote parsing was still active while parsing comments, which it should not be. Comments containing quote characters should all be stripped without paying any attention to the inner content of the comment.

The change to strip_html.c was minimal, and the patch also adds that desirable rule to the documentation. A test checking the stripping of a few declarations and comments is added. Note that it uses Test::More as suggested in RT ticket #33059.

Best,
Adriano Ferreira

diff -ruN HTML-Strip-1.06/MANIFEST HTML-Strip/MANIFEST
--- HTML-Strip-1.06/MANIFEST 2006-02-10 09:14:25.000000000 -0200
+++ HTML-Strip/MANIFEST 2008-02-13 15:15:35.000000000 -0200
@@ -8,3 +8,4 @@
 strip_html.c
 typemap
 test.pl
+t/comment.t
diff -ruN HTML-Strip-1.06/strip_html.c HTML-Strip/strip_html.c
--- HTML-Strip-1.06/strip_html.c 2006-02-10 09:14:25.000000000 -0200
+++ HTML-Strip/strip_html.c 2008-02-13 14:58:33.000000000 -0200
@@ -61,8 +61,9 @@
         }
       } else {
         /* not in a quote */
-        /* check for quote characters */
-        if( *p_raw == '\'' || *p_raw == '\"' ) {
+        /* check for quote characters, but not in a comment */
+        if( !stripper->f_in_comment &&
+            ( *p_raw == '\'' || *p_raw == '\"' ) ) {
           stripper->f_in_quote = 1;
           stripper->quote = *p_raw;
           /* reset lastchar_* flags in case we have something perverse like '-"' or '/"' */
diff -ruN HTML-Strip-1.06/Strip.pm HTML-Strip/Strip.pm
--- HTML-Strip-1.06/Strip.pm 2006-02-10 09:18:32.000000000 -0200
+++ HTML-Strip/Strip.pm 2008-02-13 15:24:23.000000000 -0200
@@ -136,7 +136,9 @@
 declaration or a comment. Within such tags, C<E<gt>> characters do not
 end the tag if they appear within pairs of double dashes (e.g. C<E<lt>!--
 E<lt>a href="old.htm"E<gt>old pageE<lt>/aE<gt> --E<gt>> would be
-stripped completely).
+stripped completely). Inside a comment, no parsing for quotes
+is done as well. (That means C<E<lt>!-- comment with ' quote " --E<gt>>
+are entirely stripped.)
 
 =back
 
diff -ruN HTML-Strip-1.06/t/comment.t HTML-Strip/t/comment.t
--- HTML-Strip-1.06/t/comment.t 1969-12-31 21:00:00.000000000 -0300
+++ HTML-Strip/t/comment.t 2008-02-13 15:03:10.000000000 -0200
@@ -0,0 +1,38 @@
+
+# http://rt.cpan.org/Public/Bug/Display.html?id=32355
+
+use Test::More tests => 7;
+
+BEGIN { use_ok 'HTML::Strip' }
+
+# stripping declarations
+{
+  my $hs = HTML::Strip->new();
+  is( $hs->parse( q{<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>Text</html>} ),
+      "Text", 'decls are stripped' );
+  $hs->eof;
+}
+
+# stripping comments
+{
+  my $hs = HTML::Strip->new();
+  is( $hs->parse( q{<html>Hello World!</html>} ),
+      "Hello World!", "comments are stripped" );
+  $hs->eof;
+
+  is( $hs->parse( q{<html>Hello World!</html>} ),
+      "Hello World!", q{comments may contain '} );
+  $hs->eof;
+
+  is( $hs->parse( q{<html>Hello World!</html>} ),
+      "Hello World!", q{comments may contain "} );
+  $hs->eof;
+
+  is( $hs->parse( q{<html><!-- comment -- "quote" >Hello World!</html>} ),
+      "Hello World!", "weird decls are stripped" );
+  $hs->eof;
+
+  is( $hs->parse( "a<>b" ),
+      "a b", 'edge case with <> ok' );
+
+}
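
For readers who want to see the failure mode in isolation, here is a minimal sketch of the idea behind the patch. It is a hypothetical toy state machine, not the actual strip_html.c: the function name toy_strip, its signature, and the skip_quotes_in_comments flag are assumptions made purely for illustration. It tries to show why an unmatched ' or " inside a comment can swallow the rest of the input, and how skipping quote detection while inside a comment avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy stripper, NOT the real HTML::Strip code.
 * Copies the text found outside tags from `in` to `out`;
 * `out` must have room for strlen(in) + 1 bytes.
 * skip_quotes_in_comments == 0: quotes are tracked inside comments too
 *                               (the behaviour reported in this ticket).
 * skip_quotes_in_comments != 0: quotes inside comments are ignored
 *                               (the rule the patch introduces). */
void toy_strip(const char *in, char *out, int skip_quotes_in_comments)
{
    int in_tag = 0, in_comment = 0, in_quote = 0;
    char quote = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);

    for (const char *p = in; *p != '\0'; p++) {
        if (!in_tag) {
            if (*p == '<') {
                in_tag = 1;
                in_comment = (strncmp(p, "<!--", 4) == 0);
            } else {
                out[n++] = *p;              /* plain text: keep it */
            }
        } else if (in_quote) {
            if (*p == quote)                /* only the matching quote ends it */
                in_quote = 0;
        } else if ((*p == '\'' || *p == '"') &&
                   !(skip_quotes_in_comments && in_comment)) {
            in_quote = 1;
            quote = *p;
        } else if (*p == '>') {
            /* a comment ends only at "-->"; any other tag ends at '>' */
            if (!in_comment || (p[-1] == '-' && p[-2] == '-')) {
                in_tag = 0;
                in_comment = 0;
            }
        }
    }
    out[n] = '\0';
}
```

With the flag set to 0, the input <html><!-- it's -->Hello World!</html> enters quote state at the apostrophe and never leaves it, so the output is empty, which matches the symptom in the report. With the flag set to 1 the same input yields Hello World!, while quoted attribute values in ordinary tags are still honoured.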

Wed Feb 13 12:36:58 2008 The RT System itself - Status changed from 'new' to 'open'

Wed Jul 23 21:51:06 2008 http://cra.id.fedoraproject.org/ - Correspondence added

On Wed Feb 13 12:36:55 2008, a.r.ferreira@gmail.com wrote:

> The problem was that quote parsing was still active while
> parsing comments, which it should not be. Comments containing
> quote characters should all be stripped without paying any
> attention to the inner content of the comment.
>
> The change to strip_html.c was minimal, and the patch also adds
> that desirable rule to the documentation.

With this patch, "make test" fails for me on Fedora 9:

+ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/comment....
#   Failed test 'edge case with <> ok'
#   at t/comment.t line 35.
#          got: 'a'
#     expected: 'a b'
# Looks like you failed 1 test of 7.
dubious
        Test returned status 1 (wstat 256, 0x100)
DIED. FAILED test 7
        Failed 1/7 tests, 85.71% okay
Failed Test Stat Wstat Total Fail  List of Failed
-------------------------------------------------------------------------------
t/comment.t    1   256     7    1  7
Failed 1/1 test scripts. 1/7 subtests failed.
Files=1, Tests=7,  1 wallclock secs ( 0.06 cusr +  0.01 csys =  0.07 CPU)
Failed 1/1 test programs. 1/7 subtests failed.
make: *** [test_dynamic] Error 1

Wed Sep 24 08:30:00 2014 KILINRAX [...] cpan.org - Correspondence added

Test case added in 1.07, which passes: https://metacpan.org/release/KILINRAX/HTML-Strip-1.07

Wed Sep 24 08:30:02 2014 KILINRAX [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Apr 27 09:14:08 2016 KILINRAX [...] cpan.org - Correspondence added

On Wed Sep 24 08:30:00 2014, KILINRAX wrote:

> Test case added in 1.07, which passes:
> https://metacpan.org/release/KILINRAX/HTML-Strip-1.07

Wed Apr 27 09:14:09 2016 KILINRAX [...] cpan.org - Fixed in 1.07 added