
This queue is for tickets about the HTML-Strip CPAN distribution.

Report information
The Basics
Id: 32355
Status: resolved
Priority: 0/
Queue: HTML-Strip

People
Owner: Nobody in particular
Requestors: pha [...] intratools.de
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 1.06
Fixed in: 1.07



Subject: Problem in stripping html comments which includes a ' or "
There is a problem when stripping HTML code like this:

<html> <!-- I'm Buggy because of the '. with a " is the same problem --> Hello World! </html>

If I remove the ' and the ", it works. The problem only appears if the ' is inside <!-- -->.

This input has the same problem:

<html> <!-- //I'm Buggy because of the '. with a " is the same problem --> Hello World! </html>

The "parse" method returns an empty string, but it should return "Hello World!".
Subject: [rt.cpan.org #32355] [PATCH] Problem in stripping html comments which includes a ' or "
Date: Wed, 13 Feb 2008 15:35:41 -0200
To: bug-html-strip <bug-HTML-Strip [...] rt.cpan.org>, "Alex Bowley" <kilinrax [...] cpan.org>
From: "Adriano Ferreira" <a.r.ferreira [...] gmail.com>
The problem was that quote parsing was still active while parsing comments, but it should not be.

<!-- comment with ' apos -->
<!-- comment with " quote -->
<!-- comment with ' both " -->

should all be stripped without paying any attention to the inner content of the comment.

The change to strip_html.c was minimal, and the patch also documents that rule.

A test checking the stripping of a few declarations and comments is added. Note that it uses Test::More, as suggested in RT ticket #33059.

Best,
Adriano Ferreira

diff -ruN HTML-Strip-1.06/MANIFEST HTML-Strip/MANIFEST
--- HTML-Strip-1.06/MANIFEST 2006-02-10 09:14:25.000000000 -0200
+++ HTML-Strip/MANIFEST 2008-02-13 15:15:35.000000000 -0200
@@ -8,3 +8,4 @@
 strip_html.c
 typemap
 test.pl
+t/comment.t
diff -ruN HTML-Strip-1.06/strip_html.c HTML-Strip/strip_html.c
--- HTML-Strip-1.06/strip_html.c 2006-02-10 09:14:25.000000000 -0200
+++ HTML-Strip/strip_html.c 2008-02-13 14:58:33.000000000 -0200
@@ -61,8 +61,9 @@
         }
       } else {
         /* not in a quote */
-        /* check for quote characters */
-        if( *p_raw == '\'' || *p_raw == '\"' ) {
+        /* check for quote characters, but not in a comment */
+        if( !stripper->f_in_comment &&
+            ( *p_raw == '\'' || *p_raw == '\"' ) ) {
           stripper->f_in_quote = 1;
           stripper->quote = *p_raw;
           /* reset lastchar_* flags in case we have something perverse like '-"' or '/"' */
diff -ruN HTML-Strip-1.06/Strip.pm HTML-Strip/Strip.pm
--- HTML-Strip-1.06/Strip.pm 2006-02-10 09:18:32.000000000 -0200
+++ HTML-Strip/Strip.pm 2008-02-13 15:24:23.000000000 -0200
@@ -136,7 +136,9 @@
 declaration or a comment. Within such tags, C<E<gt>> characters do not
 end the tag if they appear within pairs of double dashes (e.g. C<E<lt>!--
 E<lt>a href="old.htm"E<gt>old pageE<lt>/aE<gt> --E<gt>> would be
-stripped completely).
+stripped completely). Inside a comment, no parsing for quotes
+is done as well. (That means C<E<lt>!-- comment with ' quote " --E<gt>>
+are entirely stripped.)
 
 =back
 
diff -ruN HTML-Strip-1.06/t/comment.t HTML-Strip/t/comment.t
--- HTML-Strip-1.06/t/comment.t 1969-12-31 21:00:00.000000000 -0300
+++ HTML-Strip/t/comment.t 2008-02-13 15:03:10.000000000 -0200
@@ -0,0 +1,38 @@
+
+# http://rt.cpan.org/Public/Bug/Display.html?id=32355
+
+use Test::More tests => 7;
+
+BEGIN { use_ok 'HTML::Strip' }
+
+# stripping declarations
+{
+  my $hs = HTML::Strip->new();
+  is( $hs->parse( q{<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>Text</html>} ),
+      "Text", 'decls are stripped' );
+  $hs->eof;
+}
+
+# stripping comments
+{
+  my $hs = HTML::Strip->new();
+  is( $hs->parse( q{<html><!-- a comment to be stripped -->Hello World!</html>} ),
+      "Hello World!", "comments are stripped" );
+  $hs->eof;
+
+  is( $hs->parse( q{<html><!-- comment with a ' apos -->Hello World!</html>} ),
+      "Hello World!", q{comments may contain '} );
+  $hs->eof;
+
+  is( $hs->parse( q{<html><!-- comment with a " quote -->Hello World!</html>} ),
+      "Hello World!", q{comments may contain "} );
+  $hs->eof;
+
+  is( $hs->parse( q{<html><!-- comment -- "quote" >Hello World!</html>} ),
+      "Hello World!", "weird decls are stripped" );
+  $hs->eof;
+
+  is( $hs->parse( "a<>b" ),
+      "a b", 'edge case with <> ok' );
+
+}
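
To illustrate why the one-line guard in strip_html.c matters, here is a minimal, self-contained sketch of a tag stripper with the same kind of quote tracking. It is not the HTML::Strip implementation: the function name strip_sketch, its flag skip_quotes_in_comments, and the simplified state handling are all assumptions made for illustration. With the flag set to 0, an unmatched ' inside a comment opens a "quote" that never closes, so everything after it is swallowed, which matches the empty result in the original report. With the flag set to 1, quote characters inside a comment are ignored, as in the patch.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical, simplified stripper (NOT the real strip_html.c).
 * Copies the text outside of tags from `raw` into `out`.
 * `out` must be at least strlen(raw) + 1 bytes.
 * skip_quotes_in_comments = 0 mimics the reported bug,
 * skip_quotes_in_comments = 1 mimics the patched behavior.
 */
void strip_sketch(const char *raw, char *out, int skip_quotes_in_comments)
{
    int in_tag = 0, in_comment = 0, in_quote = 0;
    char quote = 0;
    const char *p = raw;
    char *o = out;

    assert(raw != NULL && out != NULL);

    while (*p) {
        if (!in_tag) {
            if (*p == '<') {
                in_tag = 1;
                in_comment = (strncmp(p, "<!--", 4) == 0);
                if (in_comment)
                    p += 3;           /* skip the opener so its dashes cannot close it */
            } else {
                *o++ = *p;            /* ordinary text is kept */
            }
        } else if (in_quote) {
            if (*p == quote)
                in_quote = 0;         /* matching quote character ends the quote */
        } else if ((*p == '\'' || *p == '"') &&
                   !(skip_quotes_in_comments && in_comment)) {
            in_quote = 1;             /* start of a quoted attribute value */
            quote = *p;
        } else if (in_comment) {
            if (strncmp(p, "-->", 3) == 0) {
                in_tag = 0;
                in_comment = 0;
                p += 2;               /* land on '>' so the p++ below moves past it */
            }
        } else if (*p == '>') {
            in_tag = 0;               /* end of a normal tag */
        }
        p++;
    }
    *o = '\0';
}
```

Calling strip_sketch on <html><!-- I'm buggy -->Hello World!</html> with the flag set to 1 leaves "Hello World!" in the output buffer, while the same call with the flag set to 0 leaves an empty string. Quotes inside ordinary tags are still honored in both modes, so a > inside a quoted attribute value does not end the tag.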


On Wed Feb 13 12:36:55 2008, a.r.ferreira@gmail.com wrote:
> The problem was that quote parsing was still active while
> parsing comments, but it should not be.
>
> <!-- comment with ' apos -->
> <!-- comment with " quote -->
> <!-- comment with ' both " -->
>
> should all be stripped without paying any attention to the inner
> content of the comment.
>
> The change to strip_html.c was minimal, and the patch also
> documents that rule.
With this patch, "make test" fails for me on Fedora 9:

+ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/comment....
#   Failed test 'edge case with <> ok'
#   at t/comment.t line 35.
#          got: 'a'
#     expected: 'a b'
# Looks like you failed 1 test of 7.
dubious
        Test returned status 1 (wstat 256, 0x100)
DIED. FAILED test 7
        Failed 1/7 tests, 85.71% okay
Failed Test Stat Wstat Total Fail  List of Failed
-------------------------------------------------------------------------------
t/comment.t    1   256     7    1  7
Failed 1/1 test scripts. 1/7 subtests failed.
Files=1, Tests=7, 1 wallclock secs ( 0.06 cusr + 0.01 csys = 0.07 CPU)
Failed 1/1 test programs. 1/7 subtests failed.
make: *** [test_dynamic] Error 1
Test case added in 1.07, which passes: https://metacpan.org/release/KILINRAX/HTML-Strip-1.07
On Wed Sep 24 08:30:00 2014, KILINRAX wrote:
> Test case added in 1.07, which passes:
> https://metacpan.org/release/KILINRAX/HTML-Strip-1.07