The problem was that quote parsing was still active while parsing
comments, which it should not be.
<!-- comment with ' apos -->
<!-- comment with " quote -->
<!-- comment with ' both " -->
should all be stripped without paying any attention to the inner
content of the comment.
The change to strip_html.c is minimal, and the patch also documents
this rule in the POD of Strip.pm.
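To make the effect of the guard easier to see in isolation, here is a minimal standalone sketch in C. It is not the code from strip_html.c: the function name strip_sketch, its flags, and the simplified rule that a comment runs from "<!--" to "-->" are assumptions made for illustration only. The point is the `!in_comment` condition on the quote check, which mirrors the one-line change in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of HTML::Strip: copies `in` to `out`
   with tags and comments removed. `out` must have room for at least
   strlen(in) + 1 bytes. */
void strip_sketch(const char *in, char *out)
{
    int in_tag = 0;       /* inside <...> (comments count as tags) */
    int in_comment = 0;   /* inside <!-- ... --> */
    char quote = 0;       /* active quote character, or 0 if none */
    const char *p = in;

    assert(in != NULL && out != NULL);

    while (*p) {
        if (in_tag) {
            if (quote) {
                /* inside a quoted attribute value: wait for the match */
                if (*p == quote)
                    quote = 0;
                p++;
            } else if (!in_comment && (*p == '\'' || *p == '"')) {
                /* check for quote characters, but not in a comment */
                quote = *p++;
            } else if (in_comment) {
                /* only the terminator matters inside a comment */
                if (strncmp(p, "-->", 3) == 0) {
                    in_comment = 0;
                    in_tag = 0;
                    p += 3;
                } else {
                    p++;
                }
            } else {
                if (*p == '>')
                    in_tag = 0;
                p++;
            }
        } else if (*p == '<') {
            in_tag = 1;
            if (strncmp(p, "<!--", 4) == 0) {
                in_comment = 1;
                p += 4;
            } else {
                p++;
            }
        } else {
            *out++ = *p++;   /* ordinary text is kept */
        }
    }
    *out = '\0';
}
```

Without the `!in_comment` test, the lone ' in the first example above would open a quote that is never closed, and everything after it would be swallowed as part of the tag. Quotes in ordinary tags are still honoured, so a '>' inside an attribute value does not end the tag.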
A test checking the stripping of a few declarations and comments is
added. Note that it uses Test::More, as suggested in RT ticket
#33059.
Best,
Adriano Ferreira
diff -ruN HTML-Strip-1.06/MANIFEST HTML-Strip/MANIFEST
--- HTML-Strip-1.06/MANIFEST 2006-02-10 09:14:25.000000000 -0200
+++ HTML-Strip/MANIFEST 2008-02-13 15:15:35.000000000 -0200
@@ -8,3 +8,4 @@
strip_html.c
typemap
test.pl
+t/comment.t
diff -ruN HTML-Strip-1.06/strip_html.c HTML-Strip/strip_html.c
--- HTML-Strip-1.06/strip_html.c 2006-02-10 09:14:25.000000000 -0200
+++ HTML-Strip/strip_html.c 2008-02-13 14:58:33.000000000 -0200
@@ -61,8 +61,9 @@
}
} else {
/* not in a quote */
- /* check for quote characters */
- if( *p_raw == '\'' || *p_raw == '\"' ) {
+ /* check for quote characters, but not in a comment */
+ if( !stripper->f_in_comment &&
+ ( *p_raw == '\'' || *p_raw == '\"' ) ) {
stripper->f_in_quote = 1;
stripper->quote = *p_raw;
/* reset lastchar_* flags in case we have something
perverse like '-"' or '/"' */
diff -ruN HTML-Strip-1.06/Strip.pm HTML-Strip/Strip.pm
--- HTML-Strip-1.06/Strip.pm 2006-02-10 09:18:32.000000000 -0200
+++ HTML-Strip/Strip.pm 2008-02-13 15:24:23.000000000 -0200
@@ -136,7 +136,9 @@
declaration or a comment. Within such tags, C<E<gt>> characters do not
end the tag if they appear within pairs of double dashes (e.g. C<E<lt>!--
E<lt>a href="old.htm"E<gt>old pageE<lt>/aE<gt> --E<gt>> would be
-stripped completely).
+stripped completely). Inside a comment, no parsing for quotes
+is done as well. (That means C<E<lt>!-- comment with ' quote " --E<gt>>
+are entirely stripped.)
=back
diff -ruN HTML-Strip-1.06/t/comment.t HTML-Strip/t/comment.t
--- HTML-Strip-1.06/t/comment.t 1969-12-31 21:00:00.000000000 -0300
+++ HTML-Strip/t/comment.t 2008-02-13 15:03:10.000000000 -0200
@@ -0,0 +1,38 @@
+
+# http://rt.cpan.org/Public/Bug/Display.html?id=32355
+
+use Test::More tests => 7;
+
+BEGIN { use_ok 'HTML::Strip' }
+
+# stripping declarations
+{
+ my $hs = HTML::Strip->new();
+ is( $hs->parse( q{<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>Text</html>} ),
+ "Text", 'decls are stripped' );
+ $hs->eof;
+}
+
+# stripping comments
+{
+ my $hs = HTML::Strip->new();
+ is( $hs->parse( q{<html><!-- a comment to be stripped -->Hello World!</html>} ),
+ "Hello World!", "comments are stripped" );
+ $hs->eof;
+
+ is( $hs->parse( q{<html><!-- comment with a ' apos -->Hello World!</html>} ),
+ "Hello World!", q{comments may contain '} );
+ $hs->eof;
+
+ is( $hs->parse( q{<html><!-- comment with a " quote -->Hello World!</html>} ),
+ "Hello World!", q{comments may contain "} );
+ $hs->eof;
+
+ is( $hs->parse( q{<html><!-- comment -- "quote" >Hello World!</html>} ),
+ "Hello World!", "weird decls are stripped" );
+ $hs->eof;
+
+ is( $hs->parse( "a<>b" ),
+ "a b", 'edge case with <> ok' );
+
+}