Bug #25477 for HTML-Scrubber: self closing tags

Thu Mar 15 22:30:41 2007 nab83 [...] yahoo.com - Ticket created

Subject:	self closing tags
Date:	Thu, 15 Mar 2007 17:43:36 -0700 (PDT)
To:	bug-HTML-Scrubber [...] rt.cpan.org
From:	nabeel mohammed <nab83 [...] yahoo.com>

Hi,

I am trying to use HTML::Scrubber to clean some script tags and get the rest of the html. Here is an html fragment I am using:

    <script src="www.google.com/script.js" />

    <b> this is a line of bold </b>

    <script type="text/javascript">
    alert("hello")
    </script>

    <h> this is a line of bold </h>

And here is the perl code I am running:

    my $scrubber = new HTML::Scrubber;
    $scrubber->default(1);
    my $scrubbed = $scrubber->scrub( $text );

    print "$scrubbed";

All I see printed is

    <h> this is a line of bold </h>

Now I might be missing something really obvious, but I can't figure it out.

Thanks
Nabeel


Sun Jun 22 08:03:11 2008 trendele [...] imtek.de - Correspondence added

From:	trendele [...] imtek.de

This is because HTML::Parser ignores self-closing tags by default, and HTML::Scrubber does not set empty_element_tags(). I suggest adding this to HTML::Scrubber. In the meantime, you can set it manually as a workaround:

    my $scrubber = HTML::Scrubber->new;
    $scrubber->{_p}->empty_element_tags(1);

Now your example should work again.

On Thu Mar 15 22:30:41 2007, nab83@yahoo.com wrote:

> Hi,
> I am trying to use HTML::Scrubber to clean some script tags and get the
> rest of the html. Here is an html fragment I am using:
>
> <script src="www.google.com/script.js" />
>
> <b> this is a line of bold </b>
>
> <script type="text/javascript">
> alert("hello")
> </script>
>
> <h> this is a line of bold </h>
>
> And here is the perl code I am running:
>
> my $scrubber = new HTML::Scrubber;
> $scrubber->default(1);
> my $scrubbed = $scrubber->scrub( $text );
>
> print "$scrubbed";
>
> All I see printed is
>
> <h> this is a line of bold </h>
>
> Now I might be missing something really obvious, but I can't figure it
> out.
> Thanks
> Nabeel
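A self-contained sketch of the workaround described above, run against the HTML fragment from the original report. It assumes that HTML::Scrubber keeps its HTML::Parser object in the private `{_p}` slot, as the workaround does. Because that is an internal detail, it may be unnecessary or may break in other releases. The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# HTML fragment taken from the original report.
my $text = <<'HTML';
<script src="www.google.com/script.js" />
<b> this is a line of bold </b>
<script type="text/javascript">
alert("hello")
</script>
<h> this is a line of bold </h>
HTML

my $scrubber = HTML::Scrubber->new;
$scrubber->default(1);    # allow tags that have no explicit rule

# Workaround from this ticket: ask the underlying HTML::Parser to treat
# "<tag ... />" as an element that is opened and closed immediately.
# Assumption: {_p} is HTML::Scrubber's private parser object.
$scrubber->{_p}->empty_element_tags(1);

my $scrubbed = $scrubber->scrub($text);
print $scrubbed;
```

With the option enabled, the first `<script ... />` should no longer swallow everything up to the later `</script>`, so the `<b>` line is expected to survive scrubbing.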


Sun Jun 22 08:03:14 2008 The RT System itself - Status changed from 'new' to 'open'

Wed Apr 22 17:43:34 2009 MARKSTOS [...] cpan.org - Correspondence added

On Sun Jun 22 08:03:11 2008, trendele@imtek.de wrote:

> This is because HTML::Parser ignores self-closing tags by default, and
> HTML::Scrubber does not set empty_element_tags().
> I suggest adding this to HTML::Scrubber. In the meantime, you can set it
> manually as a workaround:
>
> my $scrubber = HTML::Scrubber->new;
> $scrubber->{_p}->empty_element_tags(1);

This proposed patch would cause another test to fail in t/07_booleans. In particular, after parsing, this:

    <br />

would become:

    <br></br>

That result is with 3.56. Maybe newer HTML::Parsers are smarter.

Mark
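To see where the extra end tag may come from, the following sketch drives HTML::Parser directly and records the start and end events it reports for `<br />`, first with empty_element_tags off and then with it on. `tag_events` is a hypothetical helper written for this illustration, not part of either module. The HTML::Parser documentation says that enabling the option produces an artificial end event after the start event, which would account for a scrubber emitting both an opening and a closing tag.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: parse $html and return the list of tag events seen.
sub tag_events {
    my ($html, $empty_tags) = @_;
    my @events;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
        end_h       => [ sub { push @events, "end:$_[0]" },   'tagname' ],
    );
    $p->empty_element_tags($empty_tags);
    $p->parse($html);
    $p->eof;
    return @events;
}

my @default = tag_events('<br />', 0);    # option off (parser default)
my @enabled = tag_events('<br />', 1);    # option on (the proposed change)

print "default: @default\n";
print "enabled: @enabled\n";
```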

Wed Apr 22 18:04:02 2009 MARKSTOS [...] cpan.org - Correspondence added

On Wed Apr 22 17:43:34 2009, MARKSTOS wrote:

> On Sun Jun 22 08:03:11 2008, trendele@imtek.de wrote:
> > This is because HTML::Parser ignores self-closing tags by default, and
> > HTML::Scrubber does not set empty_element_tags().
> > I suggest adding this to HTML::Scrubber. In the meantime, you can set it
> > manually as a workaround:
> >
> > my $scrubber = HTML::Scrubber->new;
> > $scrubber->{_p}->empty_element_tags(1);
>
> This proposed patch would cause another test to fail in t/07_booleans.
> In particular, after parsing, this:
>
> <br />
> would become:
> <br></br>

On further review, I think this is acceptable behavior. When viewed under an XHTML transitional or 'strict' doctype, this renders as a single line break:

    <br></br>

In quirks mode, it would count as two line breaks. I think then this behavior is "good enough" and the resolution can be to update the tests to reflect this behavior.

Tue May 12 11:19:32 2020 DAKKAR [...] cpan.org - Correspondence added

I think this bug can be closed.

    HTML::Scrubber 0.19
    HTML::Parser 3.72

with the same program and input as the original reporter, I get this output:

------
<b> this is a line of bold </b>
<h> this is a line of bold </h>
------

which I think is the expected result.

Also, <br/> gets printed as <br />, which also looks correct.
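To compare a local installation against the versions reported above, a small sketch along these lines prints the installed module versions and scrubs a `<br/>` followed by the start of the reporter's fragment, without the `{_p}` workaround. The output depends on the installed versions, so none is shown here. The input string and variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Scrubber;
use HTML::Parser;

# Report which versions are installed, for comparison with the ticket.
my $scrubber_version = HTML::Scrubber->VERSION;
my $parser_version   = HTML::Parser->VERSION;
print "HTML::Scrubber $scrubber_version\n";
print "HTML::Parser $parser_version\n";

# Same setup as the original report, with no workaround applied.
my $scrubber = HTML::Scrubber->new;
$scrubber->default(1);

my $text = qq{<br/>\n<script src="www.google.com/script.js" />\n<b> this is a line of bold </b>\n};
my $scrubbed = $scrubber->scrub($text);

print "------\n", $scrubbed, "------\n";
```

If the `<b>` line is missing from the output, the installed versions still show the behavior from the original report.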