
This queue is for tickets about the HTML-Scrubber CPAN distribution.

Report information
The Basics
Id: 25477
Status: open
Priority: 0/
Queue: HTML-Scrubber

People
Owner: Nobody in particular
Requestors: nab83 [...] yahoo.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: self closing tags
Date: Thu, 15 Mar 2007 17:43:36 -0700 (PDT)
To: bug-HTML-Scrubber [...] rt.cpan.org
From: nabeel mohammed <nab83 [...] yahoo.com>
Hi,

I am trying to use HTML::Scrubber to strip some script tags and keep the rest of the HTML. Here is the HTML fragment I am using:

    <script src="www.google.com/script.js" />
    <b> this is a line of bold </b>
    <script type="text/javascript">
    alert("hello")
    </script>
    <h> this is a line of bold </h>

And here is the Perl code I am running:

    my $scrubber = new HTML::Scrubber;
    $scrubber->default(1);
    my $scrubbed = $scrubber->scrub( $text );

    print "$scrubbed";

All I see printed is:

    <h> this is a line of bold </h>

Now I might be missing something really obvious, but I can't figure it out.

Thanks,
Nabeel
From: trendele [...] imtek.de
This is because HTML::Parser ignores self-closing tags by default, and HTML::Scrubber does not set empty_element_tags(). I suggest adding this to HTML::Scrubber. In the meantime, you can set it manually as a workaround:

    my $scrubber = HTML::Scrubber->new;
    $scrubber->{_p}->empty_element_tags(1);

Now your example should work again.
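To illustrate the explanation above, here is a minimal sketch that uses HTML::Parser directly and lists the start and end tag events it reports for the reporter's fragment, once with default settings and once with empty_element_tags enabled. The helper name tag_events is made up for this illustration and is not part of either module. With default settings the first script tag is treated as an ordinary opening tag, so everything up to the next </script> is read as script content rather than markup.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the start/end tag events that HTML::Parser
# reports for $html, with empty_element_tags switched on or off.
sub tag_events {
    my ($html, $empty_element_tags) = @_;
    my @events;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
        end_h       => [ sub { push @events, "end:$_[0]" },   'tagname' ],
    );
    $parser->empty_element_tags($empty_element_tags ? 1 : 0);
    $parser->parse($html);
    $parser->eof;
    return @events;
}

# The fragment from the original report.
my $html = join "\n",
    '<script src="www.google.com/script.js" />',
    '<b> this is a line of bold </b>',
    '<script type="text/javascript"> alert("hello") </script>',
    '<h> this is a line of bold </h>';

print "default:            @{[ tag_events($html, 0) ]}\n";
print "empty_element_tags: @{[ tag_events($html, 1) ]}\n";
```

Comparing the two lines of output shows whether the b element is seen as a tag at all, which is what decides whether HTML::Scrubber can keep it.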
On Sun Jun 22 08:03:11 2008, trendele@imtek.de wrote:
> This is because HTML::Parser ignores self-closing tags by default, and
> HTML::Scrubber does not set empty_element_tags().
> I suggest adding this to HTML::Scrubber. In the meantime, you can set
> it manually as a workaround:
>
> my $scrubber = HTML::Scrubber->new;
> $scrubber->{_p}->empty_element_tags(1);

This proposed patch would cause another test to fail in t/07_booleans. In particular, after parsing, this:

    <br />

would become:

    <br></br>

That result is with HTML::Parser 3.56. Maybe newer versions of HTML::Parser are smarter.

Mark
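The <br></br> result follows from how HTML::Parser documents empty_element_tags: a tag ending in "/>" generates both a start event and an end event. The sketch below is not HTML::Scrubber's code. It is a deliberately naive re-serializer (the name reserialize is invented here) that rebuilds tags from the tag name alone, which is enough to show the effect.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical, naive re-serializer: rebuilds markup from tag names only
# (attributes are dropped on purpose to keep the illustration small).
sub reserialize {
    my ($html, $empty_element_tags) = @_;
    my $out = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $out .= "<$_[0]>" },  'tagname' ],
        end_h       => [ sub { $out .= "</$_[0]>" }, 'tagname' ],
        text_h      => [ sub { $out .= $_[0] },      'text' ],
    );
    $parser->empty_element_tags($empty_element_tags ? 1 : 0);
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print reserialize('one<br />two', 0), "\n";   # no end event for br
print reserialize('one<br />two', 1), "\n";   # start and end event for br
```

Any output layer that writes one closing tag per end event will therefore need a special case for void elements such as br if it wants to avoid emitting </br>.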
On Wed Apr 22 17:43:34 2009, MARKSTOS wrote:
> This proposed patch would cause another test to fail in t/07_booleans.
> In particular, after parsing, this:
>
> <br />
> would become:
> <br></br>

On further review, I think this is acceptable behavior. When viewed under an XHTML transitional or 'strict' doctype, this renders as a single line break:

    <br></br>

In quirks mode, it would count as two line breaks. I think this behavior is "good enough", and the resolution can be to update the tests to reflect it.
I think this bug can be closed. With HTML::Scrubber 0.19 and HTML::Parser 3.72, and the same program and input as the original reporter, I get this output:

------
<b> this is a line of bold </b>
<h> this is a line of bold </h>
------

which I think is the expected result. Also, <br/> gets printed as <br />, which also looks correct.