This queue is for tickets about the HTML-Tidy CPAN distribution.

Report information
The Basics
Id: 5548
Status: resolved
Priority: 0/
Queue: HTML-Tidy

People
Owner: Nobody in particular
Requestors: ben [...] sixapart.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



To: Andy Lester <andy [...] petdance.com>
From: Benjamin Trott <ben [...] sixapart.com>
Subject: HTML::Tidy -- Patch for a "clean" method
Date: Tue, 2 Mar 2004 17:51:52 -0800
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Andy,

Attached is a patch against 1.01_01 to provide a "clean" method to
HTML::Tidy. If you're interested in providing HTML cleaning mechanisms
through HTML::Tidy, this might be a useful start.

I think it could take some thinking into how it should actually
work--currently, for example, this:

<a href="http://www.example.com/"><em>This is a test.</a>

turns into this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1st March 2004), see www.w3.org">
<title></title>
</head>
<body>
<a href="http://www.example.com/"><em>This is a test.</em></a>
</body>
</html>

In other words, tidy encapsulates it in a full HTML document format. I
don't know if this is what the caller would expect. It's possible that
"clean" should get the string back into the minimal form that it was
originally given, but I fear that may be a rather difficult task, and
that callers should just be instructed to provide an entire HTML
document, rather than a fragment.

Anyway, just some thoughts. :)

Ben

- --- HTML-Tidy-1.01_01/lib/HTML/Tidy.pm 2004-02-29 09:41:29.000000000 -0800
+++ HTML-Tidy-1.01_01-new/lib/HTML/Tidy.pm 2004-03-02 17:44:38.000000000 -0800
@@ -198,6 +198,20 @@
     return !$parse_errors;
 }
+=head2 clean( $str [, $str...] )
+
+Cleans a string, or list of strings, that make up a single HTML file.
+
+Returns true if all went OK, or false if there was some problem calling
+tidy, or parsing tidy's output.
+
+=cut
+
+sub clean {
+    my $self = shift;
+    _tidy_clean(join( "", @_ ));
+}
+
 # Tells whether a given message object is one that we should keep.
 sub _is_keeper {
- --- HTML-Tidy-1.01_01/Tidy.xs 2004-02-29 09:37:40.000000000 -0800
+++ HTML-Tidy-1.01_01-new/Tidy.xs 2004-03-02 17:44:27.000000000 -0800
@@ -37,3 +37,32 @@
     OUTPUT:
         RETVAL
+SV *
+_tidy_clean(input)
+    INPUT:
+        char *input
+    CODE:
+        TidyBuffer errbuf = {0};
+        TidyDoc tdoc = tidyCreate(); // Initialize "document"
+        TidyBuffer output = {0};
+
+        int rc;
+
+        rc = tidySetErrorBuffer( tdoc, &errbuf ); // Capture diagnostics
+        if ( rc >= 0 )
+            rc = tidyParseString( tdoc, input ); // Parse the input
+        if (tidyCleanAndRepair(tdoc) >= 0) {
+            tidySaveBuffer(tdoc, &output);
+            char *str = (char *)output.bp;
+            RETVAL = newSVpvn( str, strlen(str) );
+            tidyBufFree( &output );
+        } else {
+            XSRETURN_UNDEF;
+        }
+
+        tidyBufFree( &errbuf );
+        tidyRelease( tdoc );
+
+    OUTPUT:
+        RETVAL
+
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)

iD8DBQFARTo4zGeEk2uv818RAhR3AJ9Ui4DQQ0stQBJOg23fLrITNQr+bwCfUG9O
FiYz/5qK0k8MjkvIh1PCY1w=
=pIm5
-----END PGP SIGNATURE-----

Done. Will be in 1.02. Thanks, Ben.