To: Andy Lester <andy [...] petdance.com>
From: Benjamin Trott <ben [...] sixapart.com>
Subject: HTML::Tidy -- Patch for a "clean" method
Date: Tue, 2 Mar 2004 17:51:52 -0800
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Andy,
Attached is a patch against 1.01_01 to provide a "clean" method to
HTML::Tidy. If you're interested in providing HTML cleaning mechanisms
through HTML::Tidy, this might be a useful start.
I think it needs some more thought about how it should actually
work--currently, for example, this:
<a href="http://www.example.com/"><em>This is a test.</a>
turns into this:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1st March 2004), see www.w3.org">
<title></title>
</head>
<body>
<a href="http://www.example.com/"><em>This is a test.</em></a>
</body>
</html>
In other words, tidy wraps the fragment in a full HTML document.
I don't know if this is what the caller would expect. It's possible
that "clean" should get the string back into the minimal form that it
was originally given, but I fear that may be a rather difficult task,
and that callers should just be instructed to provide an entire HTML
document, rather than a fragment.
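In the meantime, a caller who wants the fragment back could strip the wrapper after the fact. Here is a rough sketch in C, assuming tidy always emits lowercase <body> and </body> tags as in the output above. The name extract_body_fragment is a hypothetical helper, not part of HTML::Tidy or libtidy, and the sketch does not cope with a ">" inside a body attribute value.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of HTML::Tidy or libtidy).
 * Returns a newly malloc'd copy of the text between <body ...> and </body>
 * in `doc`, with surrounding whitespace trimmed. Returns NULL if no body
 * element is found or if allocation fails. The caller frees the result.
 * Assumption: tag names are lowercase, as in the tidy output shown above.
 */
char *extract_body_fragment(const char *doc)
{
    const char *open = strstr(doc, "<body");
    if (open == NULL)
        return NULL;

    /* Skip to the end of the opening tag, past any attributes. */
    const char *start = strchr(open, '>');
    if (start == NULL)
        return NULL;
    start++;

    const char *end = strstr(start, "</body>");
    if (end == NULL)
        return NULL;

    /* Trim the whitespace that surrounds the body content. */
    while (start < end && isspace((unsigned char)*start))
        start++;
    while (end > start && isspace((unsigned char)end[-1]))
        end--;

    size_t len = (size_t)(end - start);
    char *fragment = malloc(len + 1);
    if (fragment == NULL)
        return NULL;
    memcpy(fragment, start, len);
    fragment[len] = '\0';
    return fragment;
}
```

Passing the tidied document above to this function should give back just the repaired anchor line. It only undoes the wrapper; if tidy moves content into the head, that content is lost, which is part of why this looks hard to do properly.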
Anyway, just some thoughts. :)
Ben
- --- HTML-Tidy-1.01_01/lib/HTML/Tidy.pm 2004-02-29 09:41:29.000000000 -0800
+++ HTML-Tidy-1.01_01-new/lib/HTML/Tidy.pm 2004-03-02 17:44:38.000000000 -0800
@@ -198,6 +198,20 @@
return !$parse_errors;
}
+=head2 clean( $str [, $str...] )
+
+Cleans a string, or list of strings, that make up a single HTML file.
+
+Returns the cleaned HTML as a string, or undef if tidy failed to
+process the input.
+
+=cut
+
+sub clean {
+ my $self = shift;
+ _tidy_clean(join( "", @_ ));
+}
+
# Tells whether a given message object is one that we should keep.
sub _is_keeper {
- --- HTML-Tidy-1.01_01/Tidy.xs 2004-02-29 09:37:40.000000000 -0800
+++ HTML-Tidy-1.01_01-new/Tidy.xs 2004-03-02 17:44:27.000000000 -0800
@@ -37,3 +37,32 @@
OUTPUT:
RETVAL
+SV *
+_tidy_clean(input)
+ INPUT:
+ char *input
+ CODE:
+ TidyBuffer errbuf = {0};
+ TidyDoc tdoc = tidyCreate(); // Initialize "document"
+ TidyBuffer output = {0};
+
+ int rc;
+
+ rc = tidySetErrorBuffer( tdoc, &errbuf ); // Capture diagnostics
+ if ( rc >= 0 )
+ rc = tidyParseString( tdoc, input ); // Parse the input
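+ if ( rc >= 0 )
+ rc = tidyCleanAndRepair( tdoc ); // Repair the markup
+ if ( rc >= 0 )
+ rc = tidySaveBuffer( tdoc, &output ); // Serialize the result
+ if ( rc >= 0 && output.bp )
+ RETVAL = newSVpvn( (char *)output.bp, output.size );
+ else
+ RETVAL = newSV(0); // Return undef; cleanup below still runs
+ tidyBufFree( &output );
+ tidyBufFree( &errbuf );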
+ tidyRelease( tdoc );
+
+ OUTPUT:
+ RETVAL
+
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (Darwin)
iD8DBQFARTo4zGeEk2uv818RAhR3AJ9Ui4DQQ0stQBJOg23fLrITNQr+bwCfUG9O
FiYz/5qK0k8MjkvIh1PCY1w=
=pIm5
-----END PGP SIGNATURE-----