Subject: [Fwd: &nbsp; and \S (\s) regexp in HTML::TreeBuilder]
Date: Thu, 04 May 2006 01:40:00 -0800
To: Andy Lester <andy [...] petdance.com>
From: "Sean M. Burke" <sburke [...] cpan.org>

I've found an interesting (maybe corner-case) behavior in how
HTML::TreeBuilder handles &nbsp; entities in HTML snippets.
Short Version: &nbsp; is decoded to U+00A0 in Unicode strings, where it
matches /\s/, and so it is sometimes dropped by HTML::TreeBuilder's
tighten/delete_ignorable_whitespace handling.
Long Version:
HTML::TreeBuilder has options called ignore_ignorable_whitespace and
no_space_compacting.
Here's an interesting script that behaves weirdly:
use Test::More tests => 1;
use HTML::TreeBuilder;
my $body = "<p>&nbsp;</p><p>\x{34df}</p>";
my $t = HTML::TreeBuilder->new;
# Uncomment these two lines and test is now fine
#$t->no_space_compacting(1);
#$t->ignore_ignorable_whitespace(0);
$t->parse($body);
$t->eof;
like $t->guts->as_XML, qr/&#160;/;
So, when you pass a Unicode-flagged string to HTML::TreeBuilder's
parse() (which I think is the right thing to do to avoid bad HTML
entity expansion), &nbsp; will be decoded to Unicode U+00A0 (which is
\xc2\xa0 in UTF-8).
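
As a quick check of the byte sequence mentioned above, here is a minimal
sketch that uses only the core Encode module. The variable names are
mine and nothing here comes from HTML::TreeBuilder itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00A0 (NO-BREAK SPACE) as a one-character Perl string.
my $nbsp = "\x{00A0}";

# Encode that character to UTF-8 octets.
my $octets = encode('UTF-8', $nbsp);

# Render each octet as two hex digits, separated by spaces.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $octets;

print "$hex\n";    # prints "c2 a0"
```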
U+00A0 actually matches the regular expression class \s, while a
plain \xa0 byte (the Latin-1 representation, without the UTF-8 flag)
doesn't. So both the no_space_compacting and
ignore_ignorable_whitespace options are affected by that, since
they rely on a /\S/ regular expression match.
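
The flag-dependent match described above can be reproduced without
HTML::TreeBuilder. The sketch below assumes Perl's default regex
semantics (no "use feature 'unicode_strings'", no /u or /a modifier).
The ASCII-only character class at the end is only my illustration of a
test that does not treat U+00A0 as whitespace; it is not what the
module does.

```perl
use strict;
use warnings;

# Two strings holding the same code point, 0xA0.
my $bytes = "\xa0";       # Latin-1 byte string, no UTF-8 flag
my $chars = "\xa0";
utf8::upgrade($chars);    # same character, now stored with the UTF-8 flag

# Under default semantics, \s consults Unicode rules only for flagged strings.
my $bytes_is_space = $bytes =~ /\s/ ? 1 : 0;
my $chars_is_space = $chars =~ /\s/ ? 1 : 0;

# Hypothetical alternative: an explicit ASCII-only whitespace class.
my $chars_is_ascii_space = $chars =~ /[ \t\r\n\f]/ ? 1 : 0;

print "unflagged \\xa0 matches \\s:  $bytes_is_space\n";
print "flagged U+00A0 matches \\s:  $chars_is_space\n";
print "flagged U+00A0 ASCII space: $chars_is_ascii_space\n";
```

The two strings compare equal with eq, yet only the flagged one is
treated as whitespace, which is why the result depends on how the input
reached parse().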
I want HTML::TreeBuilder's default parameters to stay the same (i.e.
no_space_compacting is OFF, ignore_ignorable_whitespace is ON), but
I also want &nbsp; (or &#160;) to be kept in the HTML, because it is
meaningful in some cases.