Bug #19074 for HTML-Tree: [Fwd: &nbsp; and \S (\s) regexp in HTML::TreeBuilder]

Thu May 04 10:39:32 2006 sburke [...] cpan.org - Ticket created

Subject:	[Fwd: &nbsp; and \S (\s) regexp in HTML::TreeBuilder]
Date:	Thu, 04 May 2006 01:40:00 -0800
To:	Andy Lester <andy [...] petdance.com>
From:	"Sean M. Burke" <sburke [...] cpan.org>

I've found an interesting (maybe corner-case) behavior of HTML::TreeBuilder handling &nbsp;s in HTML snippets.

Short Version: &nbsp; is decoded to U+00A0 in Unicode strings and matches /\s/, and is thus sometimes broken by HTML::TreeBuilder's tighten/delete_ignorable_whitespaces stuff.

Long Version: HTML::TreeBuilder has options called ignore_ignorable_whitespace and no_space_compacting. Here's an interesting script that behaves weirdly:

    use Test::More tests => 1;
    use HTML::TreeBuilder;

    my $body = "<p>&nbsp;</p><p>\x{34df}</p>";
    my $t = HTML::TreeBuilder->new;

    # Uncomment these two lines and test is now fine
    #$t->no_space_compacting(1);
    #$t->ignore_ignorable_whitespace(0);

    $t->parse($body);
    $t->eof;

    like $t->guts->as_XML, qr/&#160;/;

So, when you pass a Unicode-flagged string to HTML::TreeBuilder's parse() (which I think is the right thing to do to avoid bad HTML element expansion), &nbsp; will be decoded to Unicode U+00A0 (which is \xc2\xa0 in UTF-8). U+00A0 actually matches the regular expression class \s, while plain \xa0 (the Latin-1 representation) doesn't. So both the no_space_compacting and ignore_ignorable_whitespace options are affected by that, since they use a /\S/ regular expression match.

I want HTML::TreeBuilder's default parameters to stay the same (i.e. no_space_compacting is OFF, ignore_ignorable_whitespace is ON), but to keep &nbsp; (or &#160;) there in the HTML, because they're meaningful in some cases.
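The \s behavior described above can be reproduced without HTML::TreeBuilder at all. The following is only a minimal sketch: the helper names blank_by_S and blank_by_ascii are hypothetical, the explicit ASCII whitespace class is just one possible way to tell the two cases apart, and none of this is taken from the patch in RT 17481. It assumes the code is compiled without the unicode_strings feature (so no "use v5.12" or later at the top), since that feature changes how \s treats native 8-bit strings.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not HTML::TreeBuilder internals.
# "Blank" here means "contains no non-whitespace character".
sub blank_by_S     { my ($s) = @_; return $s !~ /\S/            ? 1 : 0 }
sub blank_by_ascii { my ($s) = @_; return $s !~ /[^ \t\r\n\f]/  ? 1 : 0 }

my $native   = "\xa0";      # U+00A0 stored as a single native (Latin-1) byte
my $upgraded = "\xa0";
utf8::upgrade($upgraded);   # same character, now UTF-8 flagged internally

for my $case ([native => $native], [upgraded => $upgraded]) {
    my ($label, $str) = @$case;
    printf "%-8s  blank by \\S: %d  blank by ASCII class: %d\n",
        $label, blank_by_S($str), blank_by_ascii($str);
}
```

Under those assumptions, only the upgraded string should be reported as blank by the /\S/ test, which is the case that lets the whitespace-handling options discard a lone &nbsp;. The explicit ASCII class should treat U+00A0 as content in both representations.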

Sat Jul 29 11:25:56 2006 cjm [...] cpan.org - Correspondence added

From:	cjm [...] pobox.com

On Thu May 04 10:39:32 2006, SBURKE wrote:

> Short Version: &nbsp; is decoded to U+00A0 in Unicode strings and
> matches /\s/, and is thus sometimes broken by HTML::TreeBuilder's
> tighten/delete_ignorable_whitespaces stuff.

I guess you missed that I had already submitted a patch for this (including a new test to make sure it works). It just hasn't been applied yet. See http://rt.cpan.org/Public/Bug/Display.html?id=17481

Sat Jul 29 11:25:57 2006 The RT System itself - Status changed from 'new' to 'open'

Sun Aug 06 00:57:44 2006 PETEK [...] cpan.org - Correspondence added

Applied Chris Madsen's patch from RT 17481, which fixes this corner case, to svn. This will be resolved in the next release of HTML-Tree.

Sun Aug 06 00:59:36 2006 PETDANCE [...] cpan.org - Status changed from 'open' to 'resolved'
