This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 19074
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: sburke [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: [Fwd: &nbsp; and \S (\s) regexp in HTML::TreeBuilder]
Date: Thu, 04 May 2006 01:40:00 -0800
To: Andy Lester <andy [...] petdance.com>
From: "Sean M. Burke" <sburke [...] cpan.org>
I've found an interesting (maybe corner-case) behavior in how HTML::TreeBuilder handles &nbsp;s in HTML snippets.

Short Version: &nbsp; is decoded to U+00A0 in Unicode strings and matches with /\s/, and is thus sometimes broken by HTML::TreeBuilder's tighten/delete_ignorable_whitespaces stuff.

Long Version: HTML::TreeBuilder has options called ignore_ignorable_whitespace and no_space_compacting. Here's an interesting script that behaves weirdly:

    use Test::More tests => 1;
    use HTML::TreeBuilder;

    my $body = "<p>&nbsp;&nbsp;</p><p>\x{34df}</p>";
    my $t = HTML::TreeBuilder->new;

    # Uncomment these two lines and test is now fine
    #$t->no_space_compacting(1);
    #$t->ignore_ignorable_whitespace(0);

    $t->parse($body);
    $t->eof;

    like $t->guts->as_XML, qr/&#160;/;

So, when you pass a Unicode-flagged string to HTML::TreeBuilder's parse() (which I think is the right thing to do to avoid bad HTML element expansion), &nbsp; will be decoded to Unicode U+00A0 (which is \xc2\xa0 in UTF-8). U+00A0 actually matches the regular expression class \s, while plain \xa0 (the Latin-1 representation) doesn't. So both the no_space_compacting and ignore_ignorable_whitespace options are affected by that, since they rely on a /\S/ regular expression match.

I want HTML::TreeBuilder's default parameters to stay the same (i.e. no_space_compacting is OFF, ignore_ignorable_whitespace is ON), but I want it to keep &nbsp; (or &#160;) in the HTML, because they're meaningful in some cases.
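The \s difference described above can be reproduced without HTML::TreeBuilder. The following is a minimal sketch, assuming a perl whose regexes use the default rules (no `use feature 'unicode_strings'` and no /u or /a modifier). The helper names has_space and has_space_except_nbsp are hypothetical and exist only for illustration. The [^\S\xA0] class is one possible way to match whitespace while sparing U+00A0; it is not claimed to be what the patch in RT 17481 does.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains anything matching \s.
sub has_space {
    my ($s) = @_;
    return $s =~ /\s/ ? 1 : 0;
}

# Hypothetical helper: true if the string contains whitespace other than
# U+00A0. [^\S\xA0] reads as "not a non-space and not NBSP".
sub has_space_except_nbsp {
    my ($s) = @_;
    return $s =~ /[^\S\xA0]/ ? 1 : 0;
}

my $bytes = "\xa0";        # Latin-1 byte string, UTF8 flag off
my $chars = "\xa0";
utf8::upgrade($chars);     # same character, but now UTF8-flagged

printf "bytes \\s:            %d\n", has_space($bytes);
printf "chars \\s:            %d\n", has_space($chars);
printf "chars minus NBSP:    %d\n", has_space_except_nbsp($chars);
printf "plain space, no NBSP: %d\n", has_space_except_nbsp(" ");
```

Under those assumptions, only the UTF8-flagged copy should match \s, which is the asymmetry the report describes. The narrower class should ignore U+00A0 in both forms while still matching an ordinary space.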
From: cjm [...] pobox.com
On Thu May 04 10:39:32 2006, SBURKE wrote:
> Short Version: &nbsp; is decoded to U+00A0 in Unicode strings and
> matches with /\s/, and is thus sometimes broken by HTML::TreeBuilder's
> tighten/delete_ignorable_whitespaces stuff.
I guess you missed that I had already submitted a patch for this (including a new test to make sure it works). It just hasn't been applied yet. See http://rt.cpan.org/Public/Bug/Display.html?id=17481
Applied Chris Madsen's patch from RT 17481, which fixes this corner case, to svn. This will be resolved in the next release of HTML-Tree.