
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 14260
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: nothingmuch [...] woobling.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: 3.22



Subject: HTML::Element's _xml_escape should be left to a filter that knows that the encodings involved are
_xml_escape, as applied by as_XML (called by Class::DBI::AsForm), was causing data corruption during round-tripping when Unicode was involved.

My workaround was to assign an empty sub to _xml_escape.

My guess is that the data was decoded as Latin-1 or something similar by the browser (despite the meta http-equiv specifying UTF-8, and the server agreeing with it in the Content-Type header).

This data was then sent back to the server, but it was Unicode reinterpreted as Latin-1 and converted into Unicode again, so wide characters were turned into accented narrow ones from the Latin-1 range.

Anyway, my point is that since HTML::Element has no control over where its output will eventually be fed, this escaping should be an optional feature that can easily be disabled or replaced. Another filter that replaces unprintable characters could then be applied to the string returned by as_XML in the output handler (for example a Catalyst plugin that hooks on output, or a special Perl I/O mode).

Ciao, and thanks!
From: somewhere [...] confuzzled.lu
Hi,

The same function seems to be used in HTML::Widget when filling fields with values. However, if the values contain non-standard letters (like éààé), all of those letters are converted to their HTML entity counterparts. (I haven't looked at both modules, so I don't know whether the HTML::Widget author is using your module correctly.)

As mentioned on http://lists.rawmode.org/pipermail/catalyst/2006-May/007646.html, the following _xml_escape function would solve the problem:

    sub _xml_escape { # DESTRUCTIVE (a.k.a. "in-place")
        foreach my $x (@_) {
            $x =~ s~([<&>])~'&#'.(ord($1)).';'~seg;
        }
        return;
    }

Thibaut

On Mon Aug 22 10:54:56 2005, NUFFIN wrote:
> _xml_escape as applied by as_XML, called by Class::DBI::AsForm was
> causing data corruption during round tripping when unicode was
> involved.
>
> My workaround was to assign an empty sub to _xml_escape.
>
> My guess is that data was decoded as latin 1 or something by the
> browser (Despite meta http-equiv specifying utf-8, as well as the
> server agreeing with it WRT to the Content-Type header).
>
> This data was then sent back to the server, but it was unicode
> reinterpreted as latin 1, converted into unicode, so wide
> characters were made into accented narrow ones from the latin 1
> space.
>
> Anyway, my point is that since HTML::Element has no control over where
> it's output data will be fed to eventually this should be an
> optional feature, that can be easily disabled or replaced, where
> another filter to replace unprintable characters can be applied to
> the string resulting from 'as_XML' by the output handler (for
> example a catalyst plugin, that hooks on output, or a special perl
> io mode).
>
> Ciao, and thanks!
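To make the effect of the replacement function proposed in this reply concrete, here is a small standalone sketch. The sub is copied from the message but renamed to proposed_xml_escape (a hypothetical name) so that it does not touch HTML::Element itself, and the sample string is made up for illustration. It only shows that the three markup characters are rewritten in place while non-ASCII letters are left alone.

```perl
use strict;
use warnings;
use utf8;    # the sample string below contains literal non-ASCII letters

# Copy of the function proposed in the ticket, under a hypothetical name.
# It modifies its arguments in place and escapes only <, & and >.
sub proposed_xml_escape {
    foreach my $x (@_) {
        $x =~ s~([<&>])~'&#'.(ord($1)).';'~seg;
    }
    return;
}

my $value = "<b>éà & co</b>";    # made-up sample input
proposed_xml_escape($value);

binmode(STDOUT, ':encoding(UTF-8)');
print "$value\n";    # prints: &#60;b&#62;éà &#38; co&#60;/b&#62;
```

The accented letters survive untouched, which is the behavior the reporters were asking for: any encoding of non-ASCII characters is left to whatever layer actually knows the output encoding.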
_xml_escape now escapes only five characters. Four of them, <, >, ' and ", are always escaped. The fifth is &, which is escaped only if it is not part of an already existing escape. The escapes recognized are &[a-z0-9]+; (e.g. &lt;) and &#\d+; (e.g. &#62;). This allows, for example, &nbsp; to pass through unharmed, so that an intended non-breaking space doesn't get double-escaped to &amp;nbsp; and produce unexpected behavior. Added the test t/escape.t to prove this behavior.
This fix will be released to CPAN this weekend as part of the Chicago Hackathon.
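The rule described in this resolution can be sketched roughly as below. This is not the code shipped in HTML-Tree 3.22; escape_sketch is a hypothetical name, and emitting numeric character references (as in the earlier proposal) is an assumption, since the comment above does not say which form the escapes take. The sketch only illustrates the "escape & unless it already starts an escape" idea with a negative lookahead.

```perl
use strict;
use warnings;

# Hypothetical sketch of the described rule, not the actual module code.
# Always escape < > ' and "; escape & only when it is NOT followed by
# something matching [a-z0-9]+; or #\d+; (i.e. an existing escape).
sub escape_sketch {
    my ($text) = @_;
    $text =~ s/([<>'"]|&(?![a-z0-9]+;|#\d+;))/'&#' . ord($1) . ';'/ge;
    return $text;
}

print escape_sketch('a < b &nbsp; & c'), "\n";   # a &#60; b &nbsp; &#38; c
print escape_sketch('x &#62; y'), "\n";          # x &#62; y (left unchanged)
```

Note that, taken literally, the pattern &[a-z0-9]+; is lowercase-only, so under this sketch an entity such as &Eacute; would have its ampersand escaped.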