
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 18568
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: mjd [...] plover.com
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 3.13
Fixed in: 3.22



Subject: HTML::TreeBuilder mangles decimal number entities
use HTML::TreeBuilder;
my $TB = HTML::TreeBuilder->new();
my $html = $TB->parse("This &#x17f;oftware has &#383;ome bugs")->eof->elementify();
print $html->as_HTML("");

The content output from this program is not the same as the input. The input contains "&#17f" and "&#383". The output has erroneously translated this to "&amp;#17f" and "&amp;#383".
On Thu Apr 06 12:55:24 2006, guest wrote:
> The content output from this program is not the same as the input. The
> input contains "&#17f" and "&#383". The output has erroneously
> translated this to "&amp;#17f" and "&amp;#383".
There are two things going on here. One is that HTML::TreeBuilder was erroneously re-encoding entities such as ſ by escaping &. This has been fixed in 3.22, which will be released on CPAN this weekend as part of the Chicago Hackathon.

The other, unfixable in HTML::TreeBuilder, is that HTML::Parser re-encodes both of the above to ſ instead of their original forms. Since HTML::TreeBuilder's parse method comes from HTML::Parser, this would have to be changed in the XS for HTML::Parser. However, I'm not convinced it's a bug, since they're the same entity when decoded.

Will mark as resolved when 3.22 hits CPAN.
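
For anyone following along, here is a rough sketch of both effects in plain Perl. It assumes the two references in the report are the hexadecimal (&#x17f;) and decimal (&#383;) spellings of U+017F. The helpers decode_numeric_refs and escape_ampersands are hypothetical names made up for this illustration; they are not part of HTML::Parser or HTML::TreeBuilder and only imitate the behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical helper: decode numeric character references only.
# This imitates what an entity-decoding parser does; it is not
# HTML::Parser's actual implementation.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;   # hexadecimal form
    $text =~ s/&#([0-9]+);/chr($1)/ge;                  # decimal form
    return $text;
}

# Hypothetical helper: escape every ampersand, as if the reference
# were ordinary text rather than markup.
sub escape_ampersands {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $from_hex = decode_numeric_refs('&#x17f;');
my $from_dec = decode_numeric_refs('&#383;');

# Both spellings decode to one code point, so the original spelling
# cannot be recovered from the decoded character alone.
printf "hex form decodes to U+%04X\n", ord $from_hex;
printf "dec form decodes to U+%04X\n", ord $from_dec;
print  "same character: ", ($from_hex eq $from_dec ? "yes" : "no"), "\n";

# Escaping the ampersand of an undecoded reference produces the kind
# of output shown in the original report.
print escape_ampersands('&#383;'), "\n";
```

Once a reference has been decoded, a serializer can only pick one spelling when writing the character back out, which is why both inputs come out in the same form.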