
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 14964
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: jtalbot [...] proionta.gr
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 3.18
Fixed in: 3.22

Attachments


Subject: Attributes of tags get entity-decoded (and even worse, wrongly) when parsed
Running Debian stable with Perl 5.8.4, I'm parsing this content from a string:

<a href="page.pl?id=10&sub=20">

When I print it with as_HTML, I get:

<a href="page.pl?id=10&sub;=20">

A semicolon is mistakenly added after the word 'sub'. Running the Perl debugger shows that the problem is not in the printing stage but in the parsing. I use HTML::TreeBuilder->new_from_content($string) to parse. Here's my program:

---------------------------
#!/usr/bin/perl -w
use HTML::TreeBuilder;
my $page = '<a href="page.pl?id=10&sub=20">';
my $p = HTML::TreeBuilder->new_from_content( $page );
# [debug at this stage shows that $p contains a unicode character instead of '&sub']
print $p->as_HTML();
---------------------------

Until this is fixed, is there a way to disable entity-decoding when parsing?
I have attached a test case based on Test::More. From the comments:

# HTML::TreeBuilder invokes HTML::Entities::decode on the contents of
# HREF attributes. Some CGI-based sites use lang=en or such for
# internationalization. When this parameter is after an ampersand,
# the resulting &lang is decoded, breaking the link. "sub" is another
# popular one.

Thanks.

-- Rocco Caputo - http://poe.perl.org/
Download support-html-treebuilder.perl
application/octet-stream 662b
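
The attached file is not reproduced in this ticket view. As a rough sketch only, and not the attached test case, a check along the lines described in the comments above might look like the following. The helper name href_after_parse and the sample URLs are assumptions made for illustration; only new_from_content, look_down, attr and delete are taken from the HTML::TreeBuilder / HTML::Element interface.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the attachment): parse a one-link document
# and return the href value as HTML::TreeBuilder stored it.
sub href_after_parse {
    my ($url) = @_;
    my $tree = HTML::TreeBuilder->new_from_content(qq{<a href="$url">link</a>});
    my $link = $tree->look_down( _tag => 'a' );
    my $href = defined $link ? $link->attr('href') : undef;
    $tree->delete;    # free the tree explicitly
    return $href;
}

# Sample URLs are illustrative; "sub" and "lang" are the parameter names
# mentioned in the reports above.
for my $url ( 'page.pl?id=10&sub=20', 'page.pl?id=10&lang=en' ) {
    my $href = href_after_parse($url);
    my $same = defined $href && $href eq $url;
    print( ( $same ? 'unchanged' : 'CHANGED' ), ": $url\n" );
}
```

Going by the reports above, an affected version (3.18) would presumably print CHANGED for both URLs. The result on other versions depends on the installed HTML-Tree and HTML::Parser releases, so no output is shown here.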

Resolved as part of HTML-Tree 3.22, which will be released this weekend as part of the Chicago Hackathon.