This queue is for tickets about the HTML-LinkExtractor CPAN distribution.

Report information
The Basics
Id: 5274
Status: resolved
Priority: 0/
Queue: HTML-LinkExtractor

People
Owner: Nobody in particular
Requestors: william [...] knowmad.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.09
Fixed in: (no value)



Subject: URL encoded links
I've started working with LinkExtractor and am liking it so far. Unfortunately, I've come across a problem with url-encoded links. If a source html contains links with url-encoded '&' chars (as &amp;), the extracted url gets decoded. Viz:

    $VAR1 = bless( {
      '_tp' => undef,
      '_strip' => 0,
      '_LINKS' => [
        {
          '_TEXT' => '<a href="http://www.perl.com/cgi-bin/hw.cgi?key=val&amp;key1=val1">http://www.perl.com/cgi-bin/hw.cgi?key=val&amp;key1=val1</a>',
          'href' => 'http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1',
          'tag' => 'a'
        }
      ]
    }, 'HTML::LinkExtractor' );

This is a problem for me. Is there any way to ask HTML::LinkExtractor to not unescape the url? I've tried using URI::Escape on the href value but that generates %26 instead of &amp; for '&'. For now, I'm using a simple regex to fix the extracted href. I'm sure there are other characters that will be affected but '&' is the most common one that I have been able to discover in my testing.

Thanks,
William
> This is a problem for me. Is there any way to ask HTML::LinkExtractor to not unescape the url?

No, not at the moment (this probably won't change).

> I've tried using URI::Escape on the href value but that generates %26 instead of &amp; for '&'.

That's what URI::Escape is supposed to do. URI::Escape is not HTML::Entities.
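
To illustrate the distinction drawn in the reply, here is a minimal sketch, not something the module does for you: if the extracted href has to go back into HTML source, one option is to re-encode it with HTML::Entities rather than URI::Escape. The variable names are illustrative, and restricting the encoded set to `<>&"` is an assumption about what an attribute value needs.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use URI::Escape qw(uri_escape);

# The decoded href, as reported in the dump from this ticket.
my $href = 'http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1';

# HTML entity encoding: turns '&' back into '&amp;' for use in HTML source.
# The second argument limits encoding to the listed characters (an assumption
# about what is unsafe inside a double-quoted attribute).
my $for_html = encode_entities($href, '<>&"');
print "$for_html\n";

# URI percent-encoding: a different layer, which is why '&' becomes '%26'.
my $escaped = uri_escape('&');
print "$escaped\n";
```

The two modules work at different layers: HTML::Entities handles markup escaping, while URI::Escape handles percent-encoding of characters within a URI.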