Bug #5274 for HTML-LinkExtractor: URL encoded links

Mon Feb 09 18:45:26 2004 Guest - Ticket created

Subject:

URL encoded links

I've started working with LinkExtractor and am liking it so far. Unfortunately, I've come across a problem with url-encoded links. If a source html contains links with url-encoded '&' chars (as &amp;), the extracted url gets decoded. Viz:

    $VAR1 = bless( {
      '_tp' => undef,
      '_strip' => 0,
      '_LINKS' => [
        {
          '_TEXT' => '<a href="http://www.perl.com/cgi-bin/hw.cgi?key=val&amp;key1=val1">http://www.perl.com/cgi-bin/hw.cgi?key=val&amp;key1=val1</a>',
          'href' => 'http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1',
          'tag' => 'a'
        }
      ]
    }, 'HTML::LinkExtractor' );

This is a problem for me. Is there any way to ask HTML::LinkExtractor to not unescape the url? I've tried using URI::Escape on the href value but that generates %26 instead of &amp; for '&'. For now, I'm using a simple regex to fix the extracted href. I'm sure there are other characters that will be affected but '&' is the most common one that I have been able to discover in my testing.

Thanks,
William

Sun Feb 22 06:44:48 2004 PODMASTER [...] cpan.org - Correspondence added

> This is a problem for me. Is there any way to ask HTML::LinkExtractor
> to not unescape the url?

No, not at the moment (this probably won't change).

> I've tried using URI::Escape on the href
> value but that generates %26 instead of &amp; for '&'.

That's what URI::Escape is supposed to do. URI::Escape is not HTML::Entities.
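
A minimal sketch of that distinction, assuming the goal is to write the extracted href back into HTML source. The variable names are illustrative only, and both modules are assumed to be installed from CPAN. HTML::Entities works at the HTML level ('&' becomes '&amp;'), while URI::Escape works at the URI level ('&' becomes '%26').

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use URI::Escape qw(uri_escape);

# href as returned by the extractor, with the entity already decoded
my $href = 'http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1';

# HTML-level escaping: only the characters listed in the second
# argument are replaced by entities, so '&' becomes '&amp;'
my $for_html = encode_entities($href, '<>&"');

# URI-level escaping: reserved characters become %XX sequences,
# which is why '&' turned into '%26' in the report above
my $for_uri = uri_escape('&');

print "$for_html\n";   # http://www.perl.com/cgi-bin/hw.cgi?key=val&amp;key1=val1
print "$for_uri\n";    # %26
```

Re-encoding with encode_entities may be more robust than a hand-written regex, since it also covers the other characters that are special in HTML attribute values.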

Sun Feb 22 06:44:49 2004 PODMASTER [...] cpan.org - Status changed from 'new' to 'resolved'