Subject: URL encoded links
I've started working with LinkExtractor and am liking it so far. Unfortunately, I've come across a problem with url-encoded links. If the source HTML contains links with escaped '&' chars (written as &amp;), the extracted URL comes back decoded. Viz:
$VAR1 = bless( {
    '_tp' => undef,
    '_strip' => 0,
    '_LINKS' => [
        {
            '_TEXT' => '<a href="http://www.perl.com/cgi-bin/hw.cgi?key=val&amp;key1=val1">http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1</a>',
            'href' => 'http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1',
            'tag' => 'a'
        }
    ]
}, 'HTML::LinkExtractor' );
This is a problem for me. Is there any way to ask HTML::LinkExtractor not to unescape the URL? I've tried using URI::Escape on the href value, but that generates %26 instead of &amp; for '&'.
For now, I'm using a simple regex to fix the extracted href. I'm sure there are other characters that will be affected but '&' is the most common one that I have been able to discover in my testing.
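A minimal sketch of that kind of workaround, assuming the only goal is to turn bare '&' characters back into &amp; before the URL is written into HTML again. The helper name reescape_amp is hypothetical, and the pattern is an illustration rather than the exact regex in use:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::LinkExtractor): re-escape bare '&'
# characters in an extracted href. An '&' that already begins an entity
# such as '&amp;' or '&#38;' is left alone, so running it twice is harmless.
sub reescape_amp {
    my ($href) = @_;
    $href =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $href;
}

my $href = 'http://www.perl.com/cgi-bin/hw.cgi?key=val&key1=val1';
print reescape_amp($href), "\n";
```

This only handles '&'. If other characters need escaping as well, HTML::Entities' encode_entities($href, '&<>"') is probably a better fit than a hand-written regex, though it will double-escape any entity already present in the string.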
Thanks,
William