Bug #102221 for Perlanet: URLEncoding issue with metacpan.org "News" feed

Thu Feb 19 15:00:19 2015 grtodd [...] gmail.com - Ticket created

Subject: URLEncoding issue with metacpan.org "News" feed

MetaCPAN's "Recent" feed works well, and Perlanet creates links like the following for aggregation:

http://metacpan.org/release/DWHEELER/Pod-Simple-3.29_6

The "News" feed at MetaCPAN (https://metacpan.org/feed/news), however, uses URLs and links like this (with anchors):

https://metacpan.org/news#sslimprovements

which are lowercase with white space removed (note the "#"). When Perlanet tries to create an aggregation from this feed, it URL-encodes "#" as %23, so the resulting links look like:

http://metacpan.org/news%23SSL%20improvements

and thus break: once "#" is URL-encoded it is no longer seen as the start of an anchor, but as a literal character in the path.

I have no idea whether this is a Perlanet bug, nor how or where to fix it. There may be some sort of discrepancy between the RDF/Atom feed describing the page and the actual source of the page.

A workaround might be to add "/" to the end of the URL, which causes "%23" to be seen as an anchor. For example:

http://metacpan.org/news/%23SSL%20improvements

does find the page - if not the actual anchor location. Or perhaps adjusting settings when the HTML::Scrubber object is created would help - but I haven't investigated further.
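The difference between the two links can be checked with any URL parser. The following is a minimal sketch in Python, used here purely for illustration (it is not part of Perlanet); the variable names are made up, and the two URLs are the ones quoted in the report.

```python
from urllib.parse import urlsplit

# Both URLs are copied from the report above.
intended = urlsplit("https://metacpan.org/news#sslimprovements")
generated = urlsplit("http://metacpan.org/news%23SSL%20improvements")

# A literal '#' starts the fragment (the anchor).
print(repr(intended.path), repr(intended.fragment))

# '%23' is an ordinary path character, so no fragment is found at all.
print(repr(generated.path), repr(generated.fragment))
```

In the second case the whole string after the host is treated as the path, which is why the request is made for a different resource rather than for an anchor on the news page.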

Fri Feb 20 08:16:51 2015 DAVECROSS [...] cpan.org - Taken

Fri Feb 20 08:34:33 2015 DAVECROSS [...] cpan.org - Correspondence added

Hi,

It looks like there are a few things going on here.

Firstly, there's no problem with the feed handling. If you're generating a feed file and you look at the URLs in it, you'll see that they are correct.

Secondly, MetaCPAN are creating invalid URLs. They all contain spaces - and spaces shouldn't exist in URLs. They should all be encoded as %20 or +. The URL you give as an example (https://metacpan.org/news#sslimprovements) doesn't exist in their feed. It's actually "http://metacpan.org/news#SSL improvements".

Thirdly, MetaCPAN are creating URLs that contain fragments which link to <a> elements that don't exist. If they publish a URL like https://metacpan.org/news#sslimprovements then you'd expect to find an <a> element like <a name="sslimprovements">. That doesn't exist in the HTML source.

So, even if Perlanet worked as expected, your links wouldn't work because the MetaCPAN site is broken. I'll see if I can submit a patch to them to fix those issues.

But there is still a problem with the page that Perlanet is generating for you. I don't think that it should change '#' to '%23'. That's happening because in the sample TT file which I provide (and which I assume you copied) I use the 'uri' filter to clean up URLs for display.

A quick fix would be to remove the 'uri' filter. But I need to think about what other effects that might have. I think it's good practice to have it there (in most cases).

It might be a bug in TT's 'uri' filter. It might need to add '#' to the list of characters that it doesn't touch.

Thanks for the report.

Cheers,

Dave...
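The effect of adding '#' to the list of untouched characters can be sketched as follows. This is Python for illustration only, not TT's actual 'uri' filter: the helper `encode_link` and both safe-character sets are assumptions made for the example, and the feed link is the one quoted in the reply above.

```python
from urllib.parse import quote

# Hypothetical helper, NOT TT's implementation: the safe sets are assumptions.
def encode_link(url: str, keep_fragment: bool) -> str:
    """Percent-encode a link, optionally leaving '#' untouched."""
    safe = ":/?&=" + ("#" if keep_fragment else "")
    return quote(url, safe=safe)

feed_link = "http://metacpan.org/news#SSL improvements"

# '#' is encoded along with the space, so the anchor is lost.
print(encode_link(feed_link, keep_fragment=False))

# '#' is left alone; only the space is encoded.
print(encode_link(feed_link, keep_fragment=True))
```

With '#' in the safe set the space is still escaped, so the link becomes a valid URL while keeping its fragment. Whether that is the right default for a general-purpose filter is the open question raised above.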

Fri Feb 20 08:34:33 2015 The RT System itself - Status changed from 'new' to 'open'