This queue is for tickets about the Dancer-Plugin-SiteMap CPAN distribution.

Report information
The Basics
Id: 94833
Status: resolved
Priority: 0/
Queue: Dancer-Plugin-SiteMap

People
Owner: james [...] ronanweb.co.uk
Requestors: spudsoup [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 0.12
Fixed in: (no value)



Subject: Wishlist: Change the sitemap URL & parse robots.txt
I have attached a patch file with two proposed enhancements to Dancer::Plugin::SiteMap:

1. Allow the URL of the sitemap to be modified or disabled entirely via the Dancer config.

2. Populate the sitemap_ignore list by parsing the robots.txt file for the site. This is so that you don't have to edit in two places to block a URL from the crawler.
Subject: sitemap.patch
--- Dancer/Plugin/SiteMap_ORIG.pm	2014-04-17 14:56:28.421180099 +0100
+++ Dancer/Plugin/SiteMap.pm	2014-04-17 11:36:40.341308004 +0100
@@ -17,7 +17,7 @@
 
 =cut
 
-our $VERSION = '0.12';
+our $VERSION = '0.13';
 my $OMIT_ROUTES = [];
 
 # Add syntactic sugar for omitting routes.
@@ -35,13 +35,48 @@
 
 
 # Add the routes for both the XML sitemap and the standalone one.
-get '/sitemap.xml' => sub {
-    _xml_sitemap();
-};
+# The path to the route can be defined in the plugin configuration
+my $conf = plugin_setting();
 
-get '/sitemap' => sub {
-    _html_sitemap();
-};
+if ( defined $conf->{'xml_route'} ) {
+    # Non default route for XML sitemap
+    get $conf->{'xml_route'} => sub { _xml_sitemap() };
+}
+elsif ( exists $conf->{'xml_route'} ) {
+    # XML sitemap is disabled. Do not add a route.
+}
+else {
+    # Default route for XML sitemap
+    get '/sitemap.xml' => sub { _xml_sitemap() };
+}
+
+if ( defined $conf->{'html_route'} ) {
+    # Non default route for HTML sitemap
+    get $conf->{'html_route'} => sub { _html_sitemap() };
+}
+elsif ( exists $conf->{'html_route'} ) {
+    # HTML sitemap is disabled. Do not add a route.
+}
+else {
+    # Default route for HTML sitemap
+    get '/sitemap' => sub { _html_sitemap() };
+}
+
+if ( defined $conf->{'robots_disallow'} ) {
+    # Read the Disallow lines from robots.txt and add to $OMIT_ROUTES
+    my $robots_txt = $conf->{'robots_disallow'};
+    my @disallowed_list = ();
+    open my $inFH, '<', $robots_txt or die "Error reading $robots_txt $!";
+
+    while ( my $line = <$inFH> ) {
+        if ( $line =~ m/Disallow: \s*(\/.*)$/ ) {
+            push @disallowed_list, $1;
+        }
+    }
+
+    close $inFH;
+    sitemap_ignore(@disallowed_list);
+}
 
 =head1 SYNOPSIS
 
@@ -52,6 +87,23 @@
 
     sitemap_ignore ('ignore/this/route', 'orthese/.*');
 
+Or omit all routes disallowed in robots.txt.
+In the config.yml of the application:
+
+    plugins:
+        SiteMap:
+            robots_disallow: /local/path/to/robots.txt
+
+You can also change the default route for the sitemap by adding fields to
+the plugin config.
+
+eg, in the config.yml of the application:
+
+    plugins:
+        SiteMap:
+            xml_route: /sitemap_static.xml
+            html_route: # html sitemap is disabled.
+
 =head1 DESCRIPTION
 
 B<This plugin now supports Dancer 1 and 2!>
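
For anyone trying the robots.txt idea outside the plugin, here is a minimal standalone sketch of the Disallow parsing step. It is not the patch's code: the helper name `disallowed_paths` is hypothetical, it takes the file content as a string rather than a path, and it is slightly more lenient than the patch's regex (case-insensitive, no space required after the colon, trailing `#` comments dropped). Like the patch, it ignores User-agent grouping.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Dancer::Plugin::SiteMap):
# return the paths named on Disallow lines in robots.txt content.
sub disallowed_paths {
    my ($robots_content) = @_;
    my @paths;

    for my $line ( split /\r?\n/, $robots_content ) {
        $line =~ s/\s*#.*$//;    # drop trailing comments

        # Only keep rules that name a path; an empty "Disallow:" blocks nothing.
        if ( $line =~ m{^\s*Disallow\s*:\s*(/\S*)}i ) {
            push @paths, $1;
        }
    }

    return @paths;
}

my $example = "User-agent: *\nDisallow: /private/ # staff only\nDisallow:\n";
print join( "\n", disallowed_paths($example) ), "\n";
```

The returned list could then be handed to sitemap_ignore, as the patch does with @disallowed_list. How robots.txt path prefixes should map onto the patterns sitemap_ignore accepts is left open here.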
Hi David,

Thanks for your contribution! Always good to have other people's input. Also, apologies for the delayed response. I've been away and only returned this morning.

I've patched D:P:SiteMap and created a branch for your features here:

https://github.com/jamesronan/Dancer-Plugin-SiteMap/tree/features/RT94833

The features you've submitted are great and I'm certainly going to pull them into the Plugin once I've tested them. There is a reason I haven't patched straight into master, however. Having briefly looked at the code, it seems to me that if someone were to opt to use the robots file as a source for the omitted routes via the config, and also specify another omission when using the plugin in code, one would trample the other. Obviously that is an "undocumented feature" that could cause someone issues.

So I think I'm going to look at re-working the sitemap_ignore keyword functionality to add routes to the omissions list rather than just assigning a new list from a given source.

You should be able to follow changes on the GitHub repo, if you use GitHub; however, I will keep you posted via RT, here.

Thanks again, and feel free to comment / keep the suggestions/patches coming! :D

Cheers,
JamesR

On Thu Apr 17 10:11:31 2014, SPUDSOUP wrote:
> I have attached a patch file with two proposed enhancements to
> Dancer::Plugin::SiteMap
>
> 1. Allow the URL of the sitemap to be modified or disabled entirely
> via the Dancer config.
>
> 2. Populate the sitemap_ignore list by parsing the robots.txt file for
> the site. This is so that you don't have to edit in two places to
> block a URL from the crawler.
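
To illustrate the clash described in the reply above, here is a minimal sketch contrasting an assigning update of the omission list with an appending one. The function names `assign_routes` and `append_routes` are hypothetical stand-ins, not the plugin's actual code; they only model the two behaviours on a plain array reference.

```perl
use strict;
use warnings;

# Hypothetical model of the "assign" behaviour: each call replaces
# whatever was in the omission list before.
sub assign_routes {
    my ( $current, @new ) = @_;
    return [@new];
}

# Hypothetical model of the "append" behaviour: each call adds to the
# existing list, skipping routes that are already present.
sub append_routes {
    my ( $current, @new ) = @_;
    my %seen = map { $_ => 1 } @$current;
    return [ @$current, grep { !$seen{$_}++ } @new ];
}

# Routes from robots.txt first, then one more from application code.
my $assigned = assign_routes( [], '/admin' );
$assigned = assign_routes( $assigned, 'ignore/this/route' );

my $appended = append_routes( [], '/admin' );
$appended = append_routes( $appended, 'ignore/this/route' );

print "assign: @$assigned\n";    # the robots.txt route has been lost
print "append: @$appended\n";    # both routes are kept
```

With the assigning version the second call silently discards the routes read from robots.txt, which is the trampling described above; the appending version keeps both sources.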
Hi there!

I loved this patch so much I made some tests (and fixed some minor issues, and documented everything as thoroughly as I could).

https://github.com/jamesronan/Dancer-Plugin-SiteMap/pull/1

Hope this helps to get this onto CPAN as soon as possible!

Cheers,

Breno
Hi Gents,

Many thanks to both of you for your work on this.

I've put the patch and the additions into the Plugin (along with a CONTRIBUTORS section in the POD ;) )

Dancer-Plugin-SiteMap-0.13 is now on CPAN and should show up via the searches shortly.

Thanks again!

JamesR

On Tue Apr 22 20:31:47 2014, GARU wrote:
> Hi there!
>
> I loved this patch so much I made some tests (and fixed some minor
> issues, and documented everything as thoroughly as I could).
>
> https://github.com/jamesronan/Dancer-Plugin-SiteMap/pull/1
>
> Hope this helps to get this onto CPAN as soon as possible!
>
> Cheers,
>
> Breno