Bug #89625 for Text-Corpus-VoiceOfAmerica: VoiceOfAmerica.pm doesn't expect what VOA now sends: a sitemap *index* (and with individual sitemaps .xml.gz)

CC:	jeff.kubina [...] gmail.com
Subject:	VoiceOfAmerica.pm doesn't expect what VOA now sends: a sitemap index (and with individual sitemaps .xml.gz)
Date:	Sat, 19 Oct 2013 13:29:36 -0400
To:	bug-Text-Corpus-VoiceOfAmerica [...] rt.cpan.org
From:	William Niebel <billniebel [...] comcast.net>

Jeff- I recognized that the Voice of America transcripts would help me do text analysis. I've used Perl for many years and so was happy to find your Text::Corpus::VoiceOfAmerica and installed it this morning. It looks like it will save me lots of time. Thanks.

I didn't find a bug per se, but the Perl module seems to no longer work because of a change made at VOA. It looks like the module expects a simple sitemap file from 'http://www1.voanews.com/sitemap.xml'. VOA still responds 200 with content, but it now returns a sitemap *index* file instead.

Simple test script output includes "no urls found via XML parsing, 14 found using regular expression." That happens because a sitemap index does use the loc tag, but not inside the sitemap nesting '/x:urlset/x:url/x:loc' that the XML parsing looks for.

Another complication: the several individual sitemap files referenced by the VOA-returned sitemap index are now all .xml.gz, not simply .xml.

I'll tarry a bit in case you jump on this, and can fashion a workaround if not. Again, many thanks for your module. I'm looking forward to using it.

-Bill
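
For illustration only, the two changes described above (a sitemap index in place of a plain sitemap, and gzip-compressed child sitemaps) could be handled roughly as follows. This is a Python sketch, not a patch to VoiceOfAmerica.pm. It assumes the feed uses the standard sitemaps.org 0.9 namespace, and the names `parse_sitemap`, `collect_page_urls`, and the `fetch` callback are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumption: VOA's feed uses the standard sitemaps.org namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress data if it starts with the gzip magic bytes."""
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data


def parse_sitemap(data: bytes):
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(maybe_gunzip(data))
    ns = {"x": SITEMAP_NS}
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        # Index file: each loc points at another sitemap.
        kind, path = "index", "x:sitemap/x:loc"
    elif root.tag == f"{{{SITEMAP_NS}}}urlset":
        # Plain sitemap: each loc points at a page.
        kind, path = "urlset", "x:url/x:loc"
    else:
        raise ValueError(f"unexpected root element: {root.tag}")
    locs = [el.text.strip() for el in root.findall(path, ns) if el.text]
    return kind, locs


def collect_page_urls(data: bytes, fetch):
    """Follow a sitemap index (if any) down to page URLs.

    fetch is a caller-supplied function mapping a URL to raw bytes.
    """
    kind, locs = parse_sitemap(data)
    if kind == "urlset":
        return locs
    pages = []
    for loc in locs:
        pages.extend(collect_page_urls(fetch(loc), fetch))
    return pages
```

Here `fetch` is left to the caller; it could wrap something like `urllib.request.urlopen(url).read()`. Checking the gzip magic bytes rather than the .xml.gz extension means the same path also handles uncompressed child sitemaps.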