This queue is for tickets about the Text-Corpus-VoiceOfAmerica CPAN distribution.

Report information
The Basics
Id: 89625
Status: new
Priority: 0/
Queue: Text-Corpus-VoiceOfAmerica

People
Owner: Nobody in particular
Requestors: billniebel [...] comcast.net
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



CC: jeff.kubina [...] gmail.com
Subject: VoiceOfAmerica.pm doesn't expect what VOA now sends: a sitemap *index* (and with individual sitemaps .xml.gz)
Date: Sat, 19 Oct 2013 13:29:36 -0400
To: bug-Text-Corpus-VoiceOfAmerica [...] rt.cpan.org
From: William Niebel <billniebel [...] comcast.net>
Jeff-

I recognized that the Voice of America transcripts would help me do text analysis. I've used Perl for many years, so I was happy to find your Text::Corpus::VoiceOfAmerica and installed it this morning. It looks like it will save me lots of time. Thanks.

I didn't find a bug per se, but the Perl module seems to no longer work because of a change made at VOA. It looks like it expects a simple sitemap file from 'http://www1.voanews.com/sitemap.xml'. VOA still responds 200 with content, but now returns a sitemap *index* file instead.

Simple test script output includes "no urls found via XML parsing, 14 found using regular expression." because the sitemap index uses the loc tag, but not in the nesting '/x:urlset/x:url/x:loc' that a plain sitemap uses.

Another complication: the several individual sitemap files referenced by the VOA-returned sitemap index are now all .xml.gz, not simply .xml.

I'll tarry a bit in case you jump on this, and can fashion a workaround if not.

Again, many thanks for your module. I'm looking forward to using it.

-Bill
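
For illustration only, here is a minimal sketch of the two-level lookup the report describes. It is written in Python, not as a patch to VoiceOfAmerica.pm. It assumes the index follows the sitemaps.org layout (`sitemapindex/sitemap/loc`) with the standard namespace, which the ticket does not confirm. The names `maybe_gunzip`, `parse_sitemap`, and `collect_urls` are hypothetical, and `fetch` is a caller-supplied function that returns the raw bytes for a URL.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumption: VOA uses the standard sitemaps.org namespace for both file types.
NS = {"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress the payload if it starts with the gzip magic bytes."""
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data


def parse_sitemap(data: bytes):
    """Return (kind, locs), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(maybe_gunzip(data))
    tag = root.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
    if tag == "sitemapindex":
        path = "x:sitemap/x:loc"
        kind = "index"
    elif tag == "urlset":
        path = "x:url/x:loc"
        kind = "urlset"
    else:
        raise ValueError("unexpected root element: " + tag)
    locs = [e.text.strip() for e in root.findall(path, NS) if e.text]
    return kind, locs


def collect_urls(start_url, fetch):
    """Follow a sitemap index (if any) down to the page URLs it lists."""
    kind, locs = parse_sitemap(fetch(start_url))
    if kind == "urlset":
        return locs
    urls = []
    for loc in locs:
        urls.extend(collect_urls(loc, fetch))
    return urls
```

Checking the gzip magic bytes, instead of trusting the .xml.gz suffix, also covers the case where an HTTP client has already decompressed the response.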