CC: | jeff.kubina [...] gmail.com |
Subject: | VoiceOfAmerica.pm doesn't expect what VOA now sends: a sitemap *index* (and with individual sitemaps .xml.gz) |
Date: | Sat, 19 Oct 2013 13:29:36 -0400 |
To: | bug-Text-Corpus-VoiceOfAmerica [...] rt.cpan.org |
From: | William Niebel <billniebel [...] comcast.net> |
Jeff-
I recognized that the Voice of America transcripts would help me do text analysis. I've used Perl for many years and so was happy to find your Text::Corpus::VoiceOfAmerica and installed it this morning. It looks like it will save me lots of time. Thanks.
I didn't find a bug per se, but the Perl module seems to no longer work because of a change made at VOA.
It looks like it expects a simple sitemap file from 'http://www1.voanews.com/sitemap.xml'
and in fact VOA still responds 200 with content but now returns a sitemap *index* file instead.
The output of a simple test script includes "no urls found via XML parsing, 14 found using regular expression." That is because a sitemap index also uses the loc tag, but nested as '/x:sitemapindex/x:sitemap/x:loc' rather than the '/x:urlset/x:url/x:loc' nesting of an ordinary sitemap.
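To make the difference concrete, here is a minimal sketch of telling the two document types apart and pulling the loc values from each. It is written in Python with only the standard library, purely as an illustration of the logic and not as a patch to the Perl module. The function name extract_locs is my own invention, and the example.com URL is a placeholder rather than anything VOA actually serves.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol; "x" is just a local prefix.
NS = {"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes):
    """Return (kind, locs) for a sitemap or sitemap index document.

    kind is "urlset" for an ordinary sitemap (locs are page URLs) or
    "sitemapindex" for an index (locs are URLs of further sitemaps).
    Hypothetical helper, not part of Text::Corpus::VoiceOfAmerica.
    """
    root = ET.fromstring(xml_bytes)
    # root.tag looks like "{namespace}localname"; keep only the local name.
    kind = root.tag.rsplit("}", 1)[-1]
    if kind == "sitemapindex":
        path = "x:sitemap/x:loc"
    elif kind == "urlset":
        path = "x:url/x:loc"
    else:
        raise ValueError("not a sitemap document: <%s>" % kind)
    locs = [el.text.strip() for el in root.findall(path, NS) if el.text]
    return kind, locs


# Placeholder input in the shape of a sitemap index.
SAMPLE_INDEX = (
    b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<sitemap><loc>http://example.com/sitemap_1.xml.gz</loc></sitemap>"
    b"</sitemapindex>"
)
kind, locs = extract_locs(SAMPLE_INDEX)
```

When kind comes back as "sitemapindex", each returned loc would itself need to be fetched and parsed the same way to reach the article URLs.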
Another complication: the several individual sitemap files referenced by the sitemap index that VOA returns are now all .xml.gz, not simply .xml.
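Handling the compressed files should be a small step. As a rough sketch, again in standard-library Python and only to show the idea, the downloaded bytes can be checked for the gzip magic number and decompressed before XML parsing. The helper name maybe_gunzip is hypothetical, and checking the leading bytes instead of the .gz extension is my own choice, so that plain .xml responses still pass through untouched.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(data):
    """Return decompressed bytes if data is a gzip stream, else data unchanged.

    Hypothetical helper: it inspects the content, not the URL, so it works
    for both .xml and .xml.gz sitemap files.
    """
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Stand-in for the body of a downloaded .xml.gz sitemap.
compressed = gzip.compress(b"<urlset/>")
xml_bytes = maybe_gunzip(compressed)
```

The result can then be handed to whatever XML parsing step already exists for uncompressed sitemaps.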
I'll tarry a bit in case you jump on this, and can fashion a workaround if not. Again, many thanks for your module. I'm looking forward to using it.
-Bill