I'm having a hard time reproducing this bug. I iterated over the contents of
cvwiki-20090924-pages-articles.xml, checking the text() method of each page object for
undefined values, and none were returned. Here's the test I wrote:
#!/opt/local/bin/perl
use strict;
use warnings;
use Parse::MediaWikiDump;

# Path to the dump file is the first command-line argument.
my $file = shift(@ARGV);
my $dump = Parse::MediaWikiDump::Pages->new($file);
my $count = 0;

# Walk every page and stop at the first undefined result from text().
while (my $page = $dump->next) {
    die "bad" unless defined $page->text;
    $count++;
}

print "searched $count pages\n";
and the output:
Foodmotron:Parse-MediaWikiDump tyler$ ./50092.pl ~/tmp/cvwiki-20090924-pages-articles.xml
searched 17313 pages
Foodmotron:Parse-MediaWikiDump tyler$
I think I understand where the undefined values might be coming from, but I'm hesitant to
fix something before I understand exactly how it is broken. Can you submit some sample
code that shows the issue?
Thank you,
Tyler Riddle
On Tue Sep 29 08:52:13 2009, amir.aharoni@gmail.com wrote:
> Parse::MediaWikiDump::page::text() returns a reference to undef for
> pages that have an empty <text /> element. From what I've seen, these
> pages have <text xml:space="preserve" />.
>
> See for example in cvwiki the pages "Категори:Украин чĕлхи" and
> "Категори:Speedy deletion".
>
> There are two ways to improve it:
>
> 1. To return a reference to an empty string.
>
> 2. To add to the documentation of Parse::MediaWikiDump::page that for
> blank pages a reference to undef is returned.
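
The report above describes text() returning a reference to undef rather than undef itself. If that is accurate, a check with defined on the return value would pass for every page, because the reference is always defined even when the value behind it is not. Here is a minimal sketch of that distinction in plain Perl. The blank_page_text() helper is a hypothetical stand-in for a blank page and is not part of Parse::MediaWikiDump:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a page with an empty <text /> element:
# it returns a reference to undef, as the report describes.
# This is an assumption for illustration, not the module's own code.
sub blank_page_text {
    my $text = undef;
    return \$text;
}

my $ref = blank_page_text();

# The reference itself is defined, so this check passes.
print "reference defined: ", (defined($ref) ? "yes" : "no"), "\n";

# Dereferencing exposes the undefined value underneath.
print "value defined: ", (defined($$ref) ? "yes" : "no"), "\n";
```

This prints "reference defined: yes" followed by "value defined: no". Assuming text() does return a scalar reference, a check such as die "bad" unless defined ${ $page->text }; might be what it takes to surface the affected pages.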