
This queue is for tickets about the Parse-MediaWikiDump CPAN distribution.

Report information
The Basics
Id: 50092
Status: resolved
Priority: 0/
Queue: Parse-MediaWikiDump

People
Owner: Nobody in particular
Requestors: amir.aharoni [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 0.94
Fixed in: (no value)

Attachments


Subject: Some apparently valid pages return undefined text
Parse::MediaWikiDump::page::text() returns a reference to undef for pages that have an empty <text /> element. From what I've seen, these pages have <text xml:space="preserve" />.

See for example in cvwiki the pages "Категори:Украин чĕлхи" and "Категори:Speedy deletion".

There are two ways to improve it:

1. To return a reference to an empty string.

2. To add to the documentation of Parse::MediaWikiDump::page that for blank pages a reference to undef is returned.
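
Until the module itself returns a reference to an empty string, a caller could guard against the undefined value on its own side. This is a minimal sketch, assuming text() hands back a scalar reference as described above. text_or_empty is a hypothetical helper, not part of Parse::MediaWikiDump, and the references below are hand-built stand-ins rather than real page objects:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Parse::MediaWikiDump): takes the scalar
# reference that text() is described as returning and yields a plain string,
# mapping an undefined value to the empty string.
sub text_or_empty {
    my ($text_ref) = @_;
    return defined $$text_ref ? $$text_ref : '';
}

# Hand-built stand-ins for the two cases, not real page objects.
my $blank;                      # models an empty <text /> element
my $filled = 'some wikitext';

print length(text_or_empty(\$blank)), "\n";   # length of the empty string
print text_or_empty(\$filled), "\n";
```

In real use the argument would be $page->text for each page read from the dump.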
Thank you for the bug report. I think the proper thing to do here would be to return a reference to an empty string. I'll create a test for this bug and most likely resolve it some time in the next week or so.

Tyler

On Tue Sep 29 08:52:13 2009, amir.aharoni@gmail.com wrote:
> Parse::MediaWikiDump::page::text() returns a reference to undef for
> pages that have an empty <text /> element. From what i've seen, these
> pages have <text xml:space="preserve" /> .
>
> See for example in cvwiki the pages "Категори:Украин чĕлхи" and
> "Категори:Speedy deletion".
>
> There are two ways to improve it:
>
> 1. To return a reference to an empty string.
>
> 2. To add to the documentation of Parse::MediaWikiDump::page that for
> blank pages a reference to undef is returned.
I'm having a hard time reproducing this bug. I've iterated over the contents of cvwiki-20090924-pages-articles.xml, checking for any undefined values from the text() method on a page object, and no undefined values were returned. Here's the test I wrote:

    #!/opt/local/bin/perl

    use strict;
    use warnings;

    use Parse::MediaWikiDump;

    my $file = shift(@ARGV);
    my $dump = Parse::MediaWikiDump::Pages->new($file);
    my $count = 0;

    while(my $page = $dump->next) {
        die "bad" unless defined $page->text;
        $count++;
    }

    print "searched $count pages\n";

and the output:

    Foodmotron:Parse-MediaWikiDump tyler$ ./50092.pl ~/tmp/cvwiki-20090924-pages-articles.xml
    searched 17313 pages
    Foodmotron:Parse-MediaWikiDump tyler$

I think I understand where the undefined values might be coming from, but I'm hesitant to fix something before I understand exactly how it is broken. Can you submit some sample code that shows the issue?

Thank you,

Tyler Riddle
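
One thing worth checking in a test like the one above, assuming text() returns a scalar reference as the report says: defined applied to a reference is true even when the value it refers to is undef, so the check may need to dereference first. The sketch below uses a hand-built reference as a stand-in for the return value and does not load a dump:

```perl
use strict;
use warnings;

# Stand-in for what text() reportedly returns on a blank page:
# a reference to an undefined scalar.
my $blank;
my $text_ref = \$blank;

# The reference itself is a defined value ...
my $ref_defined   = defined($text_ref)  ? 1 : 0;
# ... while the scalar it points to is not.
my $value_defined = defined($$text_ref) ? 1 : 0;

print "reference defined: $ref_defined\n";
print "value defined: $value_defined\n";
```

Under that assumption, a dump-scanning loop would test defined ${ $page->text } rather than defined $page->text.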
I was able to confirm and fix this bug. I also added a test for it to make sure it won't crop up again in the future. Version 0.95 is attached to this ticket so you can install it before the CPAN mirrors catch up. Thanks for the bug report, Tyler Riddle
Download Parse-MediaWikiDump-0.95.tar.gz
application/x-gzip 16.2k
