Bug #36255 for Parse-MediaWikiDump: Parse::MediaWikiDump::page::namespace may return a string which is not really a namespace

Wed May 28 16:41:15 2008 amir.aharoni [...] mail.huji.ac.il - Ticket created

Subject:

Parse::MediaWikiDump::page::namespace may return a string which is not really a namespace

The namespace() function returns any string which appears before the first colon in the page's title. This may be any string even if it's not the name of one of the namespaces. I think that i managed to solve it and a patch is attached. What i did: * Some cosmetics - converted spaces to tabs (usually i prefer spaces, but most of the file used tabs, so i made it consistent.) * Added use List::Util; - so i can use the function first() * Added the function namespaces_names() to package Parse::MediaWikiDump::Pages. It returns an array ref to a plain list of namespace names and not a complex data structure that also includes the namespace numbers. Since it will be used often to check whether a string before the quotes is a namespace name, i thought that it's reasonable to make access to this list easy and not have to filter out the numbers every time. * In parse_head(): a namespace name is added to @{$data{namespaces_names}. * In parse_page(): the pages namespace is now found here. I didn't make any calculations, but i don't think that's it's very expensive. The algorithm: I use the same regex to find the string before the colon; then i check whether this string is one of the namespaces; if it is, then it's saved to %data as the page's namespace, otherwise the namespace is ''. * Parse::MediaWikiDump::page::namespace() now simply returns the namespace which was already found in parse_page(). * Update POD to include $pages->namespaces_names How i tested it: `make test` passes fine after my modifications. I didn't modify any tests, because i guess that i need to add a page to the dump to be able to test it properly, and i don't have a lot of experience playing with actual MediaWiki installations. However, i did run it with a script that i wrote that searches for pages without interlanguage links and it seems to do the right thing (see http://en.wikipedia.org/wiki/Wikipedia:WPIW/HE ). Any other comments are welcome.

Subject:

true_namespaces.patch

diff -Naur Parse-MediaWikiDump-0.40/lib/Parse/MediaWikiDump.pm Parse-MediaWikiDump-0.40.1/lib/Parse/MediaWikiDump.pm --- Parse-MediaWikiDump-0.40/lib/Parse/MediaWikiDump.pm 2006-06-21 22:39:22.000000000 +0200 +++ Parse-MediaWikiDump-0.40.1/lib/Parse/MediaWikiDump.pm 2008-05-28 21:07:12.850000000 +0200 @@ -16,6 +16,7 @@ use strict; use warnings; +use List::Util; use XML::Parser; #tokens in the buffer are an array ref with the 0th element specifying @@ -30,8 +31,8 @@ $$self{PARSER} = XML::Parser->new(ProtocolEncoding => 'UTF-8'); $$self{PARSER}->setHandlers('Start', \&start_handler, - 'End', \&end_handler); - $$self{EXPAT} = $$self{PARSER}->parse_start(state => $self); + 'End', \&end_handler); + $$self{EXPAT} = $$self{PARSER}->parse_start(state => $self); $$self{BUFFER} = []; $$self{CHUNK_SIZE} = 32768; $$self{BUF_LIMIT} = 10000; @@ -142,6 +143,11 @@ return $$self{HEAD}{namespaces}; } +sub namespaces_names { + my $self = shift; + return $$self{HEAD}{namespaces_names}; +} + sub current_byte { my $self = shift; return $$self{BYTE}; @@ -253,7 +259,10 @@ my $self = shift; my $buffer = shift; my $state = 'start'; - my %data = (namespaces => []); + my %data = ( + namespaces => [], + namespaces_names => [], + ); for (my $i = 0; $i <= $#$buffer; $i++) { my $token = $$buffer[$i]; @@ -375,6 +384,7 @@ } push(@{$data{namespaces}}, [$key, $name]); + push(@{$data{namespaces_names}}, $name); $token = $$buffer[++$i]; @@ -624,6 +634,18 @@ } } + $data{namespace} = ''; + # Many pages just have a : in the title, but it's not necessary + # a namespace designation. + if ($data{title} =~ m/^([^:]+)\:/) { + my $possible_namespace = $1; + if (List::Util::first { /^$possible_namespace$/ } + @{ $self->namespaces_names() }) + { + $data{namespace} = $possible_namespace; + } + } + $data{minor} = 0 unless defined($data{minor}); return \%data; @@ -732,17 +754,7 @@ sub namespace { my $self = shift; - return $$self{CACHE}{namespace} if defined($$self{CACHE}{namespace}); - - my $title = $$self{DATA}{title}; - - if ($title =~ m/^([^:]+)\:/) { - $$self{CACHE}{namespace} = $1; - return $1; - } else { - $$self{CACHE}{namespace} = ''; - return ''; - } + return $$self{DATA}{namespace}; } sub categories { @@ -1081,6 +1093,11 @@ namespace number and the second is the namespace name. In the case of namespace 0 the text stored for the name is ''. +=item $pages->namespaces + +Returns an array reference to the list of namespaces names in the instance, +without namespaces numbers. Main namespace name is ''. + =item $pages->current_byte Returns the number of bytes parsed so far.

Thu May 29 02:10:24 2008 triddle [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #36255] Parse::MediaWikiDump::page::namespace may return a string which is not really a namespace
Date:	Wed, 28 May 2008 23:10:08 -0700
To:	bug-Parse-MediaWikiDump [...] rt.cpan.org
From:	"Tyler Riddle" <triddle [...] gmail.com>

Hi Amir, Thank you, this is a great patch. I've reviewed the changes and I'll surely integrate it. I only have one nit-pick: I'd rather use the string comparison operator in the first() invocation over a regular expression just because it doesn't need a pattern match. Adding in a test case for this bug is simple and something that should be done along with the release for this bugfix. One of the <page> entries in the pages_test.xml file can be duplicated and a link created with a title that exposes the old bug. Adding a test case to the pages.t test is also straight forward, just duplicating the other existing test functions and changing the values specific to the test case. I think those changes as well as moving the test case data into the test directory would make for a great maintenance release. I'm just logging these things for the record, I don't expect you to do them. I'll get around to the next release as soon as I can possibly this week, possibly next. Thanks for letting me know about the bug and especially for fixing it. I'll be sure to put your name and the bug number in the changelog and into the authors section of the POD. Tyler On Wed, May 28, 2008 at 1:41 PM, Amir E. Aharoni via RT < bug-Parse-MediaWikiDump@rt.cpan.org> wrote: Show quoted text

> > Wed May 28 16:41:15 2008: Request 36255 was acted upon. > Transaction: Ticket created by amire80 > Queue: Parse-MediaWikiDump > Subject: Parse::MediaWikiDump::page::namespace may return a string > which > is not really a namespace > Broken in: 0.40 > Severity: Important > Owner: Nobody > Requestors: amir.aharoni@gmail.com > Status: new > Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=36255 > > > > The namespace() function returns any string which appears before the > first colon in the page's title. This may be any string even if it's not > the name of one of the namespaces. > > I think that i managed to solve it and a patch is attached. > > What i did: > > * Some cosmetics - converted spaces to tabs (usually i prefer spaces, > but most of the file used tabs, so i made it consistent.) > > * Added use List::Util; - so i can use the function first() > > * Added the function namespaces_names() to package > Parse::MediaWikiDump::Pages. It returns an array ref to a plain list of > namespace names and not a complex data structure that also includes the > namespace numbers. Since it will be used often to check whether a string > before the quotes is a namespace name, i thought that it's reasonable to > make access to this list easy and not have to filter out the numbers > every time. > > * In parse_head(): a namespace name is added to @{$data{namespaces_names}. > > * In parse_page(): the pages namespace is now found here. I didn't make > any calculations, but i don't think that's it's very expensive. The > algorithm: I use the same regex to find the string before the colon; > then i check whether this string is one of the namespaces; if it is, > then it's saved to %data as the page's namespace, otherwise the > namespace is ''. > > * Parse::MediaWikiDump::page::namespace() now simply returns the > namespace which was already found in parse_page(). > > * Update POD to include $pages->namespaces_names > > How i tested it: > > `make test` passes fine after my modifications. I didn't modify any > tests, because i guess that i need to add a page to the dump to be able > to test it properly, and i don't have a lot of experience playing with > actual MediaWiki installations. However, i did run it with a script that > i wrote that searches for pages without interlanguage links and it seems > to do the right thing (see http://en.wikipedia.org/wiki/Wikipedia:WPIW/HE). > > Any other comments are welcome. >

-- Red Hat's vision for Unix is findgrepsedtee and a package manager that lies.

Thu May 29 02:10:29 2008 The RT System itself - Status changed from 'new' to 'open'

Thu May 29 02:20:05 2008 amir.aharoni [...] mail.huji.ac.il - Correspondence added

From:

amir.aharoni [...] gmail.com

On Thu May 29 02:10:24 2008, triddle@gmail.com wrote: Show quoted text

> Thanks for letting me know about the bug and especially for fixing it. > I'll > be sure to put your name and the bug number in the changelog and into > the > authors section of the POD.

Thanks for the quick reply!

Fri May 30 02:01:03 2008 amir.aharoni [...] mail.huji.ac.il - Correspondence added

From:

amir.aharoni [...] gmail.com

On Thu May 29 02:10:24 2008, triddle@gmail.com wrote: Show quoted text

> Hi Amir, > Thank you, this is a great patch. I've reviewed the changes and I'll > surely > integrate it. I only have one nit-pick: I'd rather use the string > comparison > operator in the first() invocation over a regular expression just > because it > doesn't need a pattern match.

You are very right about it. $_ eq $possible_namespace is indeed simpler. There was also a nasty bug there - if $possible_namespace has metacharacters the regex may become invalid and crash the program in the middle of a long dump processing session. I forget quotemeta all the time. (Perl 6 is supposed to fix this one fine day...)

Sat May 31 12:59:43 2008 triddle [...] cpan.org - Correspondence added

From:

triddle [...] cpan.org

Hello again, I created a test case to identify the bug you reported, verified the test case did in fact catch the bug, then applied your patch. I think you will be very pleased to know that after applying your patch and with out any further modifications the bug was fixed. Thank you! :-) I have attached 0.51 to this bug report, can you please test it and let me know if it solves your bug? If not, could you update the test data in t/pages_test.xml to reproduce the problem, then send me a diff of the test data? Thanks again, Tyler On Wed May 28 16:41:15 2008, amire80 wrote: Show quoted text

> The namespace() function returns any string which appears before the > first colon in the page's title. This may be any string even if it's not > the name of one of the namespaces. > > I think that i managed to solve it and a patch is attached. > > What i did: > > * Some cosmetics - converted spaces to tabs (usually i prefer spaces, > but most of the file used tabs, so i made it consistent.) > > * Added use List::Util; - so i can use the function first() > > * Added the function namespaces_names() to package > Parse::MediaWikiDump::Pages. It returns an array ref to a plain list of > namespace names and not a complex data structure that also includes the > namespace numbers. Since it will be used often to check whether a string > before the quotes is a namespace name, i thought that it's reasonable to > make access to this list easy and not have to filter out the numbers > every time. > > * In parse_head(): a namespace name is added to @{$data{namespaces_names}. > > * In parse_page(): the pages namespace is now found here. I didn't make > any calculations, but i don't think that's it's very expensive. The > algorithm: I use the same regex to find the string before the colon; > then i check whether this string is one of the namespaces; if it is, > then it's saved to %data as the page's namespace, otherwise the > namespace is ''. > > * Parse::MediaWikiDump::page::namespace() now simply returns the > namespace which was already found in parse_page(). > > * Update POD to include $pages->namespaces_names > > How i tested it: > > `make test` passes fine after my modifications. I didn't modify any > tests, because i guess that i need to add a page to the dump to be able > to test it properly, and i don't have a lot of experience playing with > actual MediaWiki installations. However, i did run it with a script that > i wrote that searches for pages without interlanguage links and it seems > to do the right thing (see http://en.wikipedia.org/wiki/Wikipedia:WPIW/HE ). > > Any other comments are welcome.

Download Parse-MediaWikiDump-0.51.tar.gz
application/x-gzip 13.7k

Message body not shown because it is not plain text.

Sun Jun 01 16:02:20 2008 amir.aharoni [...] mail.huji.ac.il - Correspondence added

From:

amir.aharoni [...] gmail.com

On Sat May 31 12:59:43 2008, TRIDDLE wrote: Show quoted text

> Hello again, > > I have attached 0.51 to this bug report, can you please test it and > let me know if it solves > your bug? If not, could you update the test data in t/pages_test.xml > to reproduce the > problem, then send me a diff of the test data?

Tested with the dump i usually work on. Works fine for me. The only conceivable problem was with the installation. When i tried to run `make test' i got: /usr/bin/perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t t/00-load.........t/00-load.t is not readable t/links-compat....t/links-compat.t is not readable t/links...........t/links.t is not readable t/pages-compat....t/pages-compat.t is not readable t/pages...........t/pages.t is not readable FAILED--5 test scripts could be run, alas--no output ever seen make: *** [test_dynamic] Error 255 I chmod'ed everything to 777 and it was gone. (Maybe it's because i have Cygwin on Vista here and something was screwed up with the permissions.)

Sun Jun 01 18:15:13 2008 triddle [...] cpan.org - Correspondence added

Glad to hear the bug has been resolved, thanks so much for your patch. I double checked the tests in the release and they worked ok on my Mac, I suspect it was a Vista related problem you ran into. Regardless, the cpan testers will let me know if there are problems running the tests. Tyler On Sun Jun 01 16:02:20 2008, amire80 wrote: Show quoted text

> On Sat May 31 12:59:43 2008, TRIDDLE wrote:

> > Hello again, > > > > I have attached 0.51 to this bug report, can you please test it and > > let me know if it solves > > your bug? If not, could you update the test data in t/pages_test.xml > > to reproduce the > > problem, then send me a diff of the test data?

> > Tested with the dump i usually work on. Works fine for me. > > The only conceivable problem was with the installation. When i tried to > run `make test' i got: > > /usr/bin/perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, > 'blib/lib', 'blib/arch')" t/*.t > t/00-load.........t/00-load.t is not readable > t/links-compat....t/links-compat.t is not readable > t/links...........t/links.t is not readable > t/pages-compat....t/pages-compat.t is not readable > t/pages...........t/pages.t is not readable > FAILED--5 test scripts could be run, alas--no output ever seen > make: *** [test_dynamic] Error 255 > > I chmod'ed everything to 777 and it was gone. (Maybe it's because i have > Cygwin on Vista here and something was screwed up with the permissions.)

Sun Jun 01 18:15:49 2008 triddle [...] cpan.org - Correspondence added

Bug reporter has indicated 0.51 functions properly for him.

Sun Jun 01 18:15:51 2008 triddle [...] cpan.org - Status changed from 'open' to 'resolved'