
This queue is for tickets about the Parse-MediaWikiDump CPAN distribution.

Report information
The Basics
Id: 58196
Status: resolved
Priority: 0/
Queue: Parse-MediaWikiDump

People
Owner: triddle [...] cpan.org
Requestors: joachim [...] arti.vub.ac.be
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)

Attachments


Subject: error "not a MediaWiki link dump file" due to absence of 'LOCK TABLES ...' line in link dump file?
Date: Mon, 7 Jun 2010 14:45:33 +0200
To: bug-parse-mediawikidump [...] rt.cpan.org
From: Joachim De Beule <joachim [...] arti.vub.ac.be>
Dear,

When I try to parse the file
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pagelinks.sql.gz
(after unzipping) with the following code:

    my $pmwd  = Parse::MediaWikiDump->new;
    my $links = $pmwd->links('enwiki-latest-pagelinks.sql');

    while (my $link = $links->next) {
        ...
    }

then I always get the error "not a MediaWiki link dump file", which is due to an initialization in Links.pm that scans the file for the line

    LOCK TABLES `pagelinks` WRITE;

This line is indeed not part of the links file. What am I doing wrong? Should I use a different links file?

Thanks, Joachim.
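The check described here amounts to scanning the start of the SQL file for that LOCK TABLES statement. As a rough sketch only (illustrative, not the module's actual code; the function name and the 100-line cutoff are invented for the example):

    # Illustrative sketch -- not the real Parse::MediaWikiDump::Links code.
    # It shows the kind of header scan that produces the "not a MediaWiki
    # link dump file" error when the LOCK TABLES statement is missing.
    use strict;
    use warnings;

    sub looks_like_pagelinks_dump {
        my ($path) = @_;
        open my $fh, '<', $path or die "could not open $path: $!";

        # Only inspect the leading lines of the dump for the statement
        # the parser keys on.
        while (defined(my $line = <$fh>)) {
            return 1 if $line =~ /^LOCK TABLES `pagelinks` WRITE;/;
            last if $. > 100;
        }
        return 0;
    }

    die "not a MediaWiki link dump file\n"
        unless looks_like_pagelinks_dump('enwiki-latest-pagelinks.sql');

A dump produced without the LOCK TABLES statement therefore fails this test even though the rest of the file is perfectly parseable.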
Subject: Re: [rt.cpan.org #58196] follow up
Date: Mon, 7 Jun 2010 17:30:15 +0200
To: bug-Parse-MediaWikiDump [...] rt.cpan.org
From: Joachim De Beule <joachim [...] arti.vub.ac.be>
Greetings again,

This is just to say that when I manually add the required line to the links file, the following error occurs:

    not well-formed (invalid token) at line 1, column 7, byte 7 at
    /home/joachim/perl/lib/lib/perl5/site_perl/5.8.8//Parse/MediaWikiDump/Revisions.pm line 228

...
Hello Joachim,

This round of dump files seems to be missing the lock table statements. I've attached a test version of Parse::MediaWikiDump which should be able to handle this without issue; please try it and let me know if it works for you. If so I'll upload it as the next official version.

Cheers, Tyler
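A guess at the shape of such a fix (only a guess; the attached test release may well do something different) is to accept the dump when its header either contains the LOCK TABLES statement or otherwise clearly refers to the pagelinks table, since mysqldump also emits CREATE TABLE and INSERT INTO statements for it:

    # Hypothetical, more tolerant variant of the header check.
    use strict;
    use warnings;

    sub is_pagelinks_dump {
        my ($path) = @_;
        open my $fh, '<', $path or die "could not open $path: $!";

        while (defined(my $line = <$fh>)) {
            return 1 if $line =~ /^LOCK TABLES `pagelinks` WRITE;/;
            return 1 if $line =~ /^(?:CREATE TABLE|INSERT INTO) `pagelinks`/;
            last if $. > 1000;    # only look at the header region
        }
        return 0;
    }

    print is_pagelinks_dump('enwiki-latest-pagelinks.sql')
        ? "looks like a pagelinks dump\n"
        : "not a MediaWiki link dump file\n";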
Subject: Parse-MediaWikiDump-1.0.6_01.tar.gz

Message body not shown because it is not plain text.

Parse::MediaWikiDump::Revisions can only handle XML dump files, not SQL dump files.

Cheers, Tyler
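In other words, the two dump formats need different entry points: pages() (built on Parse::MediaWikiDump::Revisions and an XML parser) is for the XML page dumps, while links() is for the pagelinks SQL dump. A small sketch of how one might check which kind of file is about to be fed in; the first-line markers are assumptions about how the dumps usually begin, not anything the module itself does:

    use strict;
    use warnings;

    # Guess the dump type from the first line of the file: XML page dumps
    # open with an XML declaration or <mediawiki ...>, while the SQL table
    # dumps open with mysqldump comments such as "-- MySQL dump ...".
    sub guess_dump_type {
        my ($path) = @_;
        open my $fh, '<', $path or die "could not open $path: $!";
        my $first = <$fh>;
        return 'unknown' unless defined $first;
        return 'xml' if $first =~ /^\s*(?:<\?xml|<mediawiki)/;
        return 'sql' if $first =~ /^--/ || $first =~ m{^/\*!};
        return 'unknown';
    }

    # pages() expects the XML dump, links() expects the pagelinks SQL dump.
    print guess_dump_type('enwiki-latest-pages-articles.xml'), "\n";
    print guess_dump_type('enwiki-latest-pagelinks.sql'), "\n";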
Subject: Re: [rt.cpan.org #58196] error "not a MediaWiki link dump file" due to absence of 'LOCK TABLES ...' line in link dump file?
Date: Tue, 8 Jun 2010 10:36:32 +0200
To: bug-Parse-MediaWikiDump [...] rt.cpan.org
From: Joachim De Beule <joachim [...] arti.vub.ac.be>
Dear Tyler,

Thanks a lot for the quick reply! I tried your new version and the problem seems to have been solved, except that now I still get the error from Revisions.pm.

However, I never call on this module. Here is the code I use (sorry if this is a bit long; the point is that I never call upon Revisions and still get the error). Also please take into account that I am not a perl programmer at all ;)

    use Parse::MediaWikiDump;

    my $wikidir   = "/home/joachim/WPen/";
    my $wikidump  = "enwiki-latest-pages-articles.xml.bz2";
    my $wikilinks = "enwiki-latest-pagelinks.sql";

    my $pmwd  = Parse::MediaWikiDump->new;
    my $links = $pmwd->links($wikidir . $wikilinks);
    my $dump  = $pmwd->pages($wikidir . $wikidump);

    my %id2namespace;
    my %title2id;
    my %redirects;

    open(PAGE2ID,       ">" . $wikidir . "Page2ID.dat");
    open(PAGE2ID_R,     ">" . $wikidir . "Page2ID_R.dat");
    open(PAGE2PAGE,     ">" . $wikidir . "Page2Page.dat");
    open(PAGE_R2PAGE_R, ">" . $wikidir . "Page2Page_R.dat");
    open(REDIRECTS,     ">" . $wikidir . "PageRedirects.dat");

    binmode(PAGE2ID,       ':utf8');
    binmode(PAGE2ID_R,     ':utf8');
    binmode(PAGE2PAGE,     ':utf8');
    binmode(PAGE_R2PAGE_R, ':utf8');
    binmode(REDIRECTS,     ':utf8');

    # build a map from namespace ids to namespace names
    foreach (@{$dump->namespaces}) {
        my $id   = $_->[0];
        my $name = $_->[1];
        $id2namespace{$id} = $name;
    }

    # build a map between article titles and article ids
    while (my $page = $dump->next) {
        my $id        = $page->id;
        my $title     = $page->title;
        my $namespace = $page->namespace;
        if ($namespace eq '') {
            print PAGE2ID $title . "," . $id . "\n";
            $title2id{$title} = $id;
        }
    }

    # reset for second sweep
    $dump = $pmwd->pages($wikidir . $wikidump);

    # build and write a map for the redirects
    while (my $page = $dump->next) {
        my $id        = $page->id;
        my $title     = $page->title;
        my $namespace = $page->namespace;
        if ($namespace eq '') {
            if ($page->redirect) {
                my $new_id = $title2id{$page->redirect};
                if ($new_id) {
                    $redirects{$id} = $new_id;
                    $title2id{$title} = $new_id;
                    print REDIRECTS $id . "," . $new_id . "\n";
                    print PAGE2ID_R $title . "," . $new_id . "\n";
                }
            }
        }
    }

    # now build the links file
    while (my $link = $links->next) {
        my $namespace      = $link->namespace;
        my $namespace_name = $id2namespace{$namespace};
        if ($namespace_name eq '') {
            my $from_id = $link->from;
            my $to_name = $link->to;
            my $to_id   = $title2id{$to_name};
            if ($to_id) {
                print PAGE2PAGE $from_id . "," . $to_id . "\n";
                if ($redirects{$from_id}) {
                    $from_id = $redirects{$from_id};
                }
                if ($redirects{$to_id}) {
                    $to_id = $redirects{$to_id};
                }
                print PAGE_R2PAGE_R $from_id . "," . $to_id . "\n";
            }
        }
    }
Hi Joachim,

I don't debug other people's code, but this info might be helpful: Parse::MediaWikiDump::Pages is a subclass of Parse::MediaWikiDump::Revisions, but the actual error in this instance is the XML parser saying you did not give it a valid document. Off the top of my head: do you need to uncompress the file first?

Tyler
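That question points at the likely culprit: the script above hands pages() the compressed enwiki-latest-pages-articles.xml.bz2 file directly, and compressed bytes are not well-formed XML, which fits the "invalid token at line 1" error. A minimal sketch of one way around it, assuming bzcat is available and that pages() also accepts an already open file handle as its source (if it only takes file names on your version, decompress to a plain .xml file first, e.g. with bunzip2 -k, and pass that name instead):

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    my $wikidir  = "/home/joachim/WPen/";
    my $wikidump = "enwiki-latest-pages-articles.xml.bz2";

    # Stream the decompressed XML through a pipe instead of passing the
    # .bz2 file name straight to pages().
    open(my $xml_fh, '-|', 'bzcat', $wikidir . $wikidump)
        or die "could not start bzcat: $!";

    my $pmwd = Parse::MediaWikiDump->new;
    my $dump = $pmwd->pages($xml_fh);

    while (my $page = $dump->next) {
        # ... process pages exactly as in the script above ...
    }

    # Note: a pipe cannot be rewound, so the "reset for second sweep" step
    # in the script above would need a fresh bzcat pipe opened the same way.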