Hi Joachim,
I don't debug other people's code, but this info might be helpful:
Parse::MediaWikiDump::Pages is a subclass of Parse::MediaWikiDump::Revisions, which is
why the error is reported from Revisions.pm. The actual error in this instance is the XML
parser saying you did not give it a valid document.
Off the top of my head, do you need to uncompress the file first? Your script passes the
.xml.bz2 file straight to pages().
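
If that is the cause, something along these lines might help. It is only a rough sketch, not tested against your dump: it checks for the bzip2 magic bytes and, if found, reads the file through a pipe from the bzip2 command-line tool (assumed to be on your PATH). The helper names is_bzip2 and open_dump are made up for this example and are not part of Parse::MediaWikiDump.

```perl
use strict;
use warnings;

# Return true if the file begins with the bzip2 magic bytes "BZh".
sub is_bzip2 {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $magic, 3);
    close($fh);
    return defined $got && $got == 3 && $magic eq 'BZh';
}

# Return a read handle that yields plain XML, decompressing if needed.
# Assumption: the external "bzip2" program is installed and on PATH.
sub open_dump {
    my ($path) = @_;
    my $fh;
    if (is_bzip2($path)) {
        # List-form pipe open: no shell involved, so odd file names are safe.
        open($fh, '-|', 'bzip2', '-dc', $path)
            or die "Cannot run bzip2 on $path: $!";
    } else {
        open($fh, '<', $path) or die "Cannot open $path: $!";
    }
    return $fh;
}

# Intended use (not run here); pages() takes a file name or an open handle:
#   my $dump = $pmwd->pages(open_dump($wikidir . $wikidump));
```

Since your script reads the dump twice, you would need to call open_dump again before the second sweep, because a pipe cannot be rewound. Uncompressing the file once with bunzip2 and pointing the script at the .xml file is probably simpler.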
Tyler
On Tue Jun 08 04:36:47 2010, joachim@arti.vub.ac.be wrote:
> Dear Tyler,
>
> Thanks a lot for the quick reply! I tried your new version and the problem
> seems to have been solved, except that now I still get the error from
> Revisions.pm.
>
> However, I never call on this module. Here is the code I use (sorry if this
> is a bit long, the point is that I never call upon revisions and still get
> the error?):
>
> (also please take into account that I am not a Perl programmer at all ; )
>
> my $wikidir = "/home/joachim/WPen/";
> my $wikidump = "enwiki-latest-pages-articles.xml.bz2";
> my $wikilinks = "enwiki-latest-pagelinks.sql";
>
>
> my $pmwd = Parse::MediaWikiDump->new;
> my $links = $pmwd->links($wikidir . $wikilinks);
> my $dump = $pmwd->pages($wikidir . $wikidump);
> my %id2namespace;
> my %title2id;
> my %redirects;
>
> open(PAGE2ID, ">" . $wikidir . "Page2ID.dat");
> open(PAGE2ID_R, ">" . $wikidir . "Page2ID_R.dat");
> open(PAGE2PAGE, ">" . $wikidir . "Page2Page.dat");
> open(PAGE_R2PAGE_R, ">" . $wikidir . "Page2Page_R.dat");
> open(REDIRECTS, ">" . $wikidir . "PageRedirects.dat");
>
> binmode(PAGE2ID, ':utf8');
> binmode(PAGE2ID_R, ':utf8');
> binmode(PAGE2PAGE, ':utf8');
> binmode(PAGE_R2PAGE_R, ':utf8');
> binmode(REDIRECTS, ':utf8');
>
> #build a map between namespace ids to namespace names
> foreach (@{$dump->namespaces}) {
> my $id = $_->[0];
> my $name = $_->[1];
>
> $id2namespace{$id} = $name;
> }
>
> # build a map between article titles and article ids
> while(my $page = $dump->next) {
>
> my $id = $page->id;
> my $title = $page->title;
> my $namespace = $page->namespace;
> if ($namespace eq '') {
> print PAGE2ID $title . "," . $id . "\n";
> $title2id{$title} = $id;
> }
> }
>
> # # reset for second sweep
> $dump = $pmwd->pages($wikidir . $wikidump);
>
> # build and write a map for the redirects
> while(my $page = $dump->next) {
>
> my $id = $page->id;
> my $title = $page->title;
> my $namespace = $page->namespace;
> if ($namespace eq '') {
> if ($page->redirect) {
> my $new_id = $title2id{$page->redirect};
> if ($new_id) {
> $redirects{$id} = $new_id;
> $title2id{$title} = $new_id;
> print REDIRECTS $id . "," . $new_id . "\n";
> print PAGE2ID_R $title . "," . $new_id . "\n";
> }
> }
> }
> }
>
> # now build the links file
> while(my $link = $links->next) {
> my $namespace = $link->namespace;
> my $namespace_name = $id2namespace{$namespace};
> if ($namespace_name eq '') {
> my $from_id = $link->from;
> my $to_name = $link->to;
> my $to_id = $title2id{$to_name};
> if ($to_id) {
> print PAGE2PAGE $from_id . "," . $to_id . "\n";
> if ($redirects{$from_id}) {
> $from_id = $redirects{$from_id};
> }
> if ($redirects{$to_id}) {
> $to_id = $redirects{$to_id};
> }
> print PAGE_R2PAGE_R $from_id . "," . $to_id . "\n";
> }
> }
> }
>
>