
This queue is for tickets about the Parse-MediaWikiDump CPAN distribution.

Report information
The Basics
Id: 21526
Status: rejected
Priority: 0/
Queue: Parse-MediaWikiDump

People
Owner: Nobody in particular
Requestors: jidanni [...] jidanni.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: print entire xml <page> intact
Date: Sat, 16 Sep 2006 20:41:35 +0800
To: bug-parse-mediawikidump [...] rt.cpan.org
From: Dan Jacobson <jidanni [...] jidanni.org>
It would be nice to be able to say remove pages from dumps:

    $ mydumpfilter --remove 'Nurdsburg:*' < dump.xml > newdump.xml
    #removes all pages in the Nurdsburg namespace.

But this means Parse::MediaWikiDump needs a new function to print the
entire <page> intact:

    if($title !~ m/$mypattern/){
        #optionally could change a name or date, etc. here, then
        print $entire_xml_page_instance}

Anyway, I could do

    while ( defined( my $page = $pages->next ) ) {
        my $title = $page->title;
        my $timestamp = $page->timestamp;
        my $text = $page->text;
        print <<EOF
    <page>
      <title>$title</title>
      <revision>
        <timestamp>$timestamp</timestamp>
        <contributor>
          <username>Jidanni</username>
        </contributor>
        <text xml:space="preserve">$$text</text>
      </revision>
    </page>
    EOF
    }

However, this will turn the
    &lt;!--fofofof--&gt;
in the dump into
    <!--fofofof-->
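The entity problem described above arises because the parser hands back decoded text, so it has to be escaped again before it is written into new XML. A minimal sketch of that step follows; `xml_escape` is a hypothetical helper written for this illustration and is not part of Parse::MediaWikiDump.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a module API):
# re-escape the characters that are special in XML element content.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first, or it would re-escape the entities added below
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# The decoded comment from the example is written back out as entities.
print xml_escape('<!--fofofof-->'), "\n";
```

In the loop above this would mean printing `xml_escape($$text)` and `xml_escape($title)` rather than the raw values. The output should then be well-formed, though not necessarily byte-identical to the original dump, since the dump may escape further characters (quotation marks, for instance).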
Hello Jidanni,

I finally got around to reading this feature request. I think it is a useful idea, but I don't think it belongs in the module: it would be hard to keep track of what needs to be added to the output whenever the dump file format changes, and in that case the printed pages would silently lose information. Silent data loss isn't a good thing.

Thank you for submitting the feature request, however.

Tyler

On Sat Sep 16 10:27:16 2006, jidanni@jidanni.org wrote:
> It would be nice to be able to say remove pages from dumps:
> $ mydumpfilter --remove 'Nurdsburg:*' < dump.xml > newdump.xml
> #removes all pages in the Nurdsburg namespace.
>
> But this means Parse::MediaWikiDump needs a new function to print the
> entire <page> intact:
>
> if($title !~ m/$mypattern/){
> #optionally could change a name or date, etc. here, then
> print $entire_xml_page_instance}
>
> Anyway, I could do
> while ( defined( my $page = $pages->next ) ) {
> my $title = $page->title;
> my $timestamp = $page->timestamp;
> my $text = $page->text;
> print <<EOF
> <page>
> <title>$title</title>
> <revision>
> <timestamp>$timestamp</timestamp>
> <contributor>
> <username>Jidanni</username>
> </contributor>
> <text xml:space="preserve">$$text</text>
> </revision>
> </page>
> EOF
> }
>
> However, this will turn the
> &lt;!--fofofof--&gt;
> in the dump into
> <!--fofofof-->
Subject: Re: [rt.cpan.org #21526] print entire xml <page> intact
Date: Fri, 08 May 2009 07:55:51 +0800
To: bug-parse-mediawikidump [...] rt.cpan.org
From: jidanni [...] jidanni.org
OK.