Subject: print entire xml <page> intact
Date: Sat, 16 Sep 2006 20:41:35 +0800
To: bug-parse-mediawikidump [...] rt.cpan.org
From: Dan Jacobson <jidanni [...] jidanni.org>
It would be nice to be able to remove pages from dumps, say:
$ mydumpfilter --remove 'Nurdsburg:*' < dump.xml > newdump.xml
#removes all pages in the Nurdsburg namespace.
But this means Parse::MediaWikiDump needs a new function to print the
entire <page> intact:
if ($title !~ m/$mypattern/) {
    # optionally could change a name or date, etc. here, then
    print $entire_xml_page_instance;
}
Anyway, I could do
while ( defined( my $page = $pages->next ) ) {
    my $title     = $page->title;
    my $timestamp = $page->timestamp;
    my $text      = $page->text;    # a reference to the page text
    print <<EOF;
<page>
<title>$title</title>
<revision>
<timestamp>$timestamp</timestamp>
<contributor>
<username>Jidanni</username>
</contributor>
<text xml:space="preserve">$$text</text>
</revision>
</page>
EOF
}
However, this will turn the
&lt;!--fofofof--&gt;
in the dump into
<!--fofofof-->
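
One possible workaround, as a minimal sketch only: re-escape the text before printing it. The xml_escape helper below is hypothetical and not part of Parse::MediaWikiDump; it assumes that $page->text hands back text whose XML entities have already been decoded, as the output above suggests.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Parse::MediaWikiDump): re-escape the
# characters that are special in XML character data.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # must run first, or the entities added below get escaped again
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

print xml_escape('<!--fofofof--> & more'), "\n";
# prints: &lt;!--fofofof--&gt; &amp; more
```

In the loop above that would mean printing xml_escape($$text) instead of $$text, and the title would presumably need the same treatment. It still does not reproduce the original <page> byte for byte, which is why a function that prints the page intact would be preferable.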