Bug #21526 for Parse-MediaWikiDump: print entire xml <page> intact

Sat Sep 16 10:27:16 2006 jidanni [...] jidanni.org - Ticket created

Subject:	print entire xml <page> intact
Date:	Sat, 16 Sep 2006 20:41:35 +0800
To:	bug-parse-mediawikidump [...] rt.cpan.org
From:	Dan Jacobson <jidanni [...] jidanni.org>

It would be nice to be able to say remove pages from dumps:
$ mydumpfilter --remove 'Nurdsburg:*' < dump.xml > newdump.xml
#removes all pages in the Nurdsburg namespace.

But this means Parse::MediaWikiDump needs a new function to print the
entire <page> intact:

if($title !~ m/$mypattern/){
  #optionally could change a name or date, etc. here, then
  print $entire_xml_page_instance}

Anyway, I could do
while ( defined( my $page = $pages->next ) ) {
  my $title = $page->title;
  my $timestamp = $page->timestamp;
  my $text = $page->text;
  print <<EOF
<page>
<title>$title</title>
<revision>
<timestamp>$timestamp</timestamp>
<contributor>
<username>Jidanni</username>
</contributor>
<text xml:space="preserve">$$text</text>
</revision>
</page>
EOF
}

However, this will turn the

in the dump into

Thu May 07 19:43:25 2009 triddle [...] cpan.org - Correspondence added

Hello Jidanni,

I finally got around to reading this feature request. I think it is a useful idea; however, I don't think it belongs in the module, because it's not easy to track when such a function would need to be extended as the dump file format changes. In that case it would lose information, and silent data loss isn't a good thing.

Thank you for submitting a feature request, however.

Tyler

On Sat Sep 16 10:27:16 2006, jidanni@jidanni.org wrote:

> It would be nice to be able to say remove pages from dumps:
> $ mydumpfilter --remove 'Nurdsburg:*' < dump.xml > newdump.xml
> #removes all pages in the Nurdsburg namespace.
>
> But this means Parse::MediaWikiDump needs a new function to print the
> entire <page> intact:
>
> if($title !~ m/$mypattern/){
>   #optionally could change a name or date, etc. here, then
>   print $entire_xml_page_instance}
>
> Anyway, I could do
> while ( defined( my $page = $pages->next ) ) {
>   my $title = $page->title;
>   my $timestamp = $page->timestamp;
>   my $text = $page->text;
>   print <<EOF
> <page>
> <title>$title</title>
> <revision>
> <timestamp>$timestamp</timestamp>
> <contributor>
> <username>Jidanni</username>
> </contributor>
> <text xml:space="preserve">$$text</text>
> </revision>
> </page>
> EOF
> }
>
> However, this will turn the
>
> in the dump into
>

Thu May 07 19:43:26 2009 The RT System itself - Status changed from 'new' to 'open'

Thu May 07 19:43:27 2009 triddle [...] cpan.org - Status changed from 'open' to 'rejected'

Thu May 07 19:56:12 2009 jidanni [...] jidanni.org - Correspondence added

Subject:	Re: [rt.cpan.org #21526] print entire xml <page> intact
Date:	Fri, 08 May 2009 07:55:51 +0800
To:	bug-parse-mediawikidump [...] rt.cpan.org
From:	jidanni [...] jidanni.org

OK.

Thu May 07 19:56:13 2009 The RT System itself - Status changed from 'rejected' to 'open'

Thu May 07 20:00:59 2009 triddle [...] cpan.org - Status changed from 'open' to 'rejected'