I use your modules on Fedora. Some of the below may not make much sense
if you are not familiar with Linux. I only ever use your modules on one
file from enwiki (the pages-articles dump), just with different dates. I
start off with:
use Parse::MediaWikiDump;
open(my $in, '-|', "bzcat enwiki-$date-pages-articles.xml.bz2") or die $!;
my $pages = Parse::MediaWikiDump->new->pages($in);
Since these dumps are huge (6G compressed, 27G uncompressed), as you can
see I use bzcat to decompress them on the fly. This has the side effect
of allowing me to use top to see performance differences because top
lists my program and bzcat separately. When using Parse::MediaWikiDump
my program uses about 90% of the cpu and bzcat uses about 10%. The
first time I tried MediaWiki::DumpFile::Compat my program used about 80%
and bzcat used about 20%. Since bzcat's cpu use is proportional to the
amount of data it decompresses, bzcat getting twice its old share means
data was flowing through the pipe about twice as fast; in other words my
program was processing data at about twice the old speed. Since the cpu
used by your modules is counted as part of my program by top, and I
changed nothing else, this seemed to indicate your new module was about
twice as fast (consistent with your posted benchmarks).
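Switching to the compat layer was just a matter of loading it in place of
the old module; as I understand the compat layer's synopsis it sets up the
Parse::MediaWikiDump namespace itself, so nothing else in my program had
to change (correct me if the drop-in usage is supposed to look different):

use MediaWiki::DumpFile::Compat;    # replaces: use Parse::MediaWikiDump;

# Everything below stays exactly as before, including the constructor:
my $pages = Parse::MediaWikiDump->new->pages($in);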
The problem came when I compared the output to the previous run under
your old module. None of the category logic in my program appeared to
work. Thus, this bug report. After applying your patch and running
again, bzcat dropped back to about 10%. To get more exact performance
comparisons I again used top, but this time I watched bzcat's cpu time
until it hit 1 minute and then checked the cpu time for my program.
Whichever version of my program used less cpu time during bzcat's 1 minute
of cpu time is the more efficient one.
Basically bzcat will use the same amount of cpu time every time it
decompresses the same file, so when bzcat has spent 1 minute decompressing
that file it will always be at the same place in the file. If one version
of my program has used 5 minutes of cpu time by the time bzcat hits 1
minute, and another version has used 10 minutes, the version that only
used 5 minutes is twice as fast, because it processed the same amount of
data in half the cpu time.
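I just eyeball all of this in top, but the same numbers can be read
straight out of /proc if you want to script the comparison. This is only a
rough sketch (standard Linux /proc/<pid>/stat layout, nothing specific to
your modules), not something my program does:

#!/usr/bin/perl
# Print the cumulative cpu time (user + system) of two processes, e.g.
# my program and bzcat, so two runs can be compared at the same point.
use strict;
use warnings;
use POSIX qw(sysconf _SC_CLK_TCK);

# utime and stime are fields 14 and 15 of /proc/<pid>/stat, in clock ticks;
# splitting on whitespace is fine as long as the process name has no spaces.
sub cpu_seconds {
    my ($pid) = @_;
    open(my $fh, '<', "/proc/$pid/stat") or die "cannot read pid $pid: $!";
    my @fields = split ' ', scalar <$fh>;
    return ($fields[13] + $fields[14]) / sysconf(_SC_CLK_TCK);
}

# Usage: <script> <program pid> <bzcat pid>
my ($prog_pid, $bzcat_pid) = @ARGV;
printf "program: %.0fs cpu, bzcat: %.0fs cpu\n",
    cpu_seconds($prog_pid), cpu_seconds($bzcat_pid);

When bzcat's number matches between two runs, the run whose program number
is smaller processed the same data with less cpu.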
Because categories() returns undef instead of a reference to an empty
array when a page has no categories, one cannot simply do
@{$page->categories}. It is first necessary to make sure $page->categories
is defined. I originally handled this with:
next unless defined $page->categories;
for my $cat (@{$page->categories}) {
While fixing line 333 in Compat.pm I noticed that categories() was not
caching its result, meaning the code above actually causes everything in
categories() to execute twice per page. categories() in
Parse::MediaWikiDump does cache its result, and that is what was making
MediaWiki::DumpFile::Compat slower than Parse::MediaWikiDump for me.
After changing my code to basically:
my $cats = $page->categories;
next unless defined $cats;
for my $cat (@{$cats}) {
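(For what it's worth, on Perl 5.10 or newer the same single-call pattern
fits entirely in the loop header by falling back to an empty array ref
with the defined-or operator:

for my $cat (@{ $page->categories // [] }) {
    # ... category logic ...
}

but that is just a style preference; it still calls categories() only once
per page.)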
MediaWiki::DumpFile::Compat pulled back into the lead speed-wise, but only
by about 12%, nowhere near the 100% I had seen originally (and that your
benchmarks show).
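As for the caching itself, I did not try to write a proper fix, but I
would guess the shape of it inside Compat.pm is just memoizing the parsed
list on the page object. Everything below (the field name, the extraction
regex) is made up for illustration and is not taken from your code:

# Hypothetical sketch only -- the real Compat.pm fields and parsing differ.
sub categories {
    my ($self) = @_;

    # Hand back the previously computed answer (which may legitimately be
    # undef) without re-parsing the page text.
    return $self->{_categories} if exists $self->{_categories};

    my @cats = ${ $self->text } =~ /\[\[Category:([^\]|]+)/gi;
    return $self->{_categories} = @cats ? \@cats : undef;
}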
I use four functions per page:
title()
text()
namespace()
categories()
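Stripped of the actual category logic, my per-page loop is roughly the
following (the date is just a placeholder and the loop body is
simplified):

#!/usr/bin/perl
use strict;
use warnings;
use Parse::MediaWikiDump;   # or the compat layer; the calls are the same

my $date = '20100130';      # placeholder; I only change this between dumps
open(my $in, '-|', "bzcat enwiki-$date-pages-articles.xml.bz2") or die $!;
my $pages = Parse::MediaWikiDump->new->pages($in);

while (defined(my $page = $pages->next)) {
    my $title = $page->title;
    my $ns    = $page->namespace;
    my $text  = ${ $page->text };     # text() returns a reference to the text
    my $cats  = $page->categories;    # may be undef
    next unless defined $cats;
    # ... category logic using $title, $ns, $text and @$cats ...
}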
Unless you use exactly these four functions when making your benchmarks,
you will not get the same results I do. You don't say in your benchmarks
which functions you used, but using only title() and text() would probably
give the benchmarks most applicable to the average user: those two are
basically the absolute minimum, and anything beyond them users can compute
in their own programs (as your new API requires them to).
Hopefully something in there was what you were looking for or actually
helps in some way. Always happy to help.