
This queue is for tickets about the MediaWiki-DumpFile CPAN distribution.

Report information
The Basics
Id: 63453
Status: resolved
Priority: 0/
Queue: MediaWiki-DumpFile

People
Owner: triddle [...] cpan.org
Requestors: Pascal [...] Rockford.Com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.1.8
Fixed in: (no value)

Attachments


Subject: categories() does not work in MediaWiki::DumpFile::Compat
It's a missing scalar dereference. You can fix it by editing line 333 of MediaWiki/DumpFile/Compat.pm in your perl library directory and adding a second dereference to the pattern match, as shown below. I'll get an update onto CPAN some time soon.
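For reference, here is that change as a before/after (the $text variable evidently holds a scalar reference to the article text, hence the extra sigil):

    # MediaWiki/DumpFile/Compat.pm, line 333

    # before -- $text is a reference, so the match never sees the article text:
    while($text =~ m/\[\[$anchor:\s*([^\]]+)\]\]/gi) {

    # after -- dereference it first:
    while($$text =~ m/\[\[$anchor:\s*([^\]]+)\]\]/gi) {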
From: Pascal [...] Rockford.Com
First off, I would like to take this opportunity to thank you for writing and releasing your various Perl modules. You have obviously put a lot of time and hard work into them, and it is appreciated.

The good news is that the change does fix the problem. The bad news is that this bug appears to be why MediaWiki::DumpFile::Compat was so much faster than Parse::MediaWikiDump. With the bug fix in place, Parse::MediaWikiDump is actually faster than MediaWiki::DumpFile::Compat. Not by much, about 8% or so.
Thanks for spreading some love and filling me in. :-) I decided to sacrifice speed in the MediaWiki::DumpFile SQL parser to make it configurable and easier to maintain across a large set of SQL formats.

There was a contribution of a faster SQL parsing implementation for the whole suite, not just the backwards compatibility library, in another ticket (https://rt.cpan.org/Ticket/Display.html?id=53370), but I never got around to testing and integrating it. You might try that and see if it winds up being faster. Reports and patches on that topic would gladly be accepted too.

I'll leave this ticket open until I get the new version pushed out onto CPAN.

Cheers and happy coding,

Tyler
I realized you were probably talking about the processing speed of the ::Pages class, not the SQL parser (not sure where I got that from). So you are seeing a speed reduction in parsing the XML page dump archive? That's interesting and unexpected: all my tests indicated the opposite. However, I've noticed that XML parsing is highly dependent on the structure of the document, and especially on the ratio of markup to non-markup, which may account for the speed discrepancy you are seeing.

Would you mind sharing the dataset and code that show this behavior? I'm somewhat curious about how things are behaving in the real world.

Cheers,

Tyler
From: Pascal [...] Rockford.Com
I use your modules on Fedora; some of the below may not make much sense if you are not familiar with Linux. I only use your modules with one file from enwiki, just with different dates. I start off with:

    open(my $in, '-|', "bzcat enwiki-$date-pages-articles.xml.bz2") or die $!;
    my $pages = Parse::MediaWikiDump->new->pages($in);

Since these dumps are huge (6G compressed, 27G uncompressed), as you can see I use bzcat to decompress them on the fly. This has the side effect of letting me use top to see performance differences, because top lists my program and bzcat separately. When using Parse::MediaWikiDump, my program uses about 90% of the CPU and bzcat uses about 10%. The first time I tried MediaWiki::DumpFile::Compat, my program used about 80% and bzcat used about 20%, which basically means my program was processing data at about twice the old speed. Since the CPU used by your modules is counted as part of my program by top, and I changed nothing else, this seemed to indicate your new module was about twice as fast (consistent with your posted benchmarks).

The problem came when I compared the output to the previous run under your old module: none of the category logic in my program appeared to work. Thus, this bug report. After applying your patch and running again, bzcat dropped back to about 10%.

To get more exact performance comparisons I again use top, but I watch bzcat's CPU time until it hits 1 minute and then check the CPU time for my program. Whichever version of my program used less CPU time during bzcat's 1 minute of CPU time was the most efficient. Basically, bzcat will use the same amount of CPU time every time it decompresses the same file, so when bzcat has spent 1 minute decompressing that file it will always be at the same place in the file. So if one version of my program has used 5 minutes of CPU time when bzcat hits 1 minute, and another version has used 10 minutes, the version that only used 5 minutes is twice as fast, because it processed the same amount of data in half the time.

Because categories() returns undef instead of an empty array when there are no categories, one cannot simply do @{$page->categories}; it is first necessary to make sure $page->categories is not undef. I accomplished this with the code:

    next unless defined $page->categories;
    for $cat (@{$page->categories}) {

While fixing line 333 in Compat.pm I noticed that categories() was not caching its result, meaning this code actually causes all the code in categories() to execute twice. categories() in Parse::MediaWikiDump does cache the result, and this is what made MediaWiki::DumpFile::Compat slower than Parse::MediaWikiDump. After changing my code to basically:

    $cats = $page->categories;
    next unless defined $cats;
    for $cat (@{$cats}) {

MediaWiki::DumpFile::Compat pulled back into the lead speed-wise, but only by about 12%, nowhere near the 100% I had seen originally (and your benchmarks show).

I use four functions per page: title(), text(), namespace(), and categories(). Unless you use exactly these four functions when making your benchmarks, you will not get the same results I do. You don't say in your benchmarks which functions you used, but only using title() and text() would probably give you benchmarks that are most applicable to the average user. These are basically the absolute minimums, and anything else users can do in their own program (as is required for your new API). (These pieces are put together in the sketch after this message.)
Hopefully something in there was what you were looking for or actually helps in some way. Always happy to help.
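Putting those pieces together, a minimal sketch of the loop described above. The open and constructor lines and the categories() guard come straight from this report; the date value and the loop body are placeholders, and the next() iteration follows the standard Parse::MediaWikiDump interface:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    my $date = '20101130';   # placeholder; the report just says "different dates"

    # decompress on the fly so top shows bzcat and this program separately
    open(my $in, '-|', "bzcat enwiki-$date-pages-articles.xml.bz2") or die $!;
    my $pages = Parse::MediaWikiDump->new->pages($in);

    while (defined(my $page = $pages->next)) {
        my $title = $page->title;
        my $ns    = $page->namespace;
        my $text  = $page->text;      # text() returns a reference to the article text

        # categories() returns undef when a page has no category links, so
        # guard it -- and call it only once, since ::Compat did not cache it
        my $cats = $page->categories;
        next unless defined $cats;

        for my $cat (@$cats) {
            # ... per-page category processing goes here ...
        }
    }

The same loop works against MediaWiki::DumpFile::Compat, since it emulates the Parse::MediaWikiDump interface.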
From: Pascal [...] Rockford.Com
Sorry, I just reread your docs for MediaWiki::DumpFile, and you do basically say you used text() and title() for each page.

I also just reread my program. Actually, I am only calling namespace() on all pages. Right now I am only looking at pages in the main and Category namespaces, so if the namespace doesn't match one of those two I skip it; if it does match, then I call the others.

I was considering reimplementing with MediaWiki::DumpFile::FastPages and bringing namespace() and categories() internal; the problem is I would need namespaces_names() or an equivalent to implement namespace(), and MediaWiki::DumpFile::FastPages does not provide that. I'm not really sure why, since namespaces() would just pull some data from the beginning of the dump; storing that information at the beginning and making it available later would not slow down the main loop. I'll probably reimplement using MediaWiki::DumpFile::FastPages for the main loop after using MediaWiki::DumpFile::Compat for namespaces_names().

I also noticed that your categories() would probably be faster if you compiled the regex during new(), and namespace() might be faster if you built a hash during new() from namespaces_names() and looked the text before the colon up there, instead of walking nested arrays each time. (A rough sketch of both ideas is below.)
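A rough sketch of those two suggestions, strictly for illustration; the variable names and sample data here are made up and are not the real MediaWiki::DumpFile::Compat internals:

    use strict;
    use warnings;

    # placeholder data standing in for what ::Compat already has on hand
    my $anchor          = 'Category';
    my @namespace_names = ('Talk', 'User', 'Category', 'Template');
    my $title           = 'Category:Perl modules';
    my $article         = 'Some text [[Category:Perl modules]] and [[Category:CPAN]]';
    my $text            = \$article;   # the article text held as a scalar reference

    # 1. Compile the category-link pattern once (e.g. in new()) instead of
    #    rebuilding it on every categories() call.
    my $category_re = qr/\[\[$anchor:\s*([^\]]+)\]\]/i;

    my @categories;
    while ($$text =~ /$category_re/g) {
        push @categories, $1;
    }

    # 2. Build a hash of namespace names once so namespace() can do a single
    #    hash lookup on the text before the colon instead of walking nested
    #    arrays for every page.
    my %is_namespace = map { $_ => 1 } @namespace_names;

    my $namespace = '';
    if ($title =~ /^([^:]+):/ and $is_namespace{$1}) {
        $namespace = $1;
    }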
From: Pascal [...] Rockford.Com
I should probably clarify that all percentages above are relative to my program and are not a direct comparison between your modules. For example, if my code uses 75% of the CPU time and your module uses 25%, an increase in speed of 12.5% for the overall program with no changes in my code would indicate a 50% improvement in the performance of your module. So although I can say which of your modules is fastest in the way I am using them, none of the above should be interpreted as implying that any of your posted benchmarks are incorrect.
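To make that arithmetic concrete, reading "an increase in speed of 12.5%" as 12.5% less total CPU time and using the illustrative 75/25 split from the message (these numbers are examples, not measurements):

    use strict;
    use warnings;

    my $total_before  = 100;                         # CPU minutes, whole run
    my $my_code       = 75;                          # unchanged between runs
    my $module_before = $total_before - $my_code;    # 25

    my $total_after   = $total_before * (1 - 0.125); # 12.5% less time overall
    my $module_after  = $total_after - $my_code;     # 12.5

    # the module's share went from 25 to 12.5 CPU minutes -- a 50% improvement
    # in the module, even though the whole program only improved by 12.5%
    my $module_improvement = 1 - $module_after / $module_before;   # 0.50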
Thank you for all your feedback! Your comments about the missing features in ::FastPages got me thinking. FastPages started life as a bit of research I did on measuring the parsing performance of the XML parsers and parsing schemes on CPAN. Using the LibXML reader the way ::FastPages did was quick, but it made it very difficult to parse the whole document, which is why it doesn't understand any of the metadata or other content in the article aside from titles and text. However, as I found out last night, this wasn't strictly necessary: the base parsing package for ::Pages has a wrapper around the LibXML reader interface, and I was able to integrate ::FastPages into ::Pages as an option. Now you can turn fast mode on and off in ::Pages, as well as control whether each iteration is fast or slow; fast mode also uses a duck-typed stand-in to provide the same API for a page in either mode.

A test version of MediaWiki::DumpFile is attached. It fixes your original bug report and the lack of caching in ::Compat, and it integrates ::FastPages and ::Pages. The categories method is still missing, as is namespaces_names (but you now get the namespaces method, and fast parsing still works).

The reason categories is missing from MediaWiki::DumpFile is that support for that feature is difficult to do right; difficult enough that I decided not to repeat the same incomplete implementation that Parse::MediaWikiDump had. It is easy enough to check for the category definitions in the article text, but that is not the authoritative category association information. For one, a template may be included in the article which sets the category: the categories method has no knowledge of this and can't really have knowledge of it without pre-processing the full dump files (including templates) or going to the SQL dumps. This means there needs to be a way to pull context from the entire MediaWiki instance; it can be recreated from the dump files, but doing it right is a hassle. I have plans to create a class that can manage those operations, build up indexes, and do some caching, but it's not here today. After your feedback I've realized these missing features are probably slowing adoption of MediaWiki::DumpFile, which is not a good thing.

Thanks again for your feedback, it is sincerely appreciated! If you have any more thoughts or comments I'd love to hear them.
Subject: MediaWiki-DumpFile-0.1.9_01.tar.gz

Message body not shown because it is not plain text.

Hello again, I'd like to thank you one more time for your correspondence. It's helped put me into the frame of mind of someone using the software I've created, which has helped me to improve it. I've incorporated nearly all of your ideas and suggestions, plus the previous modification I made, with a lot more polish of MediaWiki::DumpFile::Compat, to produce MediaWiki::DumpFile version 0.2.0, which is out for some testing right now. I've attached a pre-release version to this ticket because it directly addresses the project you told me about. Unfortunately it may be too late to save you time if you've already implemented the work-arounds you described, but your use case was the direct inspiration for the modifications.

Here are the changelog entries you'll be interested in:

* Fast mode is here! ::Pages, ::FastPages, and ::Compat::Pages can all be very fast by giving up support for everything besides the titles and text contents of each entry in the dump file.
* Added the XML benchmarking suite I created to study XML processing speeds to the distro; hopefully more people will be interested in the shootout.
* Ported over the documentation from Parse::MediaWikiDump, giving ::Compat full documentation in this module as well.

The benchmark suite is what I used to measure the performance before, and to generate the updated performance metrics in the new documentation. I'm currently waiting for a run to finish measuring the performance of all modules on the English Wikipedia; that will also go into the documentation when it's done, before 0.2.0 ships.

Fast mode works great with the ::Compat libs now: just pass the fast_mode option to the constructor of Parse::MediaWikiDump::Pages when using the new named parameters interface (a sketch is below). Fast mode lets the compat Parse::MediaWikiDump::Pages class parse over 20 megs a second even on troublesome dumps; the English Wikipedia dumps should be much, much faster, I predict closer to 40 meg/sec.

Feedback would be appreciated if you've got any more; otherwise I'm going to close this ticket once I publish 0.2.0.

Cheers,

Tyler
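A sketch of what that looks like from the caller's side. The fast_mode option is the one named above; the "input" parameter name and the file name are assumptions, so check the ::Compat documentation shipped with 0.2.0 for the exact spelling:

    use strict;
    use warnings;
    use MediaWiki::DumpFile::Compat;   # provides the Parse::MediaWikiDump classes

    my $pages = Parse::MediaWikiDump::Pages->new(
        input     => 'enwiki-pages-articles.xml',   # assumed parameter name
        fast_mode => 1,                             # the option described above
    );

    while (defined(my $page = $pages->next)) {
        # in fast mode only the title and text of each page are available
        my $title = $page->title;
        my $text  = $page->text;
    }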
Subject: MediaWiki-DumpFile-0.2.0_03.tar.gz

Message body not shown because it is not plain text.

I just got 0.2.0 of MediaWiki::DumpFile pushed out, incorporating all the previously mentioned changes plus regex compilation in MediaWiki::DumpFile::Compat.

Thanks again for your bug report.

Cheers,

Tyler Riddle