
This queue is for tickets about the Parse-MediaWikiDump CPAN distribution.

Report information
The Basics
Id: 38206
Status: resolved
Priority: 0/
Queue: Parse-MediaWikiDump

People
Owner: triddle [...] cpan.org
Requestors: codepoet [...] cs.umd.edu
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 0.51
Fixed in: (no value)

Attachments


Subject: Parse::MediaWikiDump XML dump file not closed on DESTROY
Hi Tyler,

First, thanks for maintaining such a useful module. Wikipedia dump files contain so much junk that it's hard to tell what's going on. Keep up the good work!

I am using Parse::MediaWikiDump to read about 30,000 XML dump files, in a loop such as this:

    foreach my $xml (@xml_files) {
        my $pages = Parse::MediaWikiDump::Pages->new($xml);
        ....
    }

The problem is that at some point Perl dies with a "too many open files" error. Is it possible to modify Parse::MediaWikiDump so that it closes its source file once it has finished parsing, or on destruction of the object? One possible location is near line 226 in the parse_more sub. An alternative is to have a simple DESTROY method such as:

    sub DESTROY {
        my ($self) = @_;
        if (defined $$self{SOURCE}) {
            close $$self{SOURCE};
        }
    }

I'm using Perl 5.8.8 on Linux 2.6.18.

Cheers,
Mike
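For anyone else reproducing this on Linux, one way to confirm that descriptors are piling up is to count the entries under /proc/self/fd inside the loop. This is a minimal sketch, not part of the original report; the dump file list is hypothetical and /proc/self/fd is Linux-specific:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    # Hypothetical list of dump files to read.
    my @xml_files = glob 'dumps/*.xml';

    foreach my $xml (@xml_files) {
        my $pages = Parse::MediaWikiDump::Pages->new($xml);

        # On Linux, /proc/self/fd holds one entry per open descriptor in this
        # process; a steadily climbing count means handles are not being closed.
        my $open_fds = () = glob '/proc/self/fd/*';
        print "$xml: $open_fds descriptors open\n";
    }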
Subject: Re: [rt.cpan.org #38206] Parse::MediaWikiDump XML dump file not closed on DESTROY
Date: Tue, 5 Aug 2008 10:58:51 -0700
To: bug-Parse-MediaWikiDump [...] rt.cpan.org
From: "Tyler Riddle" <triddle [...] gmail.com>
Whoooooooooooooooops - that's a bug. :-) I think you are correct that the proper way to solve the problem is via DESTROY(). I'll have a fix up Real Soon Now (TM) - last time it took a couple of days. I would appreciate it if you could test the fix too before I push it out to CPAN.

Thanks for finding a bug!

Tyler
-- Red Hat's vision for Unix is find|grep|sed|tee and a package manager that lies.
I started investigating the problem and came to a rather sad conclusion. I went to review my open() call and saw that I am opening with an anonymous filehandle, which should get closed once the instance is garbage collected. I then created a test case with an anonymous filehandle going out of scope and closing an open file, then changed it over to use Parse::MediaWikiDump->new() and let the object go out of scope. The anonymous filehandle behaves as expected; Parse::MediaWikiDump does not.

Because of this, calling close() in DESTROY is actually wrong: it masks the problem and will lead to leaking memory. At least one circular reference to $self has been found and weakened, but that did not solve the problem. I suspect more circular references are there, and my next task is to analyze the code and find them.

I'm not sure what a good workaround is in the interim. You can certainly close the filehandle in DESTROY if you have enough RAM (I'm not sure how much it is going to leak). I do have an idea for a dirty hack if you find it is leaking too much RAM: if possible, convert the program to process only one dump file per run and have the perl interpreter exit after each dump file. find and xargs might be needed; 30,000 items is a lot to pass on the command line, and I'm not sure off the top of my head if it's too much.

Tyler
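For readers following along, the kind of cycle being described looks roughly like the sketch below (illustrative only, not the module's actual code): an object that holds a strong reference back to itself is never reclaimed when it goes out of scope, so its DESTROY does not run and any filehandle it owns stays open; Scalar::Util::weaken is what breaks such a cycle.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Scalar::Util ();

    package Leaky;

    sub new {
        my ($class) = @_;
        my $self = bless {}, $class;
        $self->{ME} = $self;                  # circular reference: refcount never reaches zero
        # Scalar::Util::weaken($self->{ME});  # uncommenting this breaks the cycle
        return $self;
    }

    sub DESTROY { print "DESTROY called\n" }

    package main;

    {
        my $obj = Leaky->new;
    }
    print "scope exited\n";

    # With the cycle left in place, "scope exited" prints before "DESTROY
    # called": the object survives until global destruction, so any filehandle
    # it held would stay open the whole time. With the weaken() line enabled,
    # DESTROY runs as soon as $obj goes out of scope.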
I'm releasing a maintenance version of the module as 0.51.1, which documents the known issue here. Hopefully I'll get enough time to look into the memory leak soon.

Tyler
I think I got a fix in for the problem with the circular references. The event handlers were updated to use a small subset of the variables available to the instantiated Parse::MediaWikiDump::Pages object, which removed the circular references. I ran some tests and was able to open and close dump files more times than the maximum number of open filehandles.

Can you please test these changes before I publish them? The release candidate is attached.

Thanks,

Tyler
Download Parse-MediaWikiDump-0.53.tar.gz
application/x-gzip 14.1k

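The handler change described in the message above roughly follows the pattern sketched below (my own naming and element names, not the actual 0.53 source): the XML::Parser handlers close over a small, separate state hash instead of the full Pages object, so nothing inside the parser keeps the owning object alive.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    # Handlers capture only a small state hash, never the object that owns the
    # parser, so no reference cycle ties the parser to its owner. The element
    # name 'title' is just for illustration.
    my %state = ( in_title => 0, titles => [] );

    my $parser = XML::Parser->new(Handlers => {
        Start => sub { my ($expat, $element) = @_;
                       $state{in_title} = 1 if $element eq 'title'; },
        Char  => sub { my ($expat, $text) = @_;
                       push @{ $state{titles} }, $text if $state{in_title}; },
        End   => sub { my ($expat, $element) = @_;
                       $state{in_title} = 0 if $element eq 'title'; },
    });

    my $file = shift @ARGV or die "must specify an XML file";
    $parser->parsefile($file);
    print "found ", scalar @{ $state{titles} }, " titles\n";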

Subject: Re: [rt.cpan.org #38206] Parse::MediaWikiDump XML dump file not closed on DESTROY
Date: Tue, 11 Nov 2008 23:57:00 -0500 (EST)
To: Tyler Riddle via RT <bug-Parse-MediaWikiDump [...] rt.cpan.org>
From: Mike Lieberman <codepoet [...] umiacs.umd.edu>
Hi Tyler,

Thanks for getting back to me. I tried it and was able to open up to about 27,000 XML files before it gave me an "out of memory" message. But this is more than reasonable for my usage.

Mike
Moving forward then; I'm glad to hear that got one problem solved. It still seems it's leaking RAM, though; I'll check into that too.

Thanks for the reports!

Tyler
I did a test to see if I could get the module to leak RAM and I was not able to. I used the following test program:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Parse::MediaWikiDump;

    $| = 1;
    print '';

    my $dump_count = 0;
    my $article_count = 0;
    my $file = shift(@ARGV);

    die "must specify dump file" unless defined $file;

    while(1) {
        my $d = Parse::MediaWikiDump::Pages->new($file);

        $dump_count++;

        while(defined($d->next)) {
            $article_count++;
            print "\tdumps:$dump_count articles:$article_count\r";
        }

        print "\tdumps:$dump_count articles:$article_count\r";
    }

I ran that program on the test dump file that lives at t/pages_test.xml and on the Simple English Wikipedia dump file at http://download.wikimedia.org/simplewiki/20081029/simplewiki-20081029-pages-articles.xml.bz2 and I never saw perl go over 4 megs of RAM used, even after it processed hundreds of thousands of articles. Can you please use that test program, check the memory consumption while it's running, and let me know what the results are?

Tyler
Subject: Re: [rt.cpan.org #38206] Parse::MediaWikiDump XML dump file not closed on DESTROY
Date: Sat, 15 Nov 2008 16:50:23 -0500 (EST)
To: Tyler Riddle via RT <bug-Parse-MediaWikiDump [...] rt.cpan.org>
From: Mike Lieberman <codepoet [...] umiacs.umd.edu>
Tyler,

When I run the script you sent me the memory usage does not grow. I think this is because your test script iterates over all articles in the dump file. However, if you don't iterate over all the articles there appear to be memory leaks. Take a look at the following script, for example:

    #! /usr/bin/perl

    use warnings;
    use strict;

    use Parse::MediaWikiDump;

    my $file = shift @ARGV;

    for (;;) {
        my $p = Parse::MediaWikiDump::Pages->new($file);
        #while ($p->next) { }
    }

    0;

As is, the script leaks memory somewhere. However, if you uncomment the while() line, the memory usage doesn't grow.

Mike
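One plausible reading of this behaviour, offered as an assumption rather than a confirmed analysis of the module: Parse::MediaWikiDump drives expat incrementally (the original report mentions a parse_more sub), and XML::Parser's incremental interface only tears down its internal structures when the parse runs to completion, so abandoning a parse midway leaves them behind. A sketch of the interface involved:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Parser;

    my $parser = XML::Parser->new(Handlers => { Start => sub {} });

    # parse_start returns an XML::Parser::ExpatNB object that accepts data
    # incrementally; parse_done finishes the document and lets expat release
    # its internal structures.
    my $nb = $parser->parse_start;

    $nb->parse_more('<doc>');
    $nb->parse_more('<page/>');
    $nb->parse_more('</doc>');

    $nb->parse_done;   # skipping this (for instance when a loop abandons a
                       # parse midway) leaves those structures behind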
Thanks for the testing. I was able to fix the memory leak by wrapping expat with Object::Destroyer; it's a hack but it works pretty well. I've pushed version 0.54 out to CPAN, and it should fix the issue. I'm going to close the bug; please add more notes if something else crops up.

Thanks again for the bug reports and testing!

Tyler
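For the record, the Object::Destroyer pattern mentioned above works roughly as in the sketch below (a hypothetical wrapped class, not the actual 0.54 code): the wrapper calls a named cleanup method on the wrapped object when the wrapper itself goes out of scope, which is exactly what is needed when the object cannot clean itself up because of its own internal circular references.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Object::Destroyer;

    # Hypothetical resource with an internal circular reference, standing in
    # for the expat parser object.
    package CyclicResource;

    sub new     { my $self = bless {}, shift; $self->{ME} = $self; return $self }
    sub release { my $self = shift; delete $self->{ME}; print "released\n" }

    package main;

    {
        # The wrapper passes method calls through to the wrapped object, so it
        # can be used in its place; when the wrapper goes out of scope, its own
        # DESTROY calls release() on the wrapped object, breaking the cycle.
        my $resource = Object::Destroyer->new( CyclicResource->new, 'release' );
    }
    print "scope exited\n";   # "released" is printed before this line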