Bug #42869 for HTML-TableExtract: Possible Memory Leak in TreeBuilder mode

Wed Jan 28 16:35:56 2009 greg [...] primate.net - Ticket created

Subject:	Possible Memory Leak in TreeBuilder mode
Date:	Wed, 28 Jan 2009 13:34:49 -0800
To:	bug-HTML-TableExtract [...] rt.cpan.org
From:	Greg Michalec <greg [...] primate.net>

Hi - I'm running into a problem using TableExtract to parse a directory of HTML files: the memory footprint of the process keeps growing. This does not occur when I use TableExtract in its default HTML::Parser mode, so I'm guessing there is a problem with the way TableExtract destroys its HTML::TreeBuilder object. According to the HTML::TreeBuilder documentation, its objects must be explicitly deleted, due to the nature of HTML::Element tree objects.

Here's a test script that exhibits the problem:

<code>
#!/usr/bin/perl
use HTML::TableExtract qw(tree);

# Three identical 100-row tables wrapped in a minimal document
my $table = "<table>" . "<tr><td>1</td><td>2</td></tr>" x 100 . "</table>";
my $html  = "<html><body>" . $table x 3 . "</body></html>";

foreach ( my $x = 0; $x <= 20; $x++ ) {
    my $p = HTML::TableExtract->new();
    $p->parse($html);
    $p->eof;
    $p->delete;

    # Print the total program size (first field of statm, in pages) after each pass
    if ( -f "/proc/$$/statm" ) {
        my $mem = `cat /proc/$$/statm`;
        $mem =~ s/^(\d+).*/$1/s;
        print "$x: $mem\n";
    }
}
</code>

Here's my system info:
Ubuntu 8.10 (2.6.27-9-generic x86_64)
perl v5.10.0 built for x86_64-linux-gnu-thread-multi
HTML::TableExtract 2.10-3
HTML::TreeBuilder 3.23
(all perl modules are from current ubuntu 8.10 packages)

Thanks!
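
For readers unfamiliar with why a tree of this kind needs an explicit delete, the following is a minimal sketch of the general mechanism, not the HTML::Element or HTML::TableExtract code. It assumes a hypothetical Node class in which parent and child hold strong references to each other; Perl's reference counting cannot free such a cycle on its own, so the nodes survive until something breaks the links by hand. The class, its methods, and the counter are all made up for illustration and use only core Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a tree element; NOT the HTML::Element implementation.
package Node;

my $destroyed = 0;    # counts how many Node objects have been freed

sub new {
    my ($class) = @_;
    return bless { parent => undef, children => [] }, $class;
}

sub add_child {
    my ($self, $child) = @_;
    $child->{parent} = $self;               # child -> parent (strong reference)
    push @{ $self->{children} }, $child;    # parent -> child completes the cycle
    return $child;
}

# Break the cycles by hand, which is what an explicit delete has to do.
sub delete {
    my ($self) = @_;
    $_->delete for @{ $self->{children} };
    $self->{children} = [];
    $self->{parent}   = undef;
    return;
}

sub DESTROY { $destroyed++ }
sub destroyed_count { return $destroyed }

package main;

# Case 1: the tree goes out of scope without delete; the cycle keeps it alive.
{
    my $root = Node->new;
    $root->add_child( Node->new );
}
my $leaked = Node::destroyed_count();

# Case 2: delete is called before the tree goes out of scope; both nodes are freed.
{
    my $root = Node->new;
    $root->add_child( Node->new );
    $root->delete;
}
my $freed = Node::destroyed_count() - $leaked;

print "freed without delete: $leaked\n";
print "freed with delete: $freed\n";
```

In this sketch the first block frees nothing until the interpreter exits, which is the same pattern of growth the statm figures in the report would show if a tree were dropped without its links being broken.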

Wed Aug 24 16:30:02 2011 MSISK [...] cpan.org - Correspondence added

Fixed in 2.11. Parsing status is now tracked via the eof() method, which parse_file() calls implicitly. When eof() is called, _reset_state() is invoked automatically.

Wed Aug 24 16:30:03 2011 The RT System itself - Status changed from 'new' to 'open'

Wed Aug 24 16:30:04 2011 MSISK [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Aug 24 16:30:04 2011 MSISK [...] cpan.org - Given to MSISK