
This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 79118
Status: open
Priority: 0/
Queue: XML-LibXML

People
Owner: Nobody in particular
Requestors: andre.lang [...] webrausch.de
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in: 2.0004
Fixed in: (no value)



Subject: Memory leaks in Parser on broken docs with recover => 2
Hi Shlomi,

I'm using XML::LibXML in an application where I process a lot of HTML webpages. Some of these don't contain valid XML, so I set recover => 2.

Setting this flag causes a lot of memory leaks when calling XML::LibXML::Parser on an invalid file.

I see this behaviour on
Ubuntu 10.04 LTS
XML::LibXML 2.004
Perl 5.14.2, 5.16.1 and 5.17.3.

Older XML::LibXML versions show the same behaviour. I also compiled XML::LibXML against different versions of libxml, same problem. On Windows, also the same.

I attach a test case where the memory bug occurs and is reported by Devel::Leak. I supply a snippet of a problematic webpage, containing unknown tags and higher Unicode characters (which LibXML doesn't like at all, but that may be a different bug to report; I worked around it by converting them to entities). Basically, I encounter the leak on any page that is not parsable.

So, if you run the supplied libxml2.pl you will see several "new" lines indicating that a lot of SVs have been leaked. If recover is set to 0, there is no leak.

This bug may also relate to the similar bug #61507.

If you need any further information, please let me know how I can help.

Besides, thanks for your module. I switched from XML::XPath because XML::LibXML doesn't usually leak :)

Yours,
André
Subject: libxml-trouble-sample.html
Subject: libxml2.pl
# Feeding XML::LibXML with an invalid file, triggering memory leaks
use strict;
use warnings;
use Devel::Leak;
use Encode;
use XML::LibXML;

BEGIN {
    if ($] < 5.008) {
        require utf8;
        utf8->import();
    }
}
if ($] >= 5.008) {
    binmode STDOUT, ':encoding(UTF-8)';
}
use Carp;

check_libxml_memory();

sub check_libxml_memory {
    print "running\n";
    my $handle;
    my $leaveCount = 0;
    my $enterCount = Devel::Leak::NoteSV($handle);
    # print STDERR "ENTER: $enterCount SVs\n";
    {
        make_trouble();    # Trace how loading a bad doc affects memory
    }
    $leaveCount = Devel::Leak::CheckSV($handle);
    # print STDERR "\nLEAVE: $leaveCount SVs\n";
}

sub make_trouble {
    # Tries to load a bad XML file into XML::LibXML
    my $filenameIn = 'libxml-trouble-sample.html';
    local $/;    # Read whole file
    open(my $FILEIN, '<:utf8', $filenameIn)
        or die "Can't read file '$filenameIn' [$!]\n";
    my $str = <$FILEIN>;
    close($FILEIN);

    # Feeds the bad file to XML::LibXML.
    my $parser  = XML::LibXML->new;
    my $success = 1;
    # if recover is set to 0 or 1, the problem ceases to exist
    my $doc = $parser->parse_html_string($str, {recover => 2, encoding => 'UTF-8'});
}

1;
From: andre.lang [...] webrausch.de
One more word: my test case reads the file through the :utf8 layer, which I have since found out you shouldn't do. Still, changing it to :raw doesn't make any difference.

Yours,
Andre
Subject: Re: [rt.cpan.org #79118] Memory leaks in Parser on broken docs with recover => 2
Date: Wed, 22 Aug 2012 09:10:47 +0200
To: bug-XML-LibXML [...] rt.cpan.org
From: Christian Glahn <christian.glahn [...] lo-f.at>
Hi André,

a brief remark regarding UTF handling problems: If you encounter problems with handling UTF characters you should report these issues directly to the libxml2 people, because XML::LibXML simply accepts what is provided by the C-layer.

Of course this is not related to your report ;-)

Best
Christian
From: andre.lang [...] webrausch.de
Hi Christian,

thanks for clarification. I will further investigate that UTF-8 issue before reporting it to the libxml2 developers, as I'm not sure if it's really a bug. Anyway, as I found a workaround, it's not a real pain for me now.

What really bugs me are these memory leaks, as my program is long-running, processing thousands of documents. If I can't get rid of these, I will have a *lot* of work rewriting my program to fork or restart itself from time to time (which btw I don't think is really good programming style...) so any help on this one is really appreciated.

Yours,
André
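For readers weighing the fork-based fallback mentioned above, here is a minimal sketch of what it might look like. It is not taken from the ticket: the names process_document and process_in_batches are hypothetical, and process_document only stands in for the real parsing work. The idea is that each batch runs in a short-lived child process, so any memory leaked during parsing is released when the child exits.

```perl
#!/usr/bin/perl
# Sketch only: one short-lived child process per batch of documents.
# All subroutine names are hypothetical and not part of XML::LibXML.
use strict;
use warnings;

# Hypothetical stand-in for the real per-document work
# (for example a call to parse_html_string).
sub process_document {
    my ($file) = @_;
    return length $file;
}

# Run process_document() on @files in batches of $batch_size,
# one child process per batch. Returns the number of failed batches.
sub process_in_batches {
    my ($batch_size, @files) = @_;
    my $failed = 0;
    while (my @batch = splice @files, 0, $batch_size) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, then exit so leaked memory is released.
            process_document($_) for @batch;
            exit 0;
        }
        # Parent: wait for the child before starting the next batch.
        waitpid($pid, 0);
        $failed++ if $? != 0;
    }
    return $failed;
}

my $failed = process_in_batches(2, qw(a.html b.html c.html d.html e.html));
print "failed batches: $failed\n";
```

The batch size trades fork overhead against peak memory: larger batches fork less often but let leaked memory accumulate longer in each child. Results would have to be passed back to the parent through files or pipes, which this sketch leaves out.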
Hi André,

On Tue Aug 21 16:25:56 2012, andre.lang@webrausch.de wrote:
> Hi Shlomi,
>
> I'm using XML::LibXML in an application where I process a lot of HTML
> webpages. Some of these don't contain valid XML, so I set recover => 2.

thanks for the bug report. I'm going to investigate it now.

Regards,

-- Shlomi Fish
Hi André,

I've run into a problem running your test program because it uses a module called "Devel::Leak" which is nowhere to be found:

https://metacpan.org/module/Devel::Leak

I'm sorry it took me so long to look into it, but can you please clarify this obstacle now?

Regards,

-- Shlomi Fish
On Wed Aug 22 08:37:08 2012, SHLOMIF wrote:
> [...] a module called "Devel::Leak" which is nowhere to be found:
>
> https://metacpan.org/module/Devel::Leak

Oh, I see:

http://search.cpan.org/dist/Devel-Leak/

I guess it's another MetaCPAN bug. :-(

Regards,

-- Shlomi Fish
From: andre.lang [...] webrausch.de
Hi Shlomi,

thank you for investigating the issue. You can find more information on where exactly the leaking variables are coming from by using the Test::LeakTrace::Script module from CPAN:

http://search.cpan.org/~gfuji/Test-LeakTrace-0.14/lib/Test/LeakTrace/Script.pm

In my test script, just comment out all the Devel::Leak lines, so that only the leaking function 'make_trouble' is called.

Then, start the script using:

perl -MTest::LeakTrace::Script=-verbose libxml2.pl

This will give you the exact line numbers in the source where the leak is happening, along with its type, last value, and reference count.

Hope this helps.

Yours,
André
Hi André,

On Thu Aug 23 11:13:53 2012, andre.lang@webrausch.de wrote:
> Then, start the script using:
>
> perl -MTest::LeakTrace::Script=-verbose libxml2.pl
>
> This will give you the exact line numbers in the source where the leak
> is happening, along with its type, last value, and reference count.

The problem with Test::LeakTrace::Script is that it complains about many things even without a call to make_trouble() so I don't know what to make of it.

Regards,

-- Shlomi Fish
From: andre.lang [...] webrausch.de
Hi Shlomi,

the sad truth is that there are a lot of memory bugs even in the Perl core libraries which only now come to light. For example, from somewhere around Perl 5.8 to 5.14.2 there were leaks in each and every regular expression using character classes or capturing groups.

Most of the leaks, however, happen only once during the initialization of the libraries, at the "use" stage. These static ones are not problematic, but those that are recurring, permanently eating up memory, are.

To find the ones caused by XML::LibXML, just redirect the error output to a file ('... 2> out.txt') and search for lines containing "LibXML" with a text editor.

Another way is to trace just the "make_trouble" function by calling it with:

use Test::LeakTrace;
...
leaktrace { make_trouble(); } -verbose;

Test::LeakTrace is the base of Test::LeakTrace::Script, and you can use it to trace only the leaks in one block. The downside is that it may give false alarms when the leaktrace block alters variables which have been defined before the call to leaktrace.

If there is something I can do to help, such as giving you a filtered output of LeakTrace, please let me know.

Yours,
André
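The block-scoped tracing described above could be sketched roughly as follows. This is an illustration rather than anything posted in the ticket: the inline $html string is a hypothetical stand-in for libxml-trouble-sample.html, and the warm-up call is an assumption meant to keep one-time initialization out of the count. leaked_count and leaktrace are functions exported by Test::LeakTrace.

```perl
#!/usr/bin/perl
# Sketch only: trace leaks in a single block with Test::LeakTrace.
use strict;
use warnings;
use Test::LeakTrace qw(leaked_count leaktrace);
use XML::LibXML;

# Hypothetical broken input standing in for libxml-trouble-sample.html.
my $html = '<html><body><unknowntag><p>unclosed</body>';

sub make_trouble {
    my ($str) = @_;
    my $parser = XML::LibXML->new;
    # eval keeps a parse failure from aborting the whole run.
    my $doc = eval {
        $parser->parse_html_string($str, { recover => 2, encoding => 'UTF-8' });
    };
    return;
}

# Warm-up call (an assumption of this sketch): one-time allocations made
# on first use should not be reported as leaks of the traced block.
make_trouble($html);

# Number of SVs that were created inside the block and still exist after it.
my $count = leaked_count { make_trouble($html) };
print "leaked SVs: $count\n";

# Verbose report on STDERR: file, line, and a dump of each leaked SV.
leaktrace { make_trouble($html) } -verbose;
```

Because everything the block touches is created inside make_trouble(), this layout avoids the false alarms that come from modifying variables defined before the traced block.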
I'm a bit puzzled about this bug report. I tried to reproduce the leak with your test script.

First of all, I noticed that passing { recover => 2 } to parse_html_string has the same effect as { recover => 1 }, i.e. warnings are not suppressed. I filed bug #93429 for this issue:

https://rt.cpan.org/Ticket/Display.html?id=93429

I'm pretty sure that this bug is already present in XML::LibXML version 2.004 which you said you were using.

Then your test script has some flaws which make it report false positives as well as false negatives.

1. When you set { recover => 0 } in your test script, there won't be any reported leaks because the script simply dies. You have to catch any exceptions, then clear $@.

2. When an error is generated for the first time, some error classes are initialized which results in the allocation of some global variables. To avoid these reports, you should run the make_trouble() function once before testing for leaks.

3. It seems that reading the HTML file leaks a few IVs. This doesn't seem to be related to XML::LibXML.

See the attachment for my version of the test script. When I run it with

- Ubuntu 13.10
- XML::LibXML 2.0010
- libxml2 2.9.1

I get no leaks with 'recover' set to 1 or 2, but I get a leak when 'recover' is set to 0. I'll post a separate bug report about the latter issue.

Nick
Subject: libxml2.pl
#!/usr/bin/perl
# Feeding XML::LibXML with an invalid file, triggering memory leaks
use strict;
use warnings;
use Devel::Leak;
use XML::LibXML;

check_libxml_memory();

sub check_libxml_memory {
    # Tries to load a bad XML file into XML::LibXML
    my $filenameIn = 'libxml-trouble-sample.html';
    local $/;    # Read whole file
    open(my $FILEIN, '<:utf8', $filenameIn)
        or die "Can't read file '$filenameIn' [$!]\n";
    my $str = <$FILEIN>;
    close($FILEIN);

    make_trouble($str);    # Run once to initialize stuff.

    my $handle;
    my $leaveCount = 0;
    my $enterCount = Devel::Leak::NoteSV($handle);
    print STDERR "ENTER: $enterCount SVs\n";

    make_trouble($str);    # Trace how loading a bad doc affects memory

    $leaveCount = Devel::Leak::CheckSV($handle);
    print STDERR "\nLEAVE: $leaveCount SVs\n";
}

sub make_trouble {
    my $str = shift;
    my $parser = XML::LibXML->new(recover => 2);
    eval {
        my $doc = $parser->parse_html_string($str, { encoding => 'UTF-8', });
    };
    $@ = undef;
}

1;