
This queue is for tickets about the XML-TreePP CPAN distribution.

Report information
The Basics
Id: 42441
Status: resolved
Priority: 0/
Queue: XML-TreePP

People
Owner: Nobody in particular
Requestors: mendoza [...] pvv.ntnu.no
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 0.36
Fixed in: (no value)



Subject: XML::TreePP parsing regexp throws an error when file contains certain amount of tags
This is the part of the regexp in XML::TreePP that makes perl segfault (or die with a warning in 5.10):

wget -q -O - 'http://www.aggieathletics.com/sports/m-footbl/tam-m-footbl-body.html' | perl -wle 'my $q = chr(39); my $in; { local $/; $in = <>; } while ($in =~ m{ < ([^\!\?\s<>](?:"[^"]*"|$q[^$q]*$q|[^"$q<>])*) > }sxg) {}'
Segmentation fault

(regexp taken from XML::TreePP)

(5.10 dies with: Complex regular subexpression recursion limit (32766) exceeded at -e line 1, <> line 1.)
RT-Send-CC: xml-treepp [...] yahoogroups.com
Thank you for reporting this. It is a problem that Perl 5.005 through 5.8.8 dies as well. I like this quite SIMPLE regexp, though...

< ( [^\!\?\s<>] (?: "[^"]*" | '[^']*' | [^"'<>] )* ) >

tam-m-footbl-body.html does not seem to be that complex, and XML::TreePP works fine with much larger XML files. Do we need to give up using the 'sxg' modifiers? Anybody...

On 2009/01/15 10:11:57, NICOMEN wrote:
> This is the part of the regexp in XML::TreePP that makes perl segfault
> (or die with a warning in 5.10)
>
> wget -q -O - 'http://www.aggieathletics.com/sports/m-footbl/tam-m-footbl-body.html' | perl -wle 'my $q = chr(39); my $in; { local $/; $in = <>; } while ($in =~ m{ < ([^\!\?\s<>](?:"[^"]*"|$q[^$q]*$q|[^"$q<>])*) > }sxg) {}'
> Segmentation fault
>
> (regexp taken from XML::TreePP)
> (5.10 dies with: Complex regular subexpression recursion limit (32766)
> exceeded at -e line 1, <> line 1.)
wget -O footbl.html 'http://www.aggieathletics.com/sports/m-footbl/tam-m-footbl-body.html'
head -1850 footbl.html | perl -wle '
    local $/;
    my $q = chr(39);
    my $in = <>;
    my $c = 0;
    $c++ while ($in =~ m{ < ([^\!\?\s<>](?:"[^"]*"|$q[^$q]*$q|[^"$q<>])*) > }sxg);
    print $c, "\n";'
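One possible direction for the regexp, offered only as a sketch and not as what XML::TreePP does: the original group matches plain characters one at a time, so the `*` iterates once per character of a long tag. Consuming runs of plain characters with a possessive quantifier (Perl 5.10 or later) cuts the number of iterations to roughly one per run or quoted string. The names `$orig`, `$runs`, and `tags` below are made up for illustration, and whether this actually avoids the crash on the affected Perl versions has not been verified here.

```perl
use strict;
use warnings;

# Pattern from the ticket: the group iterates once per plain character.
my $orig = qr{ < ( [^\!\?\s<>] (?: "[^"]*" | '[^']*' | [^"'<>] )* ) > }sx;

# Hypothetical variant (assumption, not the module's code): plain characters
# are consumed as a possessive run, so the group iterates once per run.
my $runs = qr{ < ( [^\!\?\s<>] (?: "[^"]*" | '[^']*' | [^"'<>]++ )* ) > }sx;

# Collect the captured tag bodies for a given pattern.
sub tags {
    my ( $re, $text ) = @_;
    my @found;
    push @found, $1 while $text =~ /$re/g;
    return @found;
}

my $sample = q{<?xml version="1.0"?><a href="x>y" title='<b>'>text</a><!-- c --><br/>};
my @a = tags( $orig, $sample );
my @b = tags( $runs, $sample );

# Both patterns should capture the same tag bodies on this sample.
print join( "\n", @b ), "\n";
print "same result\n" if "@a" eq "@b";
```

The quoted-string alternatives and the plain-character alternative start with different characters, so making the run possessive should not change which tags match; it only removes backtracking points inside the run.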
RT-Send-CC: nicomen [...] cpan.org
Hi Yusuke,

I tried to work out a patch for this problem, but looking at the svn version on googlecode, it seems the patch is too simplistic. The test suite fails with the patch applied.

The idea behind it is to "do something", even if it's dumb, to prevent segfaults. It works for us, but not for your cases where you don't have an XML prologue.

Take a look at it; it's in the RT42441.patch file. As I said, this patch cannot be applied as it is, because it makes the test suite fail.

We also wrote a new test case which happens to crash XML::TreePP. I believe this is very useful if you want to dedicate some time to fixing this.

We might have some time to fix it ourselves, or we will have to drop XML::TreePP and use other modules. Having random segfaults in production is not an option :)

Hope this helps in fixing the ticket.

Message body is not shown because it is too large.

--- /usr/share/perl5/XML/TreePP.pm.dist 2009-02-18 09:49:22.000000000 +0000
+++ /usr/share/perl5/XML/TreePP.pm 2009-02-18 09:53:40.000000000 +0000
@@ -698,6 +698,11 @@
         return $self->die( "Tie::IxHash is required." ) unless &load_tie_ixhash();
     }
 
+    # Avoid segfaults when receving random input (RT #42441)
+    if ( ! looks_like_xml(\$text) ) {
+        return;
+    }
+
     my $flat = $self->xml_to_flat(\$text);
     my $class = $self->{base_class} if exists $self->{base_class};
     my $tree = $self->flat_to_tree( $flat, '', $class );
@@ -1062,10 +1067,19 @@
     $text;
 }
 
+sub looks_like_xml {
+    my $textref = shift;
+    my $args = ( $$textref =~ /^(?:\s*\xEF\xBB\xBF)?\s*<\?xml(\s+\S.*)\?>/s )[0];
+    if ( ! $args ) {
+        return;
+    }
+    return $args;
+}
+
 sub xml_decl_encoding {
     my $textref = shift;
     return unless defined $$textref;
-    my $args = ( $$textref =~ /^(?:\s*\xEF\xBB\xBF)?\s*<\?xml(\s+\S.*)\?>/s )[0] or return;
+    my $args = looks_like_xml($textref) or return;
     my $getcode = ( $args =~ /\s+encoding=(".*?"|'.*?')/ )[0] or return;
     $getcode =~ s/^['"]//;
     $getcode =~ s/['"]$//;
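To make the limitation of the patch concrete, here is the same declaration check copied into a standalone script (a sketch for illustration only; the sample documents `$with_decl` and `$without_decl` are made up). It shows why input without an XML prologue is rejected, which is presumably what breaks the module's own tests.

```perl
use strict;
use warnings;

# Same check as in the patch: returns the attributes of the XML declaration,
# or nothing when the text does not start with one.
sub looks_like_xml {
    my $textref = shift;
    my $args = ( $$textref =~ /^(?:\s*\xEF\xBB\xBF)?\s*<\?xml(\s+\S.*)\?>/s )[0];
    if ( !$args ) {
        return;
    }
    return $args;
}

# Hypothetical sample inputs.
my $with_decl    = qq{<?xml version="1.0" encoding="UTF-8"?>\n<root><item>1</item></root>};
my $without_decl = qq{<root><item>1</item></root>};

my $ok_with    = looks_like_xml( \$with_decl )    ? 1 : 0;
my $ok_without = looks_like_xml( \$without_decl ) ? 1 : 0;

print "with declaration:    ", ( $ok_with    ? "parsed" : "skipped" ), "\n";
print "without declaration: ", ( $ok_without ? "parsed" : "skipped" ), "\n";
```

Well-formed XML without a declaration is skipped just like random HTML, so the check trades correctness on valid input for protection against the crash.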
RT-Send-CC: nicomen [...] cpan.org
Version 0.41 released.
http://www.kawa.net/works/perl/treepp/treepp-e.html

A require_xml_decl option has been added to avoid the segmentation fault. For backward compatibility the check is not enabled by default; it only takes effect when the option is set.

Thanks.

See also: https://rt.cpan.org/Ticket/Display.html?id=42441
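A minimal sketch of how the option might be used, assuming XML::TreePP 0.41 or later is installed and that `require_xml_decl` is set through `set()` like the module's other options. The sample documents are made up, and the script does nothing if the module is missing. Each parse is wrapped in `eval` because the parser is expected to report an error when the declaration is absent.

```perl
use strict;
use warnings;

# Load the module only if a new enough version is available (assumption: 0.41+).
my $have_treepp = eval { require XML::TreePP; XML::TreePP->VERSION(0.41); 1 };

my @results;
if ($have_treepp) {
    my $tpp = XML::TreePP->new();
    $tpp->set( require_xml_decl => 1 );    # off by default for backward compatibility

    # Hypothetical inputs: one with an XML declaration, one without.
    my @docs = (
        '<?xml version="1.0"?><root><item>1</item></root>',
        '<html><body>not XML</body></html>',
    );
    for my $doc (@docs) {
        my $tree = eval { $tpp->parse($doc) };
        push @results, defined $tree ? 'parsed' : 'rejected';
    }
}
print "@results\n";
```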