Subject: Deep recursion failures in $tree->delete
Date: Sun, 8 Jul 2018 11:27:30 -0400
To: bug-HTML-Tree [...] rt.cpan.org
From: Timothe Litt <tlhackque [...] cpan.org>
Symptom:
bug> perl treebug.pl
Deep recursion on subroutine "HTML::Element::delete" at C:/Perl64/lib/HTML/Element.pm line 567.
Deep recursion on subroutine "HTML::Element::delete_content" at C:/Perl64/lib/HTML/Element.pm line 580.
I have no control over the HTML, and am processing thousands of files.
The failure is reproducible with a given file, but can be very sensitive
to the environment.
Diagnosis:
I have reduced one failing case to a small HTML file, shown below. It
reproduces the failure 100% of the time.
If you delete one <A>, the failure disappears.
If you run under the debugger (perl -d) and simply enter c<cr>, the
failure disappears.
As best I can tell, this is due to malformed HTML, which was originally
generated in 1999 by MS Word 97 and widely distributed by another major
software vendor. It is not an artificial construct.
There are no closing tags on the <A> elements. (!) This produces an
extreme right-leaning tree, and delete attempts to delete it using
recursion, at some point hitting the Perl sanity limit (the "Deep
recursion" warning, which perl issues once a subroutine is nested 100
calls deep).
Malformed or not, if HTML::Tree can create a structure, it shouldn't
have trouble deleting it.
Options:
It seems to me that there are several choices:
- Limit the depth of structures that HTML::Tree will build. This would
  make some HTML unparsable.
- Convert the deletion algorithm to an iterative one. Something like:
  walk iteratively to a leaf node, save the parent link, delete the
  leaf. Then repeat for the parent link until you reach the top. This
  keeps the function, but would reduce the stack usage to reasonable
  levels for all inputs.
- Make delete() into a NoOp. Since you state that it's not necessary
  due to the use of weak references, this may be the best option.
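To make the second option concrete, here is a minimal sketch of the leaf-walk on a stand-in data structure. It is not HTML::Element's code: the node layout (a hash with tag, a weakened parent link, and a content array) and the names make_chain and delete_iteratively are assumptions made up for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Hypothetical stand-in for a parsed tree (NOT HTML::Element): each node
# is a hash with a weakened parent link and an array of children.
# Builds a right-leaning chain $depth levels below the root.
sub make_chain {
    my ($depth) = @_;
    my $root = { tag => 'root', parent => undef, content => [] };
    my $node = $root;
    for ( 1 .. $depth ) {
        my $child = { tag => 'a', parent => $node, content => [] };
        weaken( $child->{parent} );
        push @{ $node->{content} }, $child;
        $node = $child;
    }
    return $root;
}

# Walk down to a leaf, remember its parent, empty the leaf, then resume
# from the parent. No subroutine calls itself, so the call depth stays
# constant however deep the tree is. Returns the number of nodes emptied.
sub delete_iteratively {
    my ($root) = @_;
    my $node   = $root;
    my $count  = 0;
    while ( defined $node ) {
        my $kids = $node->{content};
        if ( $kids && @$kids ) {
            if ( ref $kids->[-1] ) {
                $node = $kids->[-1];    # descend into the last child
                next;
            }
            pop @$kids;                 # plain text child: just drop it
            next;
        }
        # $node is now a leaf. Save the parent link before emptying it.
        my $parent = $node == $root ? undef : $node->{parent};
        pop @{ $parent->{content} } if $parent;    # leaf is the last child
        %$node = ();
        $count++;
        $node = $parent;
    }
    return $count;
}

my $tree    = make_chain(5000);
my $deleted = delete_iteratively($tree);
print "emptied $deleted nodes\n";
```

The chain here is far deeper than the test case below, and the teardown never nests calls, so under these assumptions nothing should trigger the deep recursion warning.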
For now, the workaround seems to be not calling delete(). However, when
a long-running process is involved, it is reassuring to call delete
"just in case" a circular reference was missed in the weakening.
I want to emphasize that the apparently "trivial" test case below is the
result of an effort to understand these failures in real-world
documents. This is not a contrived issue. (I can supply a sample
off-line if desired.)
Below find:
- A test program
- The smallest test case that produces the problem
- Version information.
Here is the test program:
use warnings;
use strict;
use HTML::TreeBuilder;
use Data::Dumper; $Data::Dumper::Sortkeys=1;
my $file = 'treebug.raw.html';
$file = 'treebug.html';
my $dump = 0;
my $data;
{
    open( my $fh, '<', $file) or die $!;
    local $/;
    $data = <$fh>;
    close $fh;
}
die( "no data" ) unless( $data );
my $tree = HTML::TreeBuilder->new;
$tree->parse_content( $data );
if( $dump ) {
    open my $fh, '>', 'treebug.dump' or die $!;
    print $fh Dumper( $tree );
    close $fh;
}
$tree->delete;
exit;
And here is the test file (105 lines):
<HTML>
<BODY>
<TABLE>
<TR><TD>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
<A>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>
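If it helps with finding the exact threshold, a file of this shape can be generated with any number of <A> elements. This is only a sketch; the function name make_test_html is made up here, and 96 is the number of <A> lines in the file above.

```perl
use strict;
use warnings;

# Build a document shaped like the test file: $count unclosed <A> tags
# inside a single table cell. (Hypothetical helper, for illustration.)
sub make_test_html {
    my ($count) = @_;
    return join "\n",
        '<HTML>', '<BODY>', '<TABLE>', '<TR><TD>',
        ('<A>') x $count,
        '</TD>', '</TR>', '</TABLE>', '</BODY>', '</HTML>', '';
}

my $html       = make_test_html(96);
my $line_count = () = $html =~ /\n/g;    # count the newlines
print "$line_count lines\n";
```

Writing the result to treebug.html with a smaller or larger count should show where the warning starts to appear in a given environment.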
HTML::Element version:
x $HTML::Element::VERSION
0 5.07
And Perl version:
perl -V
Summary of my perl5 (revision 5 version 26 subversion 1) configuration:
Platform:
osname=MSWin32
osvers=6.1
archname=MSWin32-x64-multi-thread
uname=''
config_args='undef'
hint=recommended
useposix=true
d_sigaction=undef
useithreads=define
usemultiplicity=define
use64bitint=define
use64bitall=undef
uselongdouble=undef
usemymalloc=n
default_inc_excludes_dot=define
bincompat5005=undef
Compiler:
cc='C:\Perl64\site\bin\gcc.exe'
ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE
-DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT
-DPERL_IMPLICIT_SYS -fwrapv -fno-strict-aliasing -mms-bitfields'
optimize='-s -O2'
cppflags='-DWIN32'
ccversion=''
gccversion='4.6.3'
gccosandvers=''
intsize=4
longsize=4
ptrsize=8
doublesize=8
byteorder=12345678
doublekind=3
d_longlong=define
longlongsize=8
d_longdbl=define
longdblsize=16
longdblkind=3
ivtype='long long'
ivsize=8
nvtype='double'
nvsize=8
Off_t='long long'
lseeksize=8
alignbytes=8
prototype=define
Linker and Libraries:
ld='C:\Perl64\site\bin\g++.exe'
ldflags ='-s -static-libgcc -static-libstdc++ -L"C:\Perl64\lib\CORE"
-L"C:\MinGW\x86_64-w64-mingw32\lib"'
libpth=C:\MinGW\x86_64-w64-mingw32\lib
libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32
-ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr
-lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool
-lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid
-lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32
libc=
so=dll
useshrplib=true
libperl=libperl526.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs
dlext=dll
d_dlsymun=undef
ccdlflags=' '
cccdlflags=' '
lddlflags='-mdll -s -static-libgcc -static-libstdc++
-L"C:\Perl64\lib\CORE" -L"C:\MinGW\x86_64-w64-mingw32\lib"'
Characteristics of this binary (from libperl):
Compile-time options:
HAS_TIMES
HAVE_INTERP_INTERN
MULTIPLICITY
PERLIO_LAYERS
PERL_COPY_ON_WRITE
PERL_DONT_CREATE_GVSV
PERL_IMPLICIT_CONTEXT
PERL_IMPLICIT_SYS
PERL_MALLOC_WRAP
PERL_OP_PARENT
PERL_PRESERVE_IVUV
USE_64_BIT_INT
USE_ITHREADS
USE_LARGE_FILES
USE_LOCALE
USE_LOCALE_COLLATE
USE_LOCALE_CTYPE
USE_LOCALE_NUMERIC
USE_LOCALE_TIME
USE_PERLIO
USE_PERL_ATOF
USE_SITECUSTOMIZE
Locally applied patches:
ActivePerl Build 2601 [404865]
Built under MSWin32
Compiled at Dec 11 2017 12:23:25
@INC:
C:/Perl64/site/lib
C:/Perl64/lib