
This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 125792
Status: new
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: tlhackque [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Deep recursion failures in $tree->delete
Date: Sun, 8 Jul 2018 11:27:30 -0400
To: bug-HTML-Tree [...] rt.cpan.org
From: Timothe Litt <tlhackque [...] cpan.org>
Symptom:
bug> perl treebug.pl
Deep recursion on subroutine "HTML::Element::delete" at C:/Perl64/lib/HTML/Element.pm line 567.
Deep recursion on subroutine "HTML::Element::delete_content" at C:/Perl64/lib/HTML/Element.pm line 580.

I have no control over the HTML, and am processing thousands of files.

The failure is reproducible with a given file, but can be very sensitive to the environment.

Diagnosis:

I have reduced one failing case to a small HTML file, shown below.  It reproduces the failure 100% of the time.

If you delete one <A>, the failure disappears.

If you run perl -d & simply hit c<cr> - the failure disappears.

As best I can tell, this is due to malformed HTML, which was originally generated in 1999 by MS Word 97 and widely distributed by another major software vendor. It is not an artificial construct.

There are no closing tags on <A> elements. (!)  This produces an extreme right-leaning tree, and delete attempts to delete it using recursion, at some point hitting the Perl sanity limit.

Malformed or not, if HTML::Tree can create a structure, it shouldn't have trouble deleting it.

Options:

It seems to me that there are several choices:

- Limit the depth of structures that HTML::Tree will build.  This would make some
  HTML unparsable.
- Convert the deletion algorithm to an iterative one.  Something like: walk iteratively
  to a leaf node, save the parent link, delete the leaf. Then repeat for the parent link
  until you reach the top.  This keeps the function, but would reduce the stack usage
  to reasonable levels for all inputs.
- Make delete() into a NoOp.  Since you state that it's not necessary due to the use of
  weak references, this may be the best option.

For now, the workaround seems to be not calling delete().  However, when a long-running process is involved, it is reassuring to call delete "just in case" a circular reference was missed in the weakening.

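To illustrate the second option (iterative deletion), here is a minimal sketch. It deliberately does not use HTML::Element or its internals: the node layout (plain hashes with `parent` and `children` keys) and the names `make_chain` and `delete_iteratively` are assumptions made up for this example, so treat it as an outline of the idea rather than a patch.

```perl
use strict;
use warnings;

# Hypothetical node layout (NOT HTML::Element's internals): each node is a
# hash with a strong 'parent' link and a 'children' array reference.
# Build a right-leaning chain of $depth nested nodes under a root.
sub make_chain {
    my ($depth) = @_;
    my $root = { tag => 'root', parent => undef, children => [] };
    my $node = $root;
    for my $i ( 1 .. $depth ) {
        my $child = { tag => "a$i", parent => $node, children => [] };
        push @{ $node->{children} }, $child;
        $node = $child;
    }
    return $root;
}

# Tear a tree down without recursion: walk down to a leaf, remember its
# parent, unlink and empty the leaf, then continue from the parent until
# the starting node itself has been emptied.
sub delete_iteratively {
    my ($root) = @_;
    my $deleted = 0;
    my $node    = $root;
    while ($node) {
        if ( @{ $node->{children} } ) {
            # Not a leaf yet: keep descending via the last child.
            $node = $node->{children}[-1];
            next;
        }
        my $parent = $node->{parent};
        # The leaf is always its parent's last child, so pop removes it.
        pop @{ $parent->{children} } if $parent;
        %$node = ();               # drops the parent link, breaking the cycle
        $deleted++;
        last if $node == $root;    # never climb above the starting node
        $node = $parent;
    }
    return $deleted;
}

my $root    = make_chain(5000);
my $deleted = delete_iteratively($root);
print "deleted $deleted nodes\n";    # prints "deleted 5001 nodes"
```

The call depth stays constant however deep the tree is, so a chain far deeper than the 96 nested <A> elements in the test case is torn down without any "Deep recursion" warning.
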
I want to emphasize that the apparently "trivial" test case below is the result of an effort to understand these failures in real-world documents.  This is not a contrived issue.  (I can supply a sample off-line if desired.) Below find: - A test program - The smallest test case that produces the problem - Version information. Here is the test program: use warnings; use strict; use HTML::Treebuilder; use Data::Dumper; $Data::Dumper::Sortkeys=1; my $file = 'treebug.raw.html'; $file = 'treebug.html'; my $dump = 0; my $data; {     open( my $fh, '<', $file) or die $!;     local $/;     $data = <$fh>;     close $fh; }     die( "no data" ) unless( $data );     my $tree = HTML::TreeBuilder->new;     $tree->parse_content( $data ); if( $dump ) { open my $fh, '>', 'treebug.dump' or die $!; print $fh Dumper( $tree ); close $fh; }     $tree->delete; exit; And here is the test file (105 lines): <HTML> <BODY> <TABLE> <TR><TD> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> <A> </TD> </TR> </TABLE> </BODY> </HTML> HTML::Element version: x $HTML::Element::VERSION 0  5.07 And Perl version: perl -V Summary of my perl5 (revision 5 version 26 subversion 1) configuration:   Platform:     osname=MSWin32     osvers=6.1     archname=MSWin32-x64-multi-thread     uname=''     config_args='undef'     hint=recommended     useposix=true     d_sigaction=undef     useithreads=define     usemultiplicity=define     use64bitint=define     use64bitall=undef     uselongdouble=undef     usemymalloc=n     default_inc_excludes_dot=define     bincompat5005=undef   Compiler:     cc='C:\Perl64\site\bin\gcc.exe'     ccflags =' -s -O2 -DWIN32 -DWIN64 -DCONSERVATIVE 
-DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -fwrapv -fno-strict-aliasing -mms-bitfields'     optimize='-s -O2'     cppflags='-DWIN32'     ccversion=''     gccversion='4.6.3'     gccosandvers=''     intsize=4     longsize=4     ptrsize=8     doublesize=8     byteorder=12345678     doublekind=3     d_longlong=define     longlongsize=8     d_longdbl=define     longdblsize=16     longdblkind=3     ivtype='long long'     ivsize=8     nvtype='double'     nvsize=8     Off_t='long long'     lseeksize=8     alignbytes=8     prototype=define   Linker and Libraries:     ld='C:\Perl64\site\bin\g++.exe'     ldflags ='-s -static-libgcc -static-libstdc++ -L"C:\Perl64\lib\CORE" -L"C:\MinGW\x86_64-w64-mingw32\lib"'     libpth=C:\MinGW\x86_64-w64-mingw32\lib     libs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32     perllibs=-lmoldname -lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32     libc=     so=dll     useshrplib=true     libperl=libperl526.a     gnulibc_version=''   Dynamic Linking:     dlsrc=dl_win32.xs     dlext=dll     d_dlsymun=undef     ccdlflags=' '     cccdlflags=' '     lddlflags='-mdll -s -static-libgcc -static-libstdc++ -L"C:\Perl64\lib\CORE" -L"C:\MinGW\x86_64-w64-mingw32\lib"' Characteristics of this binary (from libperl):   Compile-time options:     HAS_TIMES     HAVE_INTERP_INTERN     MULTIPLICITY     PERLIO_LAYERS     PERL_COPY_ON_WRITE     PERL_DONT_CREATE_GVSV     PERL_IMPLICIT_CONTEXT     PERL_IMPLICIT_SYS     PERL_MALLOC_WRAP     PERL_OP_PARENT     PERL_PRESERVE_IVUV     USE_64_BIT_INT     USE_ITHREADS     USE_LARGE_FILES     USE_LOCALE     USE_LOCALE_COLLATE     USE_LOCALE_CTYPE     USE_LOCALE_NUMERIC     USE_LOCALE_TIME     USE_PERLIO     USE_PERL_ATOF     
USE_SITECUSTOMIZE   Locally applied patches:     ActivePerl Build 2601 [404865]   Built under MSWin32   Compiled at Dec 11 2017 12:23:25   @INC:     C:/Perl64/site/lib     C:/Perl64/lib