
This queue is for tickets about the Archive-Zip CPAN distribution.

Report information
The Basics
Id: 13938
Status: open
Priority: 0/
Queue: Archive-Zip

People
Owner: Nobody in particular
Requestors: dma_k [...] mail.ru
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: 1.15
Fixed in: (no value)



Subject: Archive::Zip::MemberRead + XML::Parser
I have the following problem: I've got ZIP archives with huge XML files I need to parse chunk by chunk. I am trying to use Archive::Zip::MemberRead together with XML::Parser. Unfortunately, the stream is always unparseable:

=== code ===
use XML::Twig; # Runs on top of XML::Parser
use Archive::Zip;
use Archive::Zip::MemberRead;

my $catalog_parser = XML::Twig->new(...);
my $zip = Archive::Zip->new(...);
die "Failed to open ZIP file: $!" unless $zip;

foreach ($zip->members())
{
    $catalog_parser->parse($_->readFileHandle());
}
=== end of code ===

=== output ===
not well-formed (invalid token) at line 1, column 24, byte 24 at /usr/lib/perl5/vendor_perl/i386-linux/XML/Parser.pm line 187
=== end of output ===

The solution was suggested by Sreeji K Das: before calling XML::Parser->parse(), do:

@Archive::Zip::MemberRead::ISA = qw( IO::Handle );

After that, the Perl script started to SEGFAULT. I have attached the strace log.
open("./mirrors/its/hoteldetails/its_hoteldetails.zip", O_RDONLY|O_LARGEFILE) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfffeb38) = -1 ENOTTY (Inappropriate ioctl for device)
_llseek(4, 0, [0], SEEK_CUR) = 0
fstat64(4, {st_mode=S_IFREG|0644, st_size=459208, ...}) = 0
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
_llseek(4, 0, [0], SEEK_SET) = 0
_llseek(4, 0, [0], SEEK_CUR) = 0
read(4, "PK\3\4\24\0\2\0\10\0U}\3562V!\36\210F\1\7\0\205\370\350"..., 4096) = 4096
_llseek(4, 30, [30], SEEK_SET) = 0
_llseek(4, 0, [30], SEEK_CUR) = 0
_llseek(4, 16, [46], SEEK_CUR) = 0
_llseek(4, 0, [46], SEEK_CUR) = 0
_llseek(4, 46, [46], SEEK_SET) = 0
_llseek(4, 0, [46], SEEK_CUR) = 0
brk(0) = 0x8d82000
brk(0x8da4000) = 0x8da4000
read(4, "\354\235Yo\343F\266\307\337\363)j\362\320\230\301\264%"..., 4096) = 4096
read(4, "\20HE\267\244\342\375\242\330\t\230\v\5.NG\5\243\330I\230"..., 4096) = 4096
read(4, "\373\260\322n|\21P\312\230@\372\255\205\252\244;\7\32x"..., 4096) = 4096
read(4, ":\372\342\320oN8#\234Q\17g\234[\266\250/\26\277\233\372"..., 4096) = 4096
read(4, "\352u-\267O\265\257}\25\326 \251M\254\37Mm4\345?\313\332"..., 4096) = 4096
read(4, "\226\350\250H , , , , , , , , \254\27\10\213_\36\2521\300"..., 4096) = 4096
read(4, "\'\351\2\2728\305\342\276M\375\30V;\205J{\232\322 \314"..., 4096) = 4096
read(4, "\202P\27\345\251\213\27M\324h\220\34\274\216\304\367,\351"..., 4096) = 4096
brk(0) = 0x8da4000
brk(0x8dda000) = 0x8dda000
brk(0) = 0x8dda000
brk(0) = 0x8dda000
brk(0x8dcb000) = 0x8dcb000
brk(0) = 0x8dcb000
mmap2(NULL, 524288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x407e6000
brk(0) = 0x8dcb000
brk(0) = 0x8dcb000
brk(0x8dac000) = 0x8dac000
brk(0) = 0x8dac000
mremap(0x407e6000, 524288, 1048576, MREMAP_MAYMOVE) = 0x407e6000
mremap(0x407e6000, 1048576, 2097152, MREMAP_MAYMOVE) = 0x407e6000
mmap2(NULL, 1257472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x409e6000
munmap(0x407e6000, 2097152) = 0
mmap2(NULL, 1257472, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x407e6000
munmap(0x409e6000, 1257472) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
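A possible reading of the original "not well-formed (invalid token) at line 1, column 24, byte 24" error, offered as an assumption and not confirmed by anyone in this ticket: if XML::Parser does not recognise its argument as a filehandle (for example because the object is not an IO::Handle), it may fall back to treating the argument as a string, in which case the object's stringified form is what gets parsed. The core-Perl sketch below only shows where the first non-name character of that stringified form would sit. The object is a stand-in blessed into the class name; Archive::Zip itself is not loaded.

```perl
use strict;
use warnings;

# Stand-in object for illustration only: blessed into the class name
# from this report, without IO::Handle anywhere in its inheritance chain.
my $handle_like = bless {}, 'Archive::Zip::MemberRead';

# Stringified form looks like "Archive::Zip::MemberRead=HASH(0x...)".
my $as_string = "$handle_like";

# Is the object recognisable as an IO::Handle?
my $is_io = $handle_like->isa('IO::Handle') ? 1 : 0;

# Zero-based offset of the '=' that follows the class name.
my $offset_of_equals = index( $as_string, '=' );

print "isa IO::Handle: $is_io\n";
print "offset of '=': $offset_of_equals\n";
```

The class name is 24 characters long, so the "=" lands at offset 24, which matches the reported byte position. That would also be consistent with the error going away once IO::Handle is added to @ISA, but it says nothing about the later segfault.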
From: Sreeji [sreeji at gmail.com]
[guest - Sun Jul 31 10:53:17 2005]:
> I have the following problem: I've got ZIP archives with huge XML
> files I need to parse chunk by chunk. I am trying to use
> Archive::Zip::MemberRead togather with XML::Parser. Unfortunately
> the stream is always unparseable:
<snip>

I can't reproduce the issue using XML::Parser. Can you try to reproduce it using XML::Parser alone (no XML::Twig: it has too many prerequisites and I can't download them all right now), and upload the code and the relevant (small) ZIP files you used?
From: dma_k [...] mail.ru
Here goes the sample script. It is very simple. The test ZIPs follow.

Notes:
1) The files in ZIPs are known to be parsed with
   $catalog_parser->parsefile('ltupl_hoteldetails');
   (no ZIP processing) w/o any error.
2) The ZIPs are known to be read and parsed with
   $catalog_parser->parse($_->contents());
   (parsing the whole in-memory contents) w/o any error

The
$ ln -s test_1.zip test.zip
$ ./test.pl
Processing XML: ltupl_hoteldetails.
*** glibc detected *** free(): invalid next size (normal): 0x083b28a8 ***
Aborted

$ ln -sf test_2.zip test.zip
$ ./test.pl
Processing XML: ltupl_hoteldetails.
Segmentation fault

Additional info:
perl: v5.8.7 built for i386-linux-thread-multi
perl-XML-Parser-2.34
libexpat-1.95.8
glibc-2.3.5
#!/usr/bin/perl
use strict;
use XML::Parser;
use Archive::Zip;
use Archive::Zip::MemberRead;

@Archive::Zip::MemberRead::ISA = qw( IO::Handle );

my $catalog_parser = new XML::Parser();
#$catalog_parser->parsefile('ltupl_hoteldetails'); exit;

my $zip = Archive::Zip->new('test.zip');
die "Failed to open ZIP file 'test.zip': $!" unless $zip;

foreach ($zip->members())
{
    print "Processing XML: " . $_->fileName() . ".\n";
    #$catalog_parser->parse($_->contents());
    $catalog_parser->parse($_->readFileHandle());
}
From: dma_k [...] mail.ru
First test ZIP.
Download test_1.zip
application/zip 2.1k


From: dma_k [...] mail.ru
Second test ZIP.
Download test_2.zip
application/zip 11.2k


[guest - Mon Aug 1 08:47:53 2005]:
> Here goes the sample script. It is very simple. The test ZIPs follow.
> Notes:
> 1) The files in ZIPs are known to be parsed with
> $catalog_parser->parsefile('ltupl_hoteldetails');
> (no ZIP processing) w/o any error.
> 2) The ZIPs are known to be read and parsed with
> $catalog_parser->parse($_->contents());
> (parsing the whole in-memory contents) w/o any error
>
> The
> $ ln -s test_1.zip test.zip
> $ ./test.pl
> Processing XML: ltupl_hoteldetails.
> *** glibc detected *** free(): invalid next size (normal): 0x083b28a8
> ***
> Aborted
Looking at the back trace, the problem appears to be in XML::Parser::Expat somewhere, rather than in Perl or Archive::Zip. The problem where you needed to add ISA = qw(IO::Handle) is something I'll have to look into a bit more.

#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7dc856d in raise () from /lib/tls/i686/cmov/libc.so.6
#2 0xb7dc9cf3 in abort () from /lib/tls/i686/cmov/libc.so.6
#3 0xb7dfb376 in __fsetlocking () from /lib/tls/i686/cmov/libc.so.6
#4 0xb7e015f5 in malloc_trim () from /lib/tls/i686/cmov/libc.so.6
#5 0xb7e01969 in free () from /lib/tls/i686/cmov/libc.so.6
#6 0xb7bf40fc in myfree () from /usr/local/lib/perl/5.8.7/auto/XML/Parser/Expat/Expat.so
#7 0xb7bd6a20 in XML_ParserFree () from /usr/lib/libexpat.so.1
#8 0xb7bfca66 in XS_XML__Parser__Expat_ParserFree () from /usr/local/lib/perl/5.8.7/auto/XML/Parser/Expat/Expat.so
#9 0x080c424d in Perl_pp_entersub ()
#10 0x080bd07b in Perl_runops_standard ()
#11 0x08060cfe in Perl_get_cv ()
#12 0x0806470e in Perl_call_sv ()
#13 0x080c6c3d in Perl_sv_clear ()
#14 0x080c7443 in Perl_sv_free ()
#15 0x080c5bc3 in Perl_sv_add_arena ()
#16 0x080c5c0f in Perl_sv_clean_objs ()
#17 0x080664a4 in perl_destruct ()
#18 0x0805fdba in main ()
From: dma_k [...] mail.ru
> Looking at the back trace, the problem appears to be in
> XML::Parser::Expat somewhere, rather than in Perl or Archive::Zip.

OK, then how can you explain that:
$catalog_parser->parse($_->contents());
works fine, but:
$catalog_parser->parse($_->readFileHandle());
does not (see Note 2 in my previous post)?
[guest - Thu Aug 4 06:35:56 2005]:
> > Looking at the back trace, the problem appears to be in
> > XML::Parser::Expat somewhere, rather than in Perl or Archive::Zip.
>
> Ok, then how can you explain, that:
> $catalog_parser->parse($_->contents());
> works ok, but:
> $catalog_parser->parse($_->readFileHandle());
> does not (see Note[2] in my prev. post)?
Looking at the backtrace, it appears that XML::Parser::Expat is not handling the filehandle correctly at destruction. My guess is that it is confused by the conversion of Archive::Zip::MemberRead to an IO::Handle.

Since Archive::Zip contains no C or XS code, it cannot be ultimately responsible for a SEGV, although it could trigger one in another module or in the Perl core itself. Based on the backtrace, however, the SEGV seems to be occurring in XML::Parser::Expat. Also, since it occurs after the internals of Archive::Zip::MemberRead were manipulated from outside its code, I assume that the manipulation is contributing to it.

I will continue to look into the various issues you've raised and see if I can subclass Archive::Zip::MemberRead from IO::Handle. I probably will not look into the coredump directly, since it appears to be caused by manipulation of the module internals from outside the module.
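For anyone who needs chunk-by-chunk parsing in the meantime, one route that avoids both the @ISA override and loading a whole member with contents() is to pull decompressed chunks from the member with Archive::Zip's low-level read methods and push them into XML::Parser's non-blocking interface. This is a minimal sketch that nobody in this ticket proposed or tested against the attached ZIPs. The helper name parse_member_in_chunks and the default chunk size are hypothetical, and it assumes plain XML::Parser, not XML::Twig.

```perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
use XML::Parser;

# Hypothetical helper (not part of either module): feed one ZIP member
# to a non-blocking Expat parser, one decompressed chunk at a time.
sub parse_member_in_chunks {
    my ( $parser, $member, $chunk_size ) = @_;
    $chunk_size ||= 32768;    # arbitrary default, chosen for illustration

    # Ask readChunk() to hand back uncompressed data.
    $member->desiredCompressionMethod(COMPRESSION_STORED);
    my $status = $member->rewindData();
    die "rewindData failed with status $status" unless $status == AZ_OK;

    # parse_start() returns an XML::Parser::ExpatNB object.
    my $nb = $parser->parse_start();

    until ( $member->readIsDone() ) {
        my $buffer_ref;
        ( $buffer_ref, $status ) = $member->readChunk($chunk_size);
        die "readChunk failed with status $status"
            if $status != AZ_OK && $status != AZ_STREAM_END;
        $nb->parse_more($$buffer_ref);    # dies on malformed XML
    }

    $member->endRead();
    return $nb->parse_done();
}
```

A caller would create the parser with its handlers as usual, then call parse_member_in_chunks($parser, $member) for each entry returned by $zip->members(). Because no filehandle is passed to XML::Parser, the IO::Handle question raised above does not come up on this path.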
From: dma_k [...] mail.ru
SMPETERS, would you be so kind as to contact the XML::Parser::Expat author? I tried the email address from CPAN, but nobody replied. One more question: how do I become a registered user in this bug tracker?
Trying to clean up some RT tickets here. Is this still an issue? Does the latest revision fix the problem?
CC: SMPETERS [...] cpan.org
Subject: Re: [rt.cpan.org #13938] Archive::Zip::MemberRead + XML::Parser
Date: Thu, 19 Apr 2012 23:17:25 +0200
To: bug-Archive-Zip [...] rt.cpan.org
From: Dmitry Katsubo <dma_k [...] mail.ru>
On 19.04.2012 17:16, Brendan Byrd via RT wrote:
> Trying to clean up some RT tickets here. Is this still an issue? Does
> the latest revision fix the problem?

You're a hero: 7 years have passed since the ticket was opened. Unfortunately, the problem is still there. If you have the courage to look deeper into it, you're welcome to. Or perhaps you could contact the XML::Parser team: maybe they can provide some input. My logfile is attached ("Out of memory" for test_1.zip and "Segfault" for test_2.zip). Any ideas are welcome.

--
With best regards, Dmitry
