Bug #113293 for PDF-API2: Corrupted PDF produced by importPageIntoForm

Wed Mar 23 11:09:38 2016 melmothx [...] gmail.com - Ticket created

Subject:	Corrupted PDF produced by importPageIntoForm
Date:	Wed, 23 Mar 2016 16:09:15 +0100
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

Hi there! It looks like I've hit a bug, but honestly I don't know where to start looking. Attached you can find two (apparently) identical PDFs, one produced by xetex and one by luatex. When I import the one produced by luatex, the second page get lost. Evince just displays an empty page, while mupdf says: error: stack overflow warning: Ignoring errors during rendering mupdf: warning: Errors found on page Code to reproduce: #!perl use strict; use warnings; use PDF::API2; for my $file (qw/xetex luatex/) { my $in = PDF::API2->open("t/resources/$file.pdf"); my $out = PDF::API2->new; my $page = $out->page; my $gfx = $page->gfx; for my $p (1, 2) { if (my $included = $out->importPageIntoForm($in, $p)) { print "Including $p\n"; $gfx->formimage($included, 100 * $p, 0) } } $out->saveas("t/resources/$file.out.pdf"); $out->end; $in->end; }

Download xetex.pdf
application/pdf 26.2k

Message body not shown because it is not plain text.

Download luatex.pdf
application/pdf 26.9k

Message body not shown because it is not plain text.

Please let know if I can be of any help here. Best wishes -- Marco

Thu Mar 24 10:47:50 2016 philperry [...] hvc.rr.com - Correspondence added

Subject:	[rt.cpan.org #113293]
Date:	Thu, 24 Mar 2016 10:48:00 -0400
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Phil M Perry <philperry [...] hvc.rr.com>

Do you happen to know what PDF version is being output by XeTeX and LuaTeX? If they're greater than 1.3 (or maybe 1.4), they may have content that breaks PDF::API2. If you can throw a switch somewhere to output PDF 1.3 on these programs, could you see if the problem persists? One thing PDF::API2 needs is work to bring it up to PDF 1.7 compatibility, so it can safely import/load PDFs produced by other programs.

Thu Mar 24 10:47:50 2016 The RT System itself - Status changed from 'new' to 'open'

Thu Mar 24 11:12:31 2016 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113293]
Date:	Thu, 24 Mar 2016 16:12:17 +0100
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

"philperry@hvc.rr.com via RT" <bug-PDF-API2@rt.cpan.org> writes: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=113293 > > > Do you happen to know what PDF version is being output by XeTeX and > LuaTeX? If they're greater than 1.3 (or maybe 1.4), they may have > content that breaks PDF::API2. If you can throw a switch somewhere to > output PDF 1.3 on these programs, could you see if the problem persists? > One thing PDF::API2 needs is work to bring it up to PDF 1.7 > compatibility, so it can safely import/load PDFs produced by other programs.

XeTeX and LuaTeX usually give 1.5 output. With 1.4 output, things are fine. The files I sent are 1.5. The puzzling thing is that 1.5 produced by XeTeX is fine as well (as you can see from the tests), so yes, apparently luatex is including some content (which one is a good question) the module doesn't like (but doesn't throw an exception either). -- Marco

Sat Mar 26 10:45:57 2016 steve [...] deefs.net - Correspondence added

Here's what I've found so far: Using Perl 5.8.5 (the minimum supported version), your test script seg faults while trying to create luatex.out.pdf. It works fine with Perl 5.14.2 (my system Perl). Looking through both PDFs using contrib/pdf-debug.pl, they're both basically identical until we get to the XObject (form image) on page 2. For luatex.out.pdf, instead of including object 13 (/Contents of page 2), it's including object 2 (an /ObjStm). That explains why the resulting PDF is broken. Looking at luatex.pdf, object 13 is the second object in the file, so I'm wondering if the object stream contents are getting saved in the wrong slot and overwriting the page contents. It could just be coincidence, but it's where I'll be looking next.

Thu Jun 02 09:55:08 2016 steve [...] deefs.net - Correspondence added

Here's a brief update on this ticket: I spent a large chunk of yesterday trying to isolate the problem, but haven't gotten there yet. Observations: - The Page content stream in the source PDF is getting replaced by an object stream during an unrelated operation (copying a different object from the source PDF to the target PDF). - With the provided luatex.pdf, the problem always happens on the second page to be included (reversing the order from 1, 2 to 2, 1 causes page 1 to be corrupt). - importPageIntoForm calls walk_obj (see the $ssk foreach loop) which calls itself recursively. It calls the realise method on the source object if it's an indirect object. On the second page to be included, inside the first time importPageIntoForm calls walk_obj, but not necessarily the outer walk_obj call, this realise call is the one that causes the object stream to get embedded in the wrong place in the source PDF. - The realise method just calls read_obj from .../Basic/PDF/FIle.pm - read_obj calls read_objnum and merge. The merge function doesn't appear to be the problem, and it would make sense for read_objnum to have the bug since it's dealing with object streams. That's as far as I got before running out of time yesterday. It's possible I'll be able to work on it some more next week, but after that I'll be away until July. There will be another PDF::API2 release as soon as this issue gets fixed, so if anyone reading this wants to take the above and run with it, I'll happily give you full credit for a patch fixing the bug. :-)

Thu Jun 02 11:33:06 2016 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:	Thu, 02 Jun 2016 17:32:54 +0200
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

Show quoted text

> That's as far as I got before running out of time yesterday. It's > possible I'll be able to work on it some more next week, but after > that I'll be away until July. There will be another PDF::API2 release > as soon as this issue gets fixed, so if anyone reading this wants to > take the above and run with it, I'll happily give you full credit for > a patch fixing the bug. :-)

Hello Steve, I'll try to give a shoot at it as well when I'll grab some brain-fresh time (probably over the weekend). Best wishes -- Marco

Fri Jun 03 15:08:04 2016 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:	Fri, 03 Jun 2016 21:07:53 +0200
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

Some notes, hoping it will ring any bell: In the $ssk loop, it looks like the contents of the page changes, even if it's not directly passed to the function: print "Including resource <$s_pdf $k $sk $ssk> $xo->resource($sk,$ssk,walk_obj($self->{apiimportcache}->{$s_pdf},$s_pdf->{pdf},$self->{pdf},$s_page->{$k}->{$sk}->{$ssk})\n"; { my ($content) = $s_page->{Contents}->elementsof; print "Page content (before embedding $ssk $s_page->{$k}->{$sk}->{$ssk}) is " . ref($content) . "\n"; } my $walked = walk_obj($self->{apiimportcache}->{$s_pdf}, $s_pdf->{pdf}, $self->{pdf}, $s_page->{$k}->{$sk}->{$ssk}); { my ($content) = $s_page->{Contents}->elementsof; print "Page content (embedding $ssk) is " . ref($content) . "\n"; } $xo->resource($sk,$ssk,$walked); And yields: Page content (before embedding Im1 PDF::API2::Basic::PDF::Objind=HASH(0x192e8b0)) is PDF::API2::Content Realising objind 10 Realising objind 16 Realising objind 15 Page content (embedding Im1) is PDF::API2::Content Page content (before embedding F25 PDF::API2::Basic::PDF::Objind=HASH(0x192df50)) is PDF::API2::Content Realising objind 8 Realising objind 17 Realising objind 20 Realising objind 18 Realising objind 19 Realising objind 21 Page content (embedding F25) is PDF::API2::Basic::PDF::Dict Swapping the pages: Page content (before embedding F25 PDF::API2::Basic::PDF::Objind=HASH(0x36407e8)) is PDF::API2::Content Realising objind 8 Realising objind 21 Realising objind 17 Realising objind 20 Realising objind 19 Realising objind 18 Page content (embedding F25) is PDF::API2::Content Page content (before embedding Im1 PDF::API2::Basic::PDF::Objind=HASH(0x3601a80)) is PDF::API2::Content Realising objind 10 Realising objind 15 Realising objind 16 Page content (embedding Im1) is PDF::API2::Basic::PDF::Dict With the xetex file, the page contents is always a Content object. That Dict is the object stream which causes the corruption. So much for today -- Marco

Tue Jun 07 13:31:25 2016 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:	Tue, 07 Jun 2016 19:31:08 +0200
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

"Steve Simms via RT" <bug-PDF-API2@rt.cpan.org> writes: Show quoted text

> <URL: https://rt.cpan.org/Ticket/Display.html?id=113293 > > > Here's a brief update on this ticket: I spent a large chunk of > yesterday trying to isolate the problem, but haven't gotten there yet. > > Observations: > > - The Page content stream in the source PDF is getting replaced by an > object stream during an unrelated operation (copying a different > object from the source PDF to the target PDF). > > - With the provided luatex.pdf, the problem always happens on the > second page to be included (reversing the order from 1, 2 to 2, 1 > causes page 1 to be corrupt). > > - importPageIntoForm calls walk_obj (see the $ssk foreach loop) which > calls itself recursively. It calls the realise method on the source > object if it's an indirect object. On the second page to be included, > inside the first time importPageIntoForm calls walk_obj, but not > necessarily the outer walk_obj call, this realise call is the one that > causes the object stream to get embedded in the wrong place in the > source PDF. > > - The realise method just calls read_obj from .../Basic/PDF/FIle.pm > > - read_obj calls read_objnum and merge. The merge function doesn't > appear to be the problem, and it would make sense for read_objnum to > have the bug since it's dealing with object streams. > > That's as far as I got before running out of time yesterday. It's > possible I'll be able to work on it some more next week, but after > that I'll be away until July. There will be another PDF::API2 release > as soon as this issue gets fixed, so if anyone reading this wants to > take the above and run with it, I'll happily give you full credit for > a patch fixing the bug. :-)

Ok, I think at least I found where the problem is: diff --git a/lib/PDF/API2/Basic/PDF/File.pm b/lib/PDF/API2/Basic/PDF/File.pm index 0544aca..bd9fca7 100644 --- a/lib/PDF/API2/Basic/PDF/File.pm +++ b/lib/PDF/API2/Basic/PDF/File.pm @@ -545,14 +545,9 @@ sub readval { $value = $2; $str =~ s/^([0-9]+)$ws_char+([0-9]+)$ws_char+obj//s; ($obj, $str) = $self->readval($str, %opts, 'objnum' => $num, 'objgen' => $value); - if ($result = $self->test_obj($num, $value)) { - $result->merge($obj); - } - else { $result = $obj; $self->add_obj($result, $num, $value); $result->{' realised'} = 1; - } $str = update($fh, $str) if $update; # thanks to kundrat@kundrat.sk $str =~ s/^endobj//; } This patch appear to fix the issue in the testfile, but it's totally unclear if it has side effects or if it's doing the right thing. Let's consider this a starting point. -- Marco

Tue Jun 07 13:41:59 2016 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:	Tue, 07 Jun 2016 19:41:44 +0200
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

And some notes about the readval method, which is called recursively. 1. From read_objnum we pass these options: ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0); However, in the readval code, I can't find an usage for objnum and objgen, so it seems to me they are ignored, but could be mistaken. I suspect those values not obeyed is what lead to the caching problem. 2. From read_objnum my $stream = "$num 0 obj" . substr($objects, $start, $length); ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0); Unclear if that "$num 0 obj" should be "$num $gen obj" instead. I don't have a PDF with some object revisions, so dunno, just submitted to the attentions. 3. In readval my $update = defined($opts{update}) ? $opts{update} : 1; $str = update($fh, $str); While elsewere we have $str = update($fh, $str) if $update; So at the begin of the method, update is called even if update => 0 is passed. Point 2. and 3. are unrelated to the issue in the ticket, and are probably worth a comment in the code or in the doc to clarify it for the next souls wondering in this area of the code. I'll continue digging on the issue over the next days, assuming I'll have some spare time to throw at this. Cheers -- Marco

Wed Jun 08 11:16:06 2016 melmothx [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:	Wed, 08 Jun 2016 17:15:53 +0200
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Marco Pessotto <melmothx [...] gmail.com>

It looks like the patch is making the pdf.t going into very-very deep recursion, so no, definitively it's not doing the right thing. Kind of stuck here. -- Marco

Thu Oct 06 23:05:54 2016 steve [...] deefs.net - Correspondence added

Found it. The problem actually got introduced when the cross-reference stream was originally read, due to this gotcha in readxrtr: for $xmin ($start...$last) { # ... } # ... $self->{' maxobj'} = $xmin; The problem is that the "$xmin" in the for loop is an implicit local variable, so the $xmin outside the loop was still the default (1) instead of $last (27 in your file). When new_obj gets called later on, it increments maxobj to 2, then clobbers what was in that spot, which happened to be the object stream in your file. Everything we saw after that followed from that mistake. The repo has the fix, and it'll be in the next release. And I'll be very happy to stop banging my head on this problem. :-)

Thu Oct 06 23:06:05 2016 steve [...] deefs.net - Status changed from 'open' to 'resolved'

Thu Oct 06 23:19:59 2016 steve [...] deefs.net - Correspondence added

Show quoted text

> 1. From read_objnum we pass these options: > > ($object) = $self->readval($stream, %opts, objnum => $num, objgen => > $gen, update => 0); > > However, in the readval code, I can't find an usage for objnum and > objgen, so it seems to me they are ignored, but could be mistaken. > > I suspect those values not obeyed is what lead to the caching problem.

They do indeed seem to be ignored. This ended up being unrelated, though. There doesn't seem to be anywhere else in the codebase that uses them either, so I'm removing them to avoid confusion later. Show quoted text

> 2. From read_objnum > > my $stream = "$num 0 obj" . substr($objects, $start, $length); > ($object) = $self->readval($stream, %opts, objnum => $num, objgen => > $gen, update => 0); > > Unclear if that "$num 0 obj" should be "$num $gen obj" instead.

The generation number in object streams is always going to be zero, so even if objgen weren't being ignored, $gen would always be zero. Show quoted text

> 3. In readval > > my $update = defined($opts{update}) ? $opts{update} : 1; > $str = update($fh, $str); > > While elsewere we have $str = update($fh, $str) if $update; > So at the begin of the method, update is called even if update => 0 is > passed.

Hmm, good catch. It didn't end up mattering for this bug because we're passing the whole stream, so it was just adding garbage to the end of it, which is probably going to be the case any time we set $update to 0 anyway, but I've added the "if $update" to that line as well.

Thu Oct 06 23:23:58 2016 steve [...] deefs.net - Status changed from 'resolved' to 'patched'

Mon Oct 10 09:23:41 2016 steve [...] deefs.net - Status changed from 'patched' to 'resolved'

Mon Oct 10 09:23:42 2016 steve [...] deefs.net - Broken in 2.026 added

Mon Oct 10 09:23:42 2016 steve [...] deefs.net - Fixed in 2.029 added