
This queue is for tickets about the PDF-Reuse CPAN distribution.

Report information
The Basics
Id: 120401
Status: patched
Priority: 0/
Queue: PDF-Reuse

People
Owner: cnighs [...] cpan.org
Requestors: sprout [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Reads the entire contents when extracting a single page
Both prDoc and prSinglePage read the entire contents of the source PDF, even when extracting just a single page. I tried extracting a page from a 160MB PDF with:

    prFile("pr.pdf");
    prSinglePage('source.pdf', 1);
    prEnd();

and it produced a 160MB file with one page. Since I made the same mistake when writing PDF::Tiny, my guess is that you are following the page’s /Parent pointer, which points to the page tree, which in turn points to all the other pages, and dragging those in, too, when importing a page.
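For reference, the /Parent chain described above looks roughly like this in PDF source. (Object numbers and dictionary contents here are invented for illustration, not taken from the file in question.)

```
3 0 obj                 % the single page being imported
<< /Type /Page
   /Parent 2 0 R        % back-pointer to the page tree
   /Contents 4 0 R
>>
endobj

2 0 obj                 % the page tree
<< /Type /Pages
   /Kids [3 0 R 5 0 R 6 0 R]   % every page in the document
   /Count 3
>>
endobj
```

An importer that naively copies everything reachable from object 3 will follow /Parent to object 2, and from there /Kids to every other page in the document.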
Do you think that this is related to bug #120346?
On Mon Feb 27 09:10:16 2017, CNIGHS wrote:
> Do you think that this is related to bug #120346?
Actually I see it is not. I have noticed this issue before. I'm reading over the code trying to understand what's going on. Unfortunately, I did not write the original code...
I've created a branch for this bug. Here is the relevant sub (I think): https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L5965
I think this might be related to 120346 after all. This is suspect:

    m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os

at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062

I wonder if the whitespace issue is causing things to get confused, and we end up with the entire original PDF rather than just the page(s) we wanted?
On Mon Feb 27 10:51:28 2017, CNIGHS wrote:
> I think this might be related to 120346 after all. This is suspect:
>
>     m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os
>
> at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062
>
> I wonder if the whitespace issue is causing things to get confused,
> and we end up with the entire original PDF rather than just the
> page(s) we wanted?
I don’t believe it is. The PDF in question is generated by ABBYY FineReader, which uses single spaces in references. (And it is 169 MB, so it is a bit hard to share.) I tried extracting a couple of pages from it with PDF::API2 and then extracting one page from the result with PDF::Reuse, and the problem went away. I’m still looking into it.
On Tue Feb 28 01:26:45 2017, SPROUT wrote:
> On Mon Feb 27 10:51:28 2017, CNIGHS wrote:
> > I think this might be related to 120346 after all. This is suspect:
> >
> >     m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os
> >
> > at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062
> >
> > I wonder if the whitespace issue is causing things to get confused,
> > and we end up with the entire original pdf rather than just the
> > page(s) we wanted?
>
> I don’t believe it is. The PDF in question is generated by ABBYY
> FineReader, which uses single spaces in references. (And it is 169
> MB, so it is a bit hard to share.)
>
> I tried extracting a couple of pages from it with PDF::API2 and then
> extracting one page from the result with PDF::Reuse, and the problem
> went away.
>
> I’m still looking into it.
I’ve found the problem. ABBYY FineReader sometimes puts a couple of whitespace characters at the offset indicated by startxref, before the xref keyword. PDF::Reuse assumes that the word ‘xref’, followed by a line ending, occurs at that location, so it skips 5 bytes before reading the xref entries. For the PDF in question it was therefore starting at the ‘f’ in ‘xref’ and, not seeing any numbers on that line, assumed it had reached the end. Since the PDF was linearized and had two cross-reference tables, it read only the first one (with about 50 entries), whereas the combined cross-reference table should have had nearly two thousand entries.

The way PDF::Reuse reads objects is interesting. It sorts all the offsets in the xref table and then uses them to determine how many bytes to read for each object, on the assumption that the next offset marks the end of the current object. (Makes sense. It also makes PDF::Reuse immune to bugs like #120397.) If the xref table is misread, then the objects whose xref entries were not read are considered part of whichever object preceded them in the file. So *that* explains the file bloat.

With the attached patch, PDF::Reuse will extract a page from my 169.5 MB file in a split second, rather than 15 seconds.
Subject: open_7PxhnnHY.txt
--- /Users/sprout/.cpan/build/PDF-Reuse-0.39-tvZelW/lib/PDF/Reuse.pm	2016-09-27 08:48:32.000000000 -0700
+++ /Users/sprout/Perl/dist/PDF-Tiny/lib/PDF/Reuse.pm	2017-02-28 11:03:51.000000000 -0800
@@ -4479,11 +4480,19 @@
    my ($i, $root, $antal);
    $nr++;
    $oldObject{('xref' . "$nr")} = $xref;      # Offset för xref sparas
-   $xref += 5;
    sysseek INFIL, $xref, 0;
+   sysread INFIL, my $buf, 30;
+   if ($buf =~ /xref/) {
+       sysseek INFIL, $xref+$-[0]+5, 0;
+   }
+   else {
+       # If the regexp fails (it shouldn't), fall back to the previous
+       # behaviour.
+       sysseek INFIL, $xref + 5, 0;
+   }
    $xref = 0;
    my $inrad = '';
-   my $buf = '';
+   $buf = '';
    my $c;
    sysread INFIL, $c, 1;
    while ($c =~ m!\s!s)
That looks great. I've pushed this up to the related branch and will include it in the next release. https://github.com/cnighswonger/PDF-Reuse/commits/bugs/120401 Thanks for your help!