
This queue is for tickets about the PDF-Reuse CPAN distribution.

Report information
The Basics
Id: 120401
Status: patched
Priority: 0/
Queue: PDF-Reuse

People
Owner: cnighs [...] cpan.org
Requestors: sprout [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Reads the entire contents when extracting a single page
Both prDoc and prSinglePage read the entire contents of the source PDF, even when extracting just a single page. I tried extracting a page from a 160MB PDF with:

    prFile("pr.pdf");
    prSinglePage('source.pdf', 1);
    prEnd();

and it produced a 160MB file with one page. Since I made the same mistake when writing PDF::Tiny, my guess is that you are following the page’s /Parent pointer, which points to the page tree, which in turn points to all the other pages, and dragging those in, too, when importing a page.
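For reference, the /Parent chain described above looks roughly like this in PDF source. (Object numbers and dictionary contents here are invented for illustration, not taken from the file in question.)

```
3 0 obj                 % the single page being imported
<< /Type /Page
   /Parent 2 0 R        % back-pointer to the page tree
   /Contents 4 0 R
>>
endobj

2 0 obj                 % the page tree
<< /Type /Pages
   /Kids [3 0 R 5 0 R 6 0 R]   % every page in the document
   /Count 3
>>
endobj
```

An importer that naively copies everything reachable from object 3 will follow /Parent to object 2, and from there /Kids to every other page in the document.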
Do you think that this is related to bug #120346?
On Mon Feb 27 09:10:16 2017, CNIGHS wrote:
> Do you think that this is related to bug #120346?
Actually I see it is not. I have noticed this issue before. I'm reading over the code trying to understand what's going on. Unfortunately, I did not write the original code...
I've created a branch for this bug. Here is the relevant sub (I think): https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L5965
I think this might be related to 120346 after all. This is suspect:

    m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os

at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062

I wonder if the whitespace issue is causing things to get confused, and we end up with the entire original PDF rather than just the page(s) we wanted?
On Mon Feb 27 10:51:28 2017, CNIGHS wrote:
> I think this might be related to 120346 after all. This is suspect:
>
>     m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os
>
> at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062
>
> I wonder if the whitespace issue is causing things to get confused,
> and we end up with the entire original PDF rather than just the
> page(s) we wanted?
I don’t believe it is. The PDF in question is generated by ABBYY FineReader, which uses single spaces in references. (And it is 169 MB, so it is a bit hard to share.) I tried extracting a couple of pages from it with PDF::API2 and then extracting one page from the result with PDF::Reuse, and the problem went away. I’m still looking into it.
On Tue Feb 28 01:26:45 2017, SPROUT wrote:
> On Mon Feb 27 10:51:28 2017, CNIGHS wrote:
> > I think this might be related to 120346 after all. This is suspect:
> >
> >     m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os
> >
> > at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062
> >
> > I wonder if the whitespace issue is causing things to get confused,
> > and we end up with the entire original pdf rather than just the
> > page(s) we wanted?
>
> I don’t believe it is. The PDF in question is generated by ABBYY
> FineReader, which uses single spaces in references. (And it is 169
> MB, so it is a bit hard to share.)
>
> I tried extracting a couple of pages from it with PDF::API2 and then
> extracting one page from the result with PDF::Reuse, and the problem
> went away.
>
> I’m still looking into it.
I’ve found the problem. ABBYY FineReader sometimes puts a couple of whitespace characters at the offset indicated by startxref, before the xref keyword. PDF::Reuse assumes that the word ‘xref’, followed by a line ending, occurs at that location, so it skips 5 bytes before reading the xref entries. For the PDF in question it was therefore starting at the ‘f’ in ‘xref’ and, not seeing any numbers on that line, assumed it had reached the end. Since the PDF was linearized and had two cross-reference tables, it read only the first one (with about 50 entries), whereas the combined cross-reference table should have had nearly two thousand entries.

The way PDF::Reuse reads objects is interesting. It sorts all the offsets in the xref table and then uses them to determine how many bytes to read for each object, on the assumption that the next offset marks the end of the current object. (Makes sense. It also makes PDF::Reuse immune to bugs like #120397.) If the xref table is misread, then the objects whose xref entries were not read are considered part of whichever object preceded them in the file. So *that* explains the file bloat.

With the attached patch, PDF::Reuse will extract a page from my 169.5 MB file in a split second, rather than 15 seconds.
Subject: open_7PxhnnHY.txt
--- /Users/sprout/.cpan/build/PDF-Reuse-0.39-tvZelW/lib/PDF/Reuse.pm	2016-09-27 08:48:32.000000000 -0700
+++ /Users/sprout/Perl/dist/PDF-Tiny/lib/PDF/Reuse.pm	2017-02-28 11:03:51.000000000 -0800
@@ -4479,11 +4480,19 @@
    my ($i, $root, $antal);
    $nr++;
    $oldObject{('xref' . "$nr")} = $xref;      # Offset för xref sparas
-   $xref += 5;
    sysseek INFIL, $xref, 0;
+   sysread INFIL, my $buf, 30;
+   if ($buf =~ /xref/) {
+       sysseek INFIL, $xref+$-[0]+5, 0;
+   }
+   else {
+       # If the regexp fails (it shouldn't), fall back to the previous
+       # behaviour.
+       sysseek INFIL, $xref + 5, 0;
+   }
    $xref = 0;
    my $inrad = '';
-   my $buf = '';
+   $buf = '';
    my $c;
    sysread INFIL, $c, 1;
    while ($c =~ m!\s!s)
That looks great. I've pushed this up to the related branch and will include it in the next release. https://github.com/cnighswonger/PDF-Reuse/commits/bugs/120401 Thanks for your help!