On Tue Feb 28 01:26:45 2017, SPROUT wrote:
> On Mon Feb 27 10:51:28 2017, CNIGHS wrote:
> > I think this might be related to 120346 after all. This is suspect:
> >
> > m'/Pages\s+(\d+)\s{1,2}\d+\s{1,2}R'os
> >
> > at https://github.com/cnighswonger/PDF-Reuse/blob/bugs/120401/lib/PDF/Reuse.pm#L6062
> >
> > I wonder if the whitespace issue is causing things to get confused,
> > and we end up with the entire original pdf rather than just the
> > page(s) we wanted?
>
> I don’t believe it is. The PDF in question is generated by ABBYY
> FineReader, which uses single spaces in references. (And it is 169
> MB, so it is a bit hard to share.)
>
> I tried extracting a couple of pages from it with PDF::API2 and then
> extracting one page from the result with PDF::Reuse, and the problem
> went away.
>
> I’m still looking into it.
I’ve found the problem. ABBYY FineReader sometimes puts a couple of whitespace characters at the offset indicated by startxref, before the xref keyword. PDF::Reuse assumes that the word ‘xref’, followed by a return, occurs at that location, so it skips 5 bytes before reading the xref entries. So for the PDF in question it was starting at the ‘f’ in xref and, not seeing any numbers on that line, assumed it had reached the end.
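
To illustrate the off-by-whitespace failure described above, here is a minimal Python sketch. It is not the PDF::Reuse code (which is Perl) and not the attached patch; the function names `naive_entries_start` and `tolerant_entries_start` are hypothetical, and the sample bytes are made up. It only shows why a fixed 5-byte skip lands on the 'f' when two whitespace characters precede the keyword, and how matching optional whitespace avoids that.

```python
import re

# Optional leading whitespace, the keyword, then one end-of-line marker.
XREF_KEYWORD = re.compile(rb"\s*xref[ \t]*(?:\r\n|\r|\n)")

def naive_entries_start(startxref: int) -> int:
    """Assume 'xref' plus one line break sits exactly at startxref."""
    return startxref + 5

def tolerant_entries_start(data: bytes, startxref: int) -> int:
    """Skip any whitespace before 'xref' and return the offset just past its line."""
    match = XREF_KEYWORD.match(data, startxref)
    if match is None:
        raise ValueError("no xref keyword at or near the startxref offset")
    return match.end()

# Made-up fragment: two spaces precede the keyword, as in the problem file.
sample = b"  xref\n0 2\n0000000000 65535 f \n0000000017 00000 n \n"

print(sample[naive_entries_start(0):][:4])               # starts at the 'f' of xref
print(sample[tolerant_entries_start(sample, 0):][:4])    # starts at the subsection header
```

With the fixed skip, the reader sees a line containing only "f" and no numbers, which matches the symptom above. With the tolerant match, it reaches the "0 2" subsection header.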
Since the PDF was linearized and had two cross-reference tables, it was only reading the first one (with about 50 entries), whereas the combined cross-reference table was supposed to have nearly two thousand entries.
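
As a rough picture of what "combined cross-reference table" means here, the following Python sketch merges several xref sections into one table. The function name `merge_xref_sections` and the sample numbers are hypothetical, and it assumes the sections have already been parsed into dictionaries. It is not how PDF::Reuse is written. In PDF, sections are chained through the trailer's Prev entry and the most recent entry for an object number wins, so the sketch keeps the first value seen when sections are passed newest first.

```python
def merge_xref_sections(sections):
    """Merge xref sections given newest first.

    Each section maps object number -> byte offset. An entry from a
    newer section takes precedence over one from an older section.
    """
    merged = {}
    for section in sections:
        for obj_num, offset in section.items():
            merged.setdefault(obj_num, offset)
    return merged

# Made-up numbers for illustration only.
first_page_section = {5: 900, 6: 1200}
main_section = {1: 15, 2: 80, 3: 400, 5: 700}

combined = merge_xref_sections([first_page_section, main_section])
print(sorted(combined.items()))

# If one section fails to parse and yields nothing, its objects are
# simply absent from the combined table.
partial = merge_xref_sections([first_page_section, {}])
print(sorted(partial.items()))
```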
The way PDF::Reuse reads objects is interesting. It sorts all the offsets in the xref table and then uses them to determine how many bytes to read for each object, under the assumption that the next offset marks the end of the object. (Makes sense. It also makes PDF::Reuse immune to bugs like #120397.)
If the xref table is misread, then the objects whose xref entries were not read will be considered part of whichever object preceded them in the file. So *that* explains the file bloat.
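
The sorted-offset approach and the resulting bloat can be sketched in a few lines of Python. Again, this is an illustration under assumptions rather than the PDF::Reuse implementation: the name `object_extents`, the `end` parameter (the offset where object data stops), and all offsets are made up.

```python
def object_extents(offsets, end):
    """Estimate each object's length from the next known offset.

    offsets: mapping of object number -> byte offset from the xref table.
    end: offset where the object data stops (hypothetical parameter).
    Returns a mapping of start offset -> assumed length in bytes.
    """
    ordered = sorted(offsets.values())
    bounds = ordered + [end]
    return {start: bounds[i + 1] - start for i, start in enumerate(ordered)}

# Made-up offsets for four objects in a 400-byte body.
full_table = {1: 10, 2: 50, 3: 120, 4: 300}
print(object_extents(full_table, 400))

# Same file, but the entries for objects 3 and 4 were never read.
misread_table = {1: 10, 2: 50}
print(object_extents(misread_table, 400))
```

With the full table, object 2 is taken to be 70 bytes long. With the misread table, object 2 appears to be 350 bytes long, because the bytes of objects 3 and 4 are counted as part of it. Copying object 2 then drags those objects along.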
With the attached patch, PDF::Reuse will extract a page from my 169.5 MB file in a split second, rather than 15 seconds.