Bug #133131 for PDF-API2: how do I file an issue for PDF::API2?

Sat Aug 08 11:48:56 2020 Christopher.Papademetrious [...] synopsys.com - Ticket created

Subject:	how do I file an issue for PDF::API2?
Date:	Sat, 8 Aug 2020 15:42:53 +0000
To:	"bug-PDF-API2 [...] rt.cpan.org" <bug-PDF-API2 [...] rt.cpan.org>
From:	Chris Papademetrious <Christopher.Papademetrious [...] synopsys.com>

Hi Steve,

I hope this email finds you at all! We're using PDF::API2 at our company, and I ran into a PDF that it doesn't like:

    Reading 'chrispy/fcdm.pdf'...
    Can't parse `' near 2316292767724077056 length 0. at /u/doc/perl5/lib/perl5/PDF/API2/Basic/PDF/File.pm line 694.

The parser hits a zero-length string and doesn't know what to do with it.

I was going to file an issue in the Github repo, but there's no "Issues" tab like I see for other repos. How do you suggest that I proceed?

And thank you for taking on ownership of this repo! It's a tremendously powerful library, and there's not an alternative that does everything that it does.

-----
Chris Papademetrious
Tech Writer, Implementation Group
(610) 628-9718 home office
(570) 460-6078 cell

Sat Aug 08 11:51:54 2020 chrispitude [...] gmail.com - Correspondence added

Ugh, I didn't realize this was going to create a ticket. I guess you're using rt.cpan.org for issue tracking but Github for revision control?

Sat Aug 08 11:51:55 2020 The RT System itself - Status changed from 'new' to 'open'

Sat Aug 08 12:43:44 2020 PMPERRY [...] cpan.org - Correspondence added

An empty string $str was fed to the readval() method, so it doesn't know how to handle it. Can you tell us what you were trying to do (read in an existing PDF?) and maybe show a small Perl test case? Feeding an empty (or nonsense format) string (PDF file content) into this routine will produce this error, but we need to figure out what happened "upstream" that triggered it.

A file offset of 2.3 quintillion? That's bigger than any PDF file I've ever heard of, and possibly bigger than any filesystem can handle. Something seems to have gone very wrong there, but without knowing what you were doing, it's hard to diagnose.

> there's not an alternative that does everything that it does

If you think that your PDF you're reading in is OK (loads OK in all the readers you try, AND NO READER ASKS PERMISSION TO SAVE THE FILE WHEN YOU EXIT IT), you might give PDF::Builder a try and see if that works any better. Maybe something upstream is different enough to get past whatever is causing the problem. It includes instructions on porting from PDF::API2, but usually just changing "PDF::API2" to "PDF::Builder" in your Perl code will do the job. By the way, if your PDF is PDF version 1.5 or higher, that can blow up PDF::API2. Builder might handle it a little better, but no promises.

Sat Aug 08 14:26:24 2020 chrispitude [...] gmail.com - Correspondence added

Subject:	PDF::API2 unable to open a compressed-stream PDF file

Thanks Phil! Given the history described in your docs and your desire for more aggressive issue resolution, it appears I should indeed be using PDF::Builder instead.

I'm trying to open a PDF file written out by an authoring/publishing tool:

    #!/usr/bin/perl
    use PDF::Builder;
    my $pdf = PDF::Builder->open('bad.pdf');

I've isolated the issue to a single-page 7k PDF that fails to load in PDF::API2 and PDF::Builder, yet reads into Ghostscript (and every other tool I've tried) without complaint. I tried decompressing the streams with

    qpdf --qdf --object-streams=disable bad.pdf bad_decompressed.pdf

but PDF::API2 and PDF::Builder are both able to read the stream-decompressed version of the file.

I'm attaching the problematic PDF file for your thoughts.

Attachment: bad.pdf (application/pdf, 6.9k)

Sat Aug 08 16:15:08 2020 chrispitude [...] gmail.com - Correspondence added

The bad.pdf attachment in the previous message is 7164 bytes:

    % ls -l bad.pdf
    -rwxrwxrwx 1 chrispy chrispy 7164 Aug 8 14:21 bad.pdf

It fails with an error about parsing an endstream construct:

    ====
    Reading 'bad.pdf'...
    Can't parse `endstream endobj startxref 6737 %%EOF ' near 7165 length 40. at /usr/local/share/perl/5.26.1/PDF/Builder/Basic/PDF/File.pm line 726.
    ====

I have another PDF file, bad2.pdf (attached to this message) that is 7406 bytes, but fails with an error about parsing an empty string:

    ====
    Reading 'bad2.pdf'...
    Can't parse `' near 6419318318863220736 length 0. at /usr/local/share/perl/5.26.1/PDF/Builder/Basic/PDF/File.pm line 726.
    ====

If I add or remove a few characters in the authoring tool, the resulting PDF fails with either the endstream message or the empty-string message. I suspect they're both variants of the same boundary behavior in the parsing.

Attachment: bad2.pdf (application/pdf, 7.2k)

Sat Aug 08 19:31:23 2020 PMPERRY [...] cpan.org - Correspondence added

I can see a major problem right away. A PDF-1.4 file should have 'startxref' pointing to the cross-reference TABLE headed by 'xref' and the starting object and length. Instead, your 'bad.pdf' startxref is pointing to an object of type XRef, which appears to be a cross-reference STREAM. The minimum PDF level for a cross-reference stream is 1.5. Any idea how a PDF with 1.5 level features was labeled as 1.4? I don't think it's going to work.
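To illustrate the distinction drawn in this message, here is a minimal Perl sketch that checks what startxref points at: the keyword 'xref' (a classic cross-reference table) or an indirect object header such as '5 0 obj' (presumably a cross-reference stream). The helper name xref_kind and the toy input strings are hypothetical, and this is not how PDF::API2 or PDF::Builder performs the check.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether startxref points at a classic
# cross-reference table or at an object (presumably an XRef stream).
sub xref_kind {
    my ($pdf) = @_;

    # The startxref at the end of the file gives the byte offset to inspect.
    my ($offset) = $pdf =~ /startxref\s+(\d+)\s+%%EOF\s*\z/
        or return 'unknown';
    return 'unknown' if $offset >= length $pdf;

    my $target = substr($pdf, $offset, 32);
    return 'table'  if $target =~ /\Axref\b/;
    return 'stream' if $target =~ /\A\d+\s+\d+\s+obj\b/;
    return 'unknown';
}

# Toy inputs, not real PDFs: only the parts the helper inspects.
my $head        = "%PDF-1.4\n";
my $with_table  = $head . "xref\n0 1\n0000000000 65535 f \n";
my $with_stream = $head . "5 0 obj\n<< /Type /XRef >>\nstream\n";
$_ .= "startxref\n" . length($head) . "\n%%EOF\n" for $with_table, $with_stream;

print xref_kind($with_table),  "\n";   # table
print xref_kind($with_stream), "\n";   # stream
```

A real check would also need to parse the object's dictionary and confirm /Type /XRef; the sketch only looks at the first few bytes.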

Sat Aug 08 19:47:14 2020 PMPERRY [...] cpan.org - Correspondence added

When I uncompressed the content streams, using PDFtk, it also worked with the 'open' call. I see that PDFtk gave me a proper PDF-1.4 cross-reference table structure when it created the new PDF, and this may have happened to you with qpdf.

Sun Aug 09 08:19:09 2020 chrispitude [...] gmail.com - Correspondence added

Hi Phil, Thanks for debugging the problem! Now I know the issue is with the PDF itself, not the parsing code. I'll take this up with the software vendor. And I'm going to have a closer look at PDF::Builder today - thanks for your efforts on this too! Admins, feel free to close this ticket. (I don't have permissions to do so.)

Sun Aug 09 10:08:17 2020 PMPERRY [...] cpan.org - Correspondence added

Chris, I can't guarantee that is the problem, but using a cross-reference stream in a PDF-1.4 document looks very suspicious to me. Certainly you should take it up with the vendor and see if they have an explanation for that (and why they feel it's OK to do). Please get back to us with whatever you find. Since you opened this ticket by mail, I'm not sure you can close it yourself. If you can't, only Steve (the owner) can. If you have any issues with PDF::Builder, please use its GitHub issues area to discuss them. Please don't clutter up PDF::API2's CPAN RT area with other products' issues.

Mon Aug 10 07:01:01 2020 futuramedium [...] yandex.ru - Correspondence added

> Now I know the issue is with the PDF itself, not the parsing code.

Actually, it is an issue with the parsing code, a bug in PDF::API2. In this line:

https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Basic/PDF/File.pm#L1146

the template should be 'Q>'.

Moreover, limiting the possible widths to 1, 2, 3, 4, and 8 bytes in the enclosing subroutine is arbitrary, but (1) at least there's provision to die noisily; (2) the likelihood of needing any value above 4 is extremely low. There's probably not much need to rewrite the sub beyond inserting the '>', but there's the PDF::Tiny and CAM::PDF source for inspiration, if you decide otherwise.

+ The guys who use 8 bytes to encode offsets in their PDF lib are lazy indeed. The version in the header is overridden by the document catalogue entry, so it doesn't matter.
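The effect of the missing '>' can be reproduced from the numbers already in this ticket. The sketch below assumes the failing runs were on a little-endian machine (so plain 'Q' read the bytes low-order first) and a Perl built with 64-bit integers; it takes the two implausible offsets from the error messages, rebuilds the 8 raw bytes, and decodes them high-order byte first.

```perl
use strict;
use warnings;

# Requires a Perl built with 64-bit integers (the 'Q' template).
# These are the offsets printed in the two "Can't parse `'" errors above.
for my $bogus (2316292767724077056, 6419318318863220736) {
    # Recreate the 8 raw bytes that a little-endian read turned into $bogus...
    my $raw = pack('Q<', $bogus);

    # ...and decode them high-order byte first, as 'Q>' does.
    my $offset = unpack('Q>', $raw);

    printf "%s -> bytes %s -> %d\n", $bogus, unpack('H*', $raw), $offset;
}
```

On a 64-bit Perl this prints offsets of 2106656 and 5721. The second fits inside the 7406-byte bad2.pdf, which is consistent with the field having been stored big-endian and merely read in the wrong byte order.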

Mon Aug 10 19:42:46 2020 chrispitude [...] gmail.com - Correspondence added

futuramedium, THANK YOU!! Out of 186 PDFs that exhibited the problem, your suggested code change fixed all of them. And the same code change worked equally well for both PDF::API2 and PDF::Builder.

We just installed the latest release of our publishing software, and it uses Apache FOP for publishing, so it's possible that this problematic output construct might occur outside my organization too.

Phil, do you want me to file a PDF::Builder issue for this code change? And it looks like this ticket should indeed stay open for the change to be made in PDF::API2!

Thanks again to everyone who jumped in and quickly bashed this one out.

Mon Aug 10 21:48:54 2020 PMPERRY [...] cpan.org - Correspondence added

No need to open a PDF::Builder bug ticket... I have this one on file. I will consider putting the patch in once I have a chance to carefully examine it and determine if it's really a useful fix, or is just papering over a PDF bug. I'm still very concerned over apparently putting a cross-reference stream (PDF-1.5) into a PDF-1.4 document. It would be nice to hear your vendor's explanation of why they did it that way. I'm reluctant to allow a PDF-1.5 feature in reading in a PDF-1.4 document. If I can detect that it's a cross-reference stream, I might be able to bump up the version to 1.5 on the fly, but I have to carefully look at it first. PDF::API2 (and PDF::Builder) had some code added recently to handle cross-reference streams without blowing up, but I want to make sure I understand the full picture before I start slapping in ad-hoc fixes.

Mon Nov 09 21:29:41 2020 PMPERRY [...] cpan.org - Correspondence added

Christopher, did you ever get a chance to ask your vendor why they have a cross-reference stream in a file claimed to be level 1.4? With Vadim's patch, code that Steve put in earlier to handle this PDF 1.5 feature may be getting it through OK, but I'm still worried about what else may be waiting to go wrong. I still say that the PDF is wrong. Vadim, if I read the cross-reference stream documentation correctly, it allows widths of 1, 2, 3, 4, and 8. 3 is actually treated as 4, with a 00 byte shoved in front. It seems to say that these fields should be Big-Endian (MSB). Are we in agreement? Then why (without the > flag) does it properly handle 2, 3, and 4 byte lengths, but treat 8 byte widths as Little-Endian? In other words, why don't 2, 3, and 4 widths require '>' too? Did the PDF writer create the value Little-Endian, and the '>' turns it around (flips and then reads it Big-Endian), and if so, why don't 2, 3, and 4 need this treatment? I'm just uncomfortable with putting this patch in until I understand why '8' is so different -- or was the PDF created incorrectly, with 64-bit integers flipped around? I have been unable to read the stream directly, as it's flate compressed, and when uncompressing it, PDFtk changes it to a cross-reference TABLE even if I first change the PDF version to 1.5. So, I can't see how the original value was stored (as Big-Endian or Little-Endian).

Mon Nov 09 22:00:32 2020 PMPERRY [...] cpan.org - Correspondence added

After I sent off the last post, I figured out how to look at the cross-reference table data (dumping it in File.pm). Only widths 2 and 8 are used, and in both, the data appears to be Big-Endian (MSB). So this aspect of the PDF, anyway, appears to have been written correctly. 'n' and 'N' codes don't allow '>', so that's a moot point. So the question remains, why does 'Q' require an explicit '>' to be read correctly, and will this change on machines which are natively Big-Endian (non-Intel chips)? Is it that 'Q' allows either way, and if you're possibly transferring data across chip types, you'd better specify explicitly that the PDF data you're unpacking with 'Q' is Big-Endian? Also, if writing (pack) with 'Q', will it write Little-Endian on an Intel chip?

Sat Nov 14 17:19:17 2020 futuramedium [...] yandex.ru - Correspondence added

Hi,

> I read the cross-reference stream documentation correctly, it allows widths of 1, 2, 3, 4, and 8

No, width can be any, and is not limited, btw

> So the question remains, why does 'Q' require an explicit '>' to be read correctly, and will this change on machines which are natively Big-Endian (non-Intel chips)?

It won't. To quote the Reference: "Fields [in a cross-reference stream] requiring more than one byte are stored with the high-order byte first."

> file claimed to be level 1.4

It didn't. The Version ("Optional; PDF 1.4") entry in catalog dictionary takes precedence "if later than the version specified in the file’s header". So, 1.4-compliant consumer must consult this entry first, before making final decision about version. It follows, that 1.5-compliant consumer (which PDF::API2 is) must try to read cross-reference stream if required, to check that "Version" entry, even if header says "1.4". It's a bit of a conundrum, I agree, but it's how things are.
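The precedence rule quoted here can be sketched in a few lines of Perl. The helper name effective_version is hypothetical, and the sketch assumes both versions are simple "major.minor" strings that have already been extracted from the header and the catalog; it is not taken from PDF::API2 or PDF::Builder.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the version a reader should honor, given the
# header version and the (optional) catalog /Version entry.
sub effective_version {
    my ($header, $catalog) = @_;
    return $header unless defined $catalog;

    my ($h_major, $h_minor) = split /\./, $header;
    my ($c_major, $c_minor) = split /\./, $catalog;

    # The catalog entry wins only if it is later than the header version.
    my $catalog_is_later = $c_major > $h_major
        || ($c_major == $h_major && $c_minor > $h_minor);
    return $catalog_is_later ? $catalog : $header;
}

print effective_version('1.4', '1.5'), "\n";   # 1.5: catalog is later
print effective_version('1.6', '1.5'), "\n";   # 1.6: header stays
print effective_version('1.4', undef), "\n";   # 1.4: no catalog entry
```

The catch described in the message still applies: when the catalog is only reachable through a cross-reference stream, the reader has to parse that stream before it can call a helper like this.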

Sun Nov 15 20:00:43 2020 PMPERRY [...] cpan.org - Correspondence added

On Sat Nov 14 17:19:17 2020, vadimr wrote:

> Hi,
>
> > I read the cross-reference stream documentation correctly, it allows widths of 1, 2, 3, 4, and 8
>
> No, width can be any, and is not limited, btw

I'm not following you. The width field is one of those widths (integer byte size), isn't it? The resulting width integer can of course be any legitimate positive integer that a field of that size can hold.

> > So the question remains, why does 'Q' require an explicit '>' to be read correctly, and will this change on machines which are natively Big-Endian (non-Intel chips)?
>
> It won't. To quote the Reference: "Fields [in a cross-reference stream] requiring more than one byte are stored with the high-order byte first."

Let me clarify what I was asking. I can see that the data *was* high-order byte first ("network order"/Big Endian) in the file, which is correct. What I was asking was what happens for just 'Q' (as the original code was), as opposed to 'Q>'. If using just 'Q', it appears that my Intel CPU reads (and writes) in low-order byte (Little Endian). Why is 'Q' treated differently, and I need to give the byte order explicitly? This isn't a PDF::API2/Builder issue; it's a Perl issue.

This also brings up the problem that if your Perl isn't compiled for 64 bit integers, supposedly it's going to blow up on a 'Q' (or 'Q>') unpack. If this particular PDF was being processed on a 32 bit Perl, it's likely to fall over dead. I wonder if we should use the Config package to query if Perl supports 64 bit integers, and if not, see if we can treat it as a 32 bit (unsigned) integer by just unpacking the bottom 32 bits (first checking that the first 33 bits, including the bottom's sign, are 0)? Or, is 32 bit Perl so rare these days that we shouldn't bother? This appears to be the only place that 64 bits are baked into the library.

> > file claimed to be level 1.4
>
> It didn't. The Version ("Optional; PDF 1.4") entry in catalog dictionary takes precedence "if later than the version specified in the file’s header". So, 1.4-compliant consumer must consult this entry first, before making final decision about version. It follows, that 1.5-compliant consumer (which PDF::API2 is) must try to read cross-reference stream if required, to check that "Version" entry, even if header says "1.4". It's a bit of a conundrum, I agree, but it's how things are.

Ah, poking through the PDF I can see a Catalog entry for /Version /1.5. Still, that's pretty sloppy to declare 1.4 in the header and then 1.5 deep down inside. Anyway, it looks like I'm going to have to figure out how to read this Catalog(s) for an overriding Version entry, before doing any checking for version-dependent features.

Mon Nov 16 09:33:23 2020 chrispitude [...] gmail.com - Correspondence added

On Mon Nov 09 21:29:41 2020, PMPERRY wrote:

> Christopher, did you ever get a chance to ask your vendor why they have a cross-reference stream in a file claimed to be level 1.4? With Vadim's patch, code that Steve put in earlier to handle this PDF 1.5 feature may be getting it through OK, but I'm still worried about what else may be waiting to go wrong. I still say that the PDF is wrong.

Hi Phil,

The product is PDF Chemistry, a DITA publishing tool from Syncrosoft. PDF Chemistry uses Apache FOP internally for PDF creation, but I did not learn any specifics beyond that.

On Sat Nov 14 17:19:17 2020, vadimr wrote:

> It didn't. The Version ("Optional; PDF 1.4") entry in catalog dictionary takes precedence "if later than the version specified in the file’s header". So, 1.4-compliant consumer must consult this entry first, before making final decision about version. It follows, that 1.5-compliant consumer (which PDF::API2 is) must try to read cross-reference stream if required, to check that "Version" entry, even if header says "1.4". It's a bit of a conundrum, I agree, but it's how things are.

Hi Vadim,

Possibly related, possibly not... Any utility that uses a Poppler/Cairo version from around 2009 fails with the following error:

    Error: PDF file is damaged - attempting to reconstruct xref table...
    Error: Couldn't find trailer dictionary
    Error: Couldn't read xref table

However, later versions read the PDF successfully. I wonder if they fixed a similar bug to what you describe?

Fri Nov 20 13:01:21 2020 PMPERRY [...] cpan.org - Correspondence added

A little bit of a side trip -- I have added code to PDF::Builder to validate the structure of the PDF being read, and in the process, pick up any Version override (so, for example, bad.pdf is recognized as PDF-1.5 and I don't unnecessarily flag the cross-reference stream as non-1.4). While I'm parsing the PDF, I'm looking at, among other things, Parent entries. I notice that in bad.pdf, object 24 lists objects 19-23 as its children (/Kids), but none of them list a /Parent (presumably back to 24). Is a Parent entry mandatory for a Kid (and possibly some other parent-child relationships), or is it optional? Does it always have to point back to the object who claims this object as its Kid, or is it legal to point somewhere else? I'm thinking of ticket 130722's afhacked2.pdf's object 4 declaring object 6 to be its child (/Kid) but so does object 9, and 6 points back to 9 as its Parent. That sounds fishy to me. The PDF 1.7 Reference sometimes calls a Parent mandatory, but then often omits a Parent from the example.

Sun Nov 22 07:36:50 2020 futuramedium [...] yandex.ru - Correspondence added

Christopher, more likely older versions of Poppler/Cairo didn't support 1.5 features.

Phil, "any" is literally ANY, width can be 1-2-3-4-8, but also 5-6-7-9-10-...1000-..., etc. to PDF architectural integer limit (2**32 - 1). What's an integer (byte offset of an object, in particular) so many bytes long -- it's beyond comprehension and hardware capabilities and practical requirements. See the H.21 end-note in "PDF Reference, sixth edition", which explicitly states that the Reference, itself, does not impose ANY limit on offset byte width. I don't know what you mean by "reading documentation correctly" and finding there allowed widths of 1-2-3-4-8.

So, a Reader, theoretically and nominally, must cope with any width, but practically -- see what I said in August about a fix, just a character insertion.

However, you raised a valid concern about 32-bit Perls compiled without "USE_64_BIT_INT", regardless of them being worth any effort. Then, again, I'm repeating myself, have a look at sister packages, how they handle the issues -- quite differently from each other (and PDF::API2), but BOTH can cope with ANY widths of arbitrary size, not just 1-2-3-4-8, and regardless of Perl being 32bit/64bit (of course, as long as integer to be decoded fits 32/64 bits, as applicable, -- i.e. byte string may have leading zeroes).

What follows is rather off-topic.

> what happens for just 'Q' (as the original code was)

The original code was tested, if ever, using big-endian CPU

> Why is 'Q' treated differently, and I need to give the byte order explicitly?

?? Because it's documented so. By design. How is that "Perl issue"? It's matter of POV -- the N/V (n/v) pairs are peculiar exception, all other relevant templates require explicit byte order modifier to work in portable manner.

> Is a Parent entry mandatory for a Kid

The Reference has comprehensive Index, I don't think there are any ambiguities where and which entries are required. There are trees of slightly different breeds. E.g., items of Pages Tree, Name Tree (your example), Outlines tree(-like structure) require (1) both Kids/Parent, (2) Kids only, (3) Parent only entries, respectively. Evolving standard (as PDF was) can finish eclectic, which is OK as long as everything is clearly documented. The "afhacked2.pdf", IIRC, was shown to be horribly broken EXACTLY w.r.t. parental relationship in a tree, why would you pick it up as example to investigate.

Sun Nov 22 15:00:02 2020 PMPERRY [...] cpan.org - Correspondence added

On Sun Nov 22 07:36:50 2020, vadimr wrote:

> Phil, "any" is literally ANY, width can be 1-2-3-4-8, but also 5-6-7-9-10-...1000-..., etc. to PDF architectural integer limit (2**32 - 1). What's an integer (byte offset of an object, in particular) so many bytes long -- it's beyond comprehension and hardware capabilities and practical requirements. See the H.21 end-note in "PDF Reference, sixth edition", which explicitly states that the Reference, itself, does not impose ANY limit on offset byte width. I don't know what you mean by "reading documentation correctly" and finding there allowed widths of 1-2-3-4-8.

I was probably thrown by the PDF::API2 implementation, which only allows 1, 2, 3, 4, and 8 byte widths. You're saying that /any/ width up to some enormous number of bytes is theoretically possible? As very little hardware out there probably handles >64 bit integers, a maximum width of 8 is probably adequate.

> So, a Reader, theoretically and nominally, must cope with any width, but practically -- see what I said in August about a fix, just a character insertion.

So, if some joker decides to provide a PDF with a cross-reference stream field width of 5 bytes (40 bit integer), PDF::API2 (and until I extend it, PDF::Builder) will choke on it? Even though it's legitimate? It shouldn't be too much trouble to handle 5, 6, and 7 byte width fields by padding with x00 bytes (as in the manner of width 3). >8 bytes is probably unreasonable for the next few years, until hardware (and Perl) catches up (i.e., a generation of 128 bit chips).

> However, you raised a valid concern about 32-bit Perls compiled without "USE_64_BIT_INT", regardless of them being worth any effort. Then, again, I'm repeating myself, have a look at sister packages, how they handle the issues -- quite differently from each other (and PDF::API2), but BOTH can cope with ANY widths of arbitrary size, not just 1-2-3-4-8, and regardless of Perl being 32bit/64bit (of course, as long as integer to be decoded fits 32/64 bits, as applicable, -- i.e. byte string may have leading zeroes).

As I've said before, I can check if the 33 leading bits for a 64 bit (after x00 padding 5, 6, or 7 byte fields) integer field are 0, and just decode the low 32 bits as 'N' format. If it's not an unsigned 32 bit value, we'll just have to throw it to unpack('Q>') and hope for the best that it's a 64 bit Perl. I understand that it will produce a smoking hole in the ground if it's a 32 bit Perl. I suppose I /could/ use some sort of "extended math" package to handle the field value as two 32 bit ints or four 16 bit ints, but I'm not sure it's worth the effort. Do you know how (in general terms) these "sister packages" handle 64 bit integers -- perhaps some sort of extended math?

> > what happens for just 'Q' (as the original code was)
>
> The original code was tested, if ever, using big-endian CPU

Very likely. A (forgivable) testing flaw in PDF::API2 (but how many people have both Big-Endian and Little-Endian machines available?).

> > Why is 'Q' treated differently, and I need to give the byte order explicitly?
>
> ?? Because it's documented so. By design. How is that "Perl issue"?

I just found it odd that unpack's Q is treated differently than N/V/n/v. Yes, that's a Perl issue that it was implemented that way (dependent upon the machine architecture unless explicitly overridden), and not a PDF issue (everything is Network (Big-Endian) order). Why didn't Perl give Q=unsigned Big-Endian 64 bit, q=Little-Endian, R=signed Big-Endian, and r=signed Little-Endian (or something similar)? q is signed, but Endian-ness still has to be explicitly given.

> It's matter of POV -- the N/V (n/v) pairs are peculiar exception, all other relevant templates require explicit byte order modifier to work in portable manner.
>
> > Is a Parent entry mandatory for a Kid
>
> The Reference has comprehensive Index, I don't think there are any ambiguities where and which entries are required. There are trees of slightly different breeds. E.g., items of Pages Tree, Name Tree (your example), Outlines tree(-like structure) require (1) both Kids/Parent, (2) Kids only, (3) Parent only entries, respectively. Evolving standard (as PDF was) can finish eclectic, which is OK as long as everything is clearly documented.

I've been using the ISO/Adobe final reference for PDF 1.7, 32000_2008.pdf. It has no index. Can you recommend a better PDF reference?

> The "afhacked2.pdf", IIRC, was shown to be horribly broken EXACTLY w.r.t. parental relationship in a tree, why would you pick it up as example to investigate.

I wanted an example of what appeared to be a real-life, in-the-wild, scrambled Parent/Kid relationships, for testing my new validation code. It /does/ flag it as an error, but does not attempt to correct or fix up anything. The validation code is meant to flag suspicious PDFs so that we don't waste so much time trying to fix PDF::API2/Builder "bugs" which are actually bad PDFs in the first place. Speaking of which, I have several hundred PDFs that I've accumulated over the years, and which I tested against. Many PDFs refer to objects (e.g., /Font 9 0 R) but there is no such object (9 0 obj) in the file. Many, if not all of those cases, appear to have the missing object on the Free List. I take it that it's OK to refer to an object that's on the Free List, and it will just be ignored?

Tue Nov 24 06:41:56 2020 futuramedium [...] yandex.ru - Correspondence added

> "sister packages"

In general terms:

    $_ = 'bytes';
    $i = hex unpack 'H'.(2*length), $_;                              # PDF::Tiny
    @b = unpack 'C*', $_; $i = 0; ($i <<= 8) += shift @b while @b;   # CAM::PDF

> ...index. Can you recommend a better PDF reference?

https://en.wikipedia.org/wiki/PDF leads to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

> I take it that it's OK to refer to an object that's on the Free List, and it will just be ignored?

See page 64, link above. If object definition is missing (I think it doesn't matter if number is in Free list), then reference refers to null object. If null object is allowed in particular place, definition absence appears to be "ignored". E.g. if /Font entry in graphics state dict is null, then no font is set by gs operator invocation, just wait until Tf operator. Otherwise, I guess it depends on severity, either "ignored" so as not to disturb a user, or reported as error.
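The two one-liners given "in general terms" at the top of this message can be wrapped into small functions to show that both decode a big-endian field of any width. This is a sketch in the spirit of those snippets, not the actual PDF::Tiny or CAM::PDF source; the function names are hypothetical, and the sample bytes reuse the value 0x1659 (5721), chosen purely as an example.

```perl
use strict;
use warnings;

# PDF::Tiny-style: hex-dump the bytes, then parse the hex string.
sub decode_via_hex {
    my ($bytes) = @_;
    no warnings 'portable';   # hex() warns for values above 0xffffffff
    return hex unpack('H*', $bytes);
}

# CAM::PDF-style: shift-and-add, one byte at a time, high-order byte first.
sub decode_via_shift {
    my ($bytes) = @_;
    my $value = 0;
    $value = ($value << 8) + $_ for unpack('C*', $bytes);
    return $value;
}

# The same value stored in 2-, 3-, and 5-byte fields (leading zero bytes).
for my $field ("\x16\x59", "\x00\x16\x59", "\x00\x00\x00\x16\x59") {
    printf "%d-byte field: %d / %d\n",
        length($field), decode_via_hex($field), decode_via_shift($field);
}
```

Neither approach depends on the field width matching an unpack template, which is why widths such as 5, 6, or 7 need no special casing. Both are still limited by the size of a Perl integer once the leading zero bytes run out.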

Tue Nov 24 08:35:26 2020 PMPERRY [...] cpan.org - Correspondence added

On Tue Nov 24 06:41:56 2020, vadimr wrote:

> > "sister packages"
>
> In general terms:
>
> $_ = 'bytes';
> $i = hex unpack 'H'.(2*length), $_; # PDF::Tiny
> @b = unpack 'C*', $_; $i = 0; ($i <<= 8) += shift @b while @b; # CAM::PDF

It looks like they're just doing the same thing as unpack('Q>'), but at a more primitive level. This still doesn't address the problem of what to do if the result doesn't fit in a 32-bit (4-byte) integer (i.e., it overflows -- does Perl switch to double without losing precision?).

BTW, I went ahead and 1) handled 5, 6, and 7 byte integers by left-padding with x00, and 2) check if the top 32 bits are x00 and if so, use unpack('N') on the lower half, only calling unpack('Q>') as a last resort (likely to blow up if not Perl-64). I think it will be very rare (for a few years still) to encounter a field that's actually more than 4 billion in value. If 32-bit Perl just handles that as a double, I'll have to revisit this conversion.

> > ...index. Can you recommend a better PDF reference?
>
> https://en.wikipedia.org/wiki/PDF leads to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

Thanks, that looks useful (at least, it has an index!).

> > I take it that it's OK to refer to an object that's on the Free List, and it will just be ignored?
>
> See page 64, link above. If object definition is missing (I think it doesn't matter if number is in Free list), then reference refers to null object. If null object is allowed in particular place, definition absence appears to be "ignored". E.g. if /Font entry in graphics state dict is null, then no font is set by gs operator invocation, just wait until Tf operator. Otherwise, I guess it depends on severity, either "ignored" so as not to disturb a user, or reported as error.

OK, a reference to an undefined object is usually just ignored, unless it leads to the gears getting jammed. I'll make sure that at worst, it's flagged as a 'note' (informational) message, not an error or warning (and thus normally will not be seen). Thanks for all the info!
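The conversion described earlier in this message (left-pad to 8 bytes, use 'N' when the top 32 bits are zero, fall back to 'Q>' otherwise) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code that went into PDF::Builder: the function name decode_xref_field is hypothetical, and the check on $Config{ivsize} is one possible way to detect a Perl without 64-bit integers.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: decode one big-endian cross-reference stream field
# of 1 to 8 bytes.
sub decode_xref_field {
    my ($bytes) = @_;
    my $width = length $bytes;
    die "unsupported field width $width\n" if $width < 1 || $width > 8;

    # Left-pad with zero bytes so every width takes the same path.
    my $padded = ("\x00" x (8 - $width)) . $bytes;

    # If the upper half is zero, the value fits in 32 bits and the fixed
    # big-endian 'N' template is enough, even on a 32-bit Perl.
    my ($high, $low) = unpack('N2', $padded);
    return $low if $high == 0;

    # Otherwise a 64-bit integer is needed; 'Q' is unavailable without one.
    die "value needs 64-bit integers, which this Perl lacks\n"
        unless $Config{ivsize} >= 8;
    return unpack('Q>', $padded);
}

print decode_xref_field("\x16\x59"), "\n";                 # 2-byte field
print decode_xref_field("\x00\x00\x00\x16\x59"), "\n";     # 5-byte field, small value
```

Dying explicitly on a 32-bit Perl trades the "smoking hole" for a readable message; a shift-and-add loop would be an alternative that avoids the 'Q' template altogether.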

Tue Nov 24 10:04:23 2020 futuramedium [...] yandex.ru - Correspondence added

> This still doesn't address the problem of what to do if the result doesn't fit in a 32-bit (4-byte) integer (i.e., it overflows -- does Perl switch to double without losing precision?)

I think a person who tries to open a 4 Gb PDF file with 32bit build of Perl is problem himself. Anyway PDF::API2 slurps its input and will fail long before dealing with hypothetical overflow, but the latter is easily investigated with a one-liner.