On Sun Nov 22 07:36:50 2020, vadimr wrote:
> Phil, "any" is literally ANY, width can be 1-2-3-4-8, but also 5-6-7-
> 9-10-...1000-..., etc. to PDF architectural integer limit (2**32 - 1).
> What's an integer (byte offset of an object, in particular) so many
> bytes long -- it's beyond comprehension and hardware capabilities and
> practical requirements. See the H.21 end-note in "PDF Reference, sixth
> edition", which explicitly states that the Reference, itself, does not
> impose ANY limit on offset byte width. I don't know what you mean by
> "reading documentation correctly" and finding there allowed widths of
> 1-2-3-4-8.
I was probably thrown by the PDF::API2 implementation, which only allows widths of 1, 2, 3, 4, and 8 bytes. You're saying that /any/ width, up to some enormous number of bytes, is theoretically possible? Since very little hardware out there handles integers wider than 64 bits, a maximum width of 8 is probably adequate.
> So, a Reader, theoretically and nominally, must cope with any width,
> but practically -- see what I said in August about a fix, just a
> character insertion.
So, if some joker decides to provide a PDF with a cross-reference stream field width of 5 bytes (a 40 bit integer), PDF::API2 (and, until I extend it, PDF::Builder) will choke on it, even though it's legitimate? It shouldn't be too much trouble to handle 5, 6, and 7 byte fields by left-padding them with \x00 bytes (in the same manner as width 3). Widths over 8 bytes are probably unreasonable for the next few years, until hardware (and Perl) catches up (i.e., a generation of 128 bit chips).
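To make the padding idea concrete, here is a rough sketch. It is written in Python purely for illustration and is not PDF::API2 or PDF::Builder code; the helper name `decode_field` is hypothetical. It assumes the field is big-endian and that the value itself fits in 64 bits, so any bytes beyond the low 8 must be leading zeros. Python's `>Q` struct code is the counterpart of Perl's `Q>` template.

```python
import struct


def decode_field(raw: bytes) -> int:
    """Decode a big-endian unsigned field of arbitrary byte width.

    Hypothetical helper: the value must fit in 64 bits, so a field wider
    than 8 bytes is accepted only if its extra leading bytes are all zero.
    """
    if len(raw) > 8:
        head, raw = raw[:-8], raw[-8:]
        if any(head):
            raise ValueError("field value does not fit in 64 bits")
    # Left-pad with zero bytes to exactly 8 bytes, then unpack big-endian.
    padded = raw.rjust(8, b"\x00")
    return struct.unpack(">Q", padded)[0]
```

With this, a 5-byte field such as `b"\x01\x00\x00\x00\x00"` decodes to 4294967296 (2**32), and a 3-byte field goes through the same path as every other width.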
> However, you raised a valid concern about 32-bit Perls compiled
> without "USE_64_BIT_INT", regardless of them being worth any effort.
> Then, again, I'm repeating myself, have a look at sister packages, how
> they handle the issues -- quite differently from each other (and
> PDF::API2), but BOTH can cope with ANY widths of arbitrary size, not
> just 1-2-3-4-8, and regardless of Perl being 32bit/64bit (of course,
> as long as integer to be decoded fits 32/64 bits, as applicable, --
> i.e. byte string may have leading zeroes).
As I've said before, I can check whether the 33 leading bits of a 64 bit integer field (after \x00 padding of 5, 6, or 7 byte fields) are 0, and if so just decode the low 32 bits with the 'N' format. If the value doesn't fit in an unsigned 32 bit integer, we'll just have to hand it to unpack('Q>') and hope that it's a 64 bit Perl. I understand that it will produce a smoking hole in the ground on a 32 bit Perl. I suppose I /could/ use some sort of "extended math" package to handle the field value as two 32 bit ints or four 16 bit ints, but I'm not sure it's worth the effort. Do you know how (in general terms) these "sister packages" handle 64 bit integers -- perhaps some sort of extended math?
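A rough sketch of the "two 32 bit halves" idea, again in Python for illustration only (the function names `split_halves` and `decode_offset` are hypothetical, and this is not how any of the packages mentioned is known to do it). Python's `>I` corresponds to Perl's 'N'. The sketch tests only the high 32 bits, which is enough for an unsigned 32 bit result; testing one more bit would additionally guarantee that the value fits a signed 32 bit integer.

```python
import struct


def split_halves(raw: bytes) -> tuple:
    """Split a big-endian field of up to 8 bytes into (high, low) 32 bit halves."""
    if len(raw) > 8:
        raise ValueError("this sketch handles at most 8 bytes")
    padded = raw.rjust(8, b"\x00")       # leading zero bytes, big-endian
    return struct.unpack(">II", padded)  # two unsigned 32 bit integers


def decode_offset(raw: bytes) -> int:
    high, low = split_halves(raw)
    if high == 0:
        return low                       # fits an unsigned 32 bit integer
    # Needs more than 32 bits. Python integers are unbounded, so combining
    # is trivial here; on a 32 bit Perl this is the step that would need
    # 64 bit support or an extended-math package.
    return (high << 32) | low
```

The point of the split is that the common case (high half zero) never touches a 64 bit operation at all; only genuinely large offsets reach the last line.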
> > what happens for just 'Q' (as the original code was)
>
> The original code was tested, if ever, using big-endian CPU
Very likely. A (forgivable) testing flaw in PDF::API2 (but how many people have both Big-Endian and Little-Endian machines available?).
> > Why is 'Q' treated differently, and I need to give the byte order
> > explicitly?
>
> ?? Because it's documented so. By design. How is that "Perl issue"?
I just found it odd that unpack's Q is treated differently from N/V/n/v. Yes, it's a Perl issue that Q was implemented that way (dependent upon the machine architecture unless explicitly overridden), and not a PDF issue (everything in PDF is in Network (Big-Endian) order). Why didn't Perl give Q=unsigned Big-Endian 64 bit, q=Little-Endian, R=signed Big-Endian, and r=signed Little-Endian (or something similar)? As it stands, q is the signed variant, but Endian-ness still has to be given explicitly.
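The same split between native-order and explicit-order 64 bit codes can be seen outside Perl. The following is a small Python illustration (not Perl, and not taken from either package): `>Q` and `<Q` play the role of Perl's `Q>` and `Q<`, while `=Q` uses the machine's native byte order, much like a bare `Q` in Perl.

```python
import struct
import sys

# The 8 bytes a PDF would store for the value 1 (big-endian, network order).
data = bytes([0, 0, 0, 0, 0, 0, 0, 1])

big = struct.unpack(">Q", data)[0]     # explicit big-endian: 1 on any machine
little = struct.unpack("<Q", data)[0]  # explicit little-endian: 2**56 on any machine
native = struct.unpack("=Q", data)[0]  # native order: result depends on the CPU

print(big, little, native, sys.byteorder)
```

On a little-endian machine the native result matches the little-endian one, which is why code relying on native order can pass its tests on big-endian hardware and still misread PDF data elsewhere.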
> It's matter of POV -- the N/V (n/v) pairs are peculiar exception, all
> other relevant templates require explicit byte order modifier to work
> in portable manner.
>
> > Is a Parent entry mandatory for a Kid
>
> The Reference has comprehensive Index, I don't think there are any
> ambiguities where and which entries are required. There are trees of
> slightly different breeds. E.g., items of Pages Tree, Name Tree (your
> example), Outlines tree(-like structure) require (1) both Kids/Parent,
> (2) Kids only, (3) Parent only entries, respectively. Evolving
> standard (as PDF was) can finish eclectic, which is OK as long as
> everything is clearly documented.
I've been using the ISO/Adobe final reference for PDF 1.7, 32000_2008.pdf. It has no index. Can you recommend a better PDF reference?
> The "afhacked2.pdf", IIRC, was shown to be horribly broken EXACTLY
> w.r.t. parental relationship in a tree, why would you pick it up as
> example to investigate.
I wanted an example of what appeared to be real-life, in-the-wild, scrambled Parent/Kid relationships for testing my new validation code. It /does/ flag the file as an error, but does not attempt to correct or fix up anything. The validation code is meant to flag suspicious PDFs so that we don't waste so much time trying to fix PDF::API2/Builder "bugs" which are actually bad PDFs in the first place.
Speaking of which, I have several hundred PDFs that I've accumulated over the years and tested against. Many PDFs refer to objects (e.g., /Font 9 0 R) for which there is no such object (9 0 obj) in the file. Many, if not all, of those cases appear to have the missing object on the Free List. I take it that it's OK to refer to an object that's on the Free List, and that the reference will just be ignored?