
This queue is for tickets about the PDF-API2 CPAN distribution.

Report information
The Basics
Id: 130722
Status: open
Priority: 0/
Queue: PDF-API2

People
Owner: Nobody in particular
Requestors: joe [...] printevolved.co.uk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Crash during ->openpage
Date: Thu, 17 Oct 2019 11:02:11 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
See my pull request here: https://github.com/ssimms/pdfapi2/pull/21

I experienced a crash caused by find_prop trying to bubble up to its parent when that parent was a PDF::API2::Basic::PDF::Objind:

    perl -MPDF::API2 -E ' my $p = PDF::API2->open("anon_file_328374.pdf"); $p->openpage(2)'
    Can't locate object method "find_prop" via package "PDF::API2::Basic::PDF::Objind" at /usr/local/share/perl/5.26.1/PDF/API2/Basic/PDF/Pages.pm line 271.

My pull request stops this from happening by checking that "parent" is capable of find_prop before calling it.

I can't supply the original data, unfortunately, because it's not mine, but I can test any alternative fixes if that helps.

Cheers,
Joe
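For readers following along, here is a minimal sketch of the kind of guard described above. It is not the code from the pull request: Demo::Pages and Demo::Objind are hypothetical stand-ins for the real classes, and only the idea (check that the parent is an object that can do find_prop before calling it) comes from the report.

```perl
use strict;
use warnings;
use Scalar::Util ();

# Hypothetical stand-in for a page-tree node class (not the real PDF::API2 class).
package Demo::Pages {
    sub new {
        my ($class, %fields) = @_;
        return bless {%fields}, $class;
    }

    # Return $prop from this node, otherwise ask the parent, but only if the
    # parent is an object that actually implements find_prop.
    sub find_prop {
        my ($self, $prop) = @_;
        return $self->{$prop} if defined $self->{$prop};

        my $parent = $self->{'Parent'};
        if (Scalar::Util::blessed($parent) && $parent->can('find_prop')) {
            return $parent->find_prop($prop);
        }
        return;    # no usable parent: the upward search ends here
    }
}

# Hypothetical stand-in for an object class that has no find_prop method.
package Demo::Objind {
    sub new { return bless {}, shift }
}

package main;

my $root   = Demo::Pages->new(Rotate => 90);
my $good   = Demo::Pages->new(Parent => $root);
my $orphan = Demo::Pages->new(Parent => Demo::Objind->new);

my $inherited = $good->find_prop('Rotate');      # found on the parent
my $missing   = $orphan->find_prop('Rotate');    # undef instead of dying
```

Note that the guard ends the search quietly; it does not tell the caller that the parent was of an unexpected class.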
I looked at this in PDF::Builder, but without a failing PDF file to try, I'm reluctant to put your fix in my code. It doesn't appear that the fix could cause any problem, BUT, without understanding WHY the property isn't found (??) in the first place, this may be just slapping a Band-Aid over a machete wound.

If the basic problem is that the PDF is defective in calling for some "property" but failing to include that property (??), it might be good to also issue a warning to the user, so they're aware that they're working with a defective PDF. I presume that this PDF loads OK in something like Acrobat Reader? Such readers often do a lot of fixup and will happily load a bad PDF file without saying anything.

Finding out WHY it fails to find the find_prop() method (rather than a missing property) would require a defective PDF that triggers the error, so it could be traced through. Is there any chance you could provide a public test case?

I ran the t-test on my existing PDF::Builder implementation (without your patch), and got

    1..2
    not ok 1 - Did not explode
    #   Failed test 'Did not explode'
    #   at desktop\rt130722.t line 17.
    # Can't locate object method "find_prop" via package "PDF::Builder::Basic::PDF::Objind" at C:/Strawberry/perl/site/lib/PDF/Builder/Basic/PDF/Pages.pm line 340.
    ok 2 - Did opt get anything back
    # Looks like you failed 1 test of 2.

However, running the two-line Perl code you gave, with a known good 12 page PDF, did not result in any messages.

By the way, what PDF::API2 version, running under what Perl version?
Subject: Re: [rt.cpan.org #130722] Crash during ->openpage
Date: Fri, 18 Oct 2019 21:58:03 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
> Is there any chance you could provide a public test case?
I'll see what I can do. Sadly, the PDF that causes the problem belongs to a customer. Also, when I tried processing it, that fixed the problem, so it's tricky. I'll have a fish and see if I can get the thing to fail with something I can give you. I have no doubt it's the PDF at fault. I don't know about Acrobat offhand, but it loads in evince. I could have a look at loading it with PDFBox in Java or something like that. It'd be nice if it generated a helpful error message that would illuminate things.
> By the way, what PDF::API2 version, running under what Perl version?
In production I am running Perl v5.20.2 + PDF::API2 2.033. However I also reproduced it with Perl 5.26.1 and the latest github HEAD PDF::API2 from before my fork.
Subject: Re: [rt.cpan.org #130722] Crash during ->openpage
Date: Fri, 18 Oct 2019 22:34:15 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
By carefully hand-deleting objects I've managed to get the attached file to a state where:

1. It's not got anything much in it.
2. But it still shows fine in a PDF viewer.
3. And it still crashes in the same way when I run:

    perl -MPDF::API2 -E 'PDF::API2->open("afhacked.pdf")->openpage(2)'

I hope this helps?

Thanks,
Joe
Download afhacked.pdf
application/pdf 26.1k


Hmm. If I take your afhacked.pdf and open it in Adobe Acrobat Reader, it does modify (fix) it, as it asks me to save it when closing the Reader. However, when I run your sample program (the openpage), it gives me:

    C:\Users\Phil\Desktop>rt130722.pl
    Can't call method "realise" on an undefined value at C:/Strawberry/perl/site/lib/PDF/Builder.pm line 475.

That appears to be a different error from the one you're reporting (your patch is not installed). Can you confirm that this PDF still produces the find_prop error in unpatched PDF::API2? I would prefer to find what the problem is in the PDF, and report that back to the user, rather than just blindly suppressing the error (either the find_prop or this "realise" error).

Anyway, it's in open_scalar(), and is trying to realise() on the opened PDF's 'Root' member, which is apparently undefined. If I can track it down, the fix might be different for Builder than API2, but might still provide help in fixing API2.

Looking at the PDF file in gVim, I see that it was produced with InDesign, possibly as a number of single pages that were then combined? There's a lot of binary data, which I might be able to uncompress with PDFtk, although if there is an error in the PDF, that could corrupt the uncompression of the streams.

As our code has apparently diverged somewhat, Steve might prefer taking any Builder-specific discussion over to PDF::Builder (https://github.com/PhilterPaper/Perl-PDF-Builder/issues/108 or https://www.catskilltech.com/forum/open-bugs/rt-130722-crash-during-gtopenpage/msg702/).
Subject: Re: [rt.cpan.org #130722] Crash during ->openpage
Date: Sat, 19 Oct 2019 09:34:59 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
Sorry, I think I've confused the issue. I now get a "Malformed xref in PDF file" error from that file. I'll have to try again to make a safe-to-send version. I could swear I double-checked this yesterday, but the evidence is against me. I'll see what I can do. You might have given me a steer with your idea of uncompressing it; that might make picking out the sensitive parts easier.
Subject: Re: [rt.cpan.org #130722] Crash during ->openpage
Date: Sat, 19 Oct 2019 10:05:53 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
Okay, I think this actually achieves what I said I'd done yesterday: no sensitive data, and it fails with the version of PDF::API2 included in my Linux distro:

    perl -MPDF::API2 -E 'PDF::API2->open("afhacked2.pdf")->openpage(2)'
    Can't locate object method "find_prop" via package "PDF::API2::Basic::PDF::Objind" at /usr/local/share/perl/5.26.1/PDF/API2/Basic/PDF/Pages.pm line 271.

(that being perl v5.26.1 and PDF::API2 2.036)

With my patched version (although my hackery seems to have added an extra warning):

    perl -I ~/code/perl/pdfapi2/lib/ -MPDF::API2 -E 'PDF::API2->open("afhacked2.pdf")->openpage(2)'
    Use of uninitialized value $dat in string at /home/joe/code/perl/pdfapi2/lib/PDF/API2/Basic/PDF/Filter/FlateDecode.pm line 42.

This time I removed the objects that I think contained the customer data and used qpdf to make the PDF vaguely usable again. Whatever qpdf is doing doesn't destroy the actual problem, which in this case is good.

Sorry for all the dancing around.
Download afhacked2.pdf
application/pdf 7.5k


OK, I'm back to the "find_prop" error with the new PDF. I'll try to poke around and see what I can find, but no promises of when I can get to it.
The Rotate property appears to be uncommon in PDF::API2, although it is possible to add it (rotate method). Your PDF has it in the top level of the page (type Page), which is a child of type Pages. Every openpage() is going to look for the Rotate property, so this code must have been exercised many times before without error. There must be something unusual about the structure of this particular PDF, at least compared to what PDF::API2 is used to working with.

I don't think it's the presence (or absence) of the Rotate property itself, as the call to find_prop() is failing to find the method in the first place. It looks like some odd class hierarchy where somewhere up the chain of recursively looking (in the Parent) for a property, a parent class doesn't support find_prop. In this case, we are in the proper class the first time (we DO get to run find_prop), but its immediate parent's class lacks that method. I'm guessing that our starting class level is somewhere below Page (which has the Rotate property).

While your patch DOES sidestep the problem (by terminating the upward search early), I'm still concerned about whether it's the right approach. It may be valid if the basic problem is that an unexpected class is the Parent (one that doesn't support find_prop), but I'd be more comfortable if Steve could give his thoughts on it. We can see that the Page includes the Rotate property (value 0), so where are we starting to look, and failing before we get to this Page object (if that's what's happening)?

I'm going to set this aside for now, until there's been some more discussion. If nothing happens for a long time, I'll go ahead and put it in PDF::Builder (it's better than nothing, although I'm not comfortable that I fully understand the root causes).
Subject: Re: [rt.cpan.org #130722] Crash during ->openpage
Date: Mon, 28 Oct 2019 09:34:23 +0000
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
Thanks very much for spending time on this. I suppose how valid this situation is is kind of the nub of whether dodging it (like I did) or throwing a clearer error is the better solution. Almost every PDF tool I tried on the original tolerated it on load, but of course almost every PDF tool (except qpdf) also corrected it on save. I don't (yet) understand enough about this level of PDF structure to contribute much, but while you are waiting for clarity on this, I might use it as an excuse to dig deeper. Thanks again.
The question is then whether this shows a flaw in the PDF (in which case an error message should be given, and fixup done if possible), or whether PDF::API2 is mishandling a legitimate situation and needs to be fixed. You say that many readers fix up and save the PDF, so it sounds more like the former (PDF flaw). In that case, what are we looking for (to flag) in the PDF? Is encountering a class with missing find_prop() enough? What is the error message that should be output (something that a user could take action on)? Is there any fixup that can be done? This is getting a bit over my head, with respect to my PDF skills and knowledge.
Please note, these PDF files are invalid. In "afhacked2.pdf", the "Pages" root is object 4, and the "Parent" of the 1st page refers, correctly, to object 4. But the "Parent" of the 2nd page refers to a rogue dictionary object 9, which, though it happens to have the same content as object 4, doesn't belong to the pages tree. So it's not the 2nd page's legitimate parent, but an impostor. Same picture, different object numbers, in "afhacked.pdf".

I asked someone to export a 2-page PDF file from InDesign for me, and of course it has no such issues. I strongly suspect there was some messing around with internals, copy-pasting or similar, though it may be a bug in a particular InDesign version.

Why did no other PDF consumer complain or even notice? Well, for example, the offending line, instead of:

    { return $self->{'Parent'}->find_prop($prop); }

could be:

    { return find_prop($self->{'Parent'}, $prop); }

i.e. valid procedural Perl, and the search for a "property" would bubble up as intended, as opposed to the aborted search under the proposed patch. It all depends on the internal world-view and hierarchy of the other apps. I didn't investigate thoroughly, but, staying with Perl:

    perl -MCAM::PDF -E "CAM::PDF->new('afhacked2.pdf')->_buildNameTable(2)"

is OK. At first glance, this method bubbles up looking for resources, the same as "find_prop". It just happens that the Pages root and intermediate nodes don't have additional magic attached to them, as they do in PDF::API2. That's neither good nor bad, just a design decision and simply a given, in my opinion. The lib serves its purpose.

I can think of hundreds(?) of ways to break a "valid" PDF file by tinkering at a low level, so that maybe even Adobe Reader would crash ungracefully. Perhaps it's too "defensive" to anticipate everything that occurs in a tiny percentage of cases; sometimes try/catch on the library user's side is OK. Maybe I'm wrong.

And, yes, I did my lot of in-house patching to deal with in-flow of broken PDF files from same "dear" customers. It's just these patches are not universal.
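To illustrate the method-call versus plain-function point, here is a small sketch under stated assumptions. Demo::Node and Demo::Foreign are hypothetical classes, and nodes are modelled as ordinary blessed hashes; in the real library an indirect object may first have to be read from the file before its keys are available, which is not modelled here.

```perl
use strict;
use warnings;

# Hypothetical node class that belongs to the page tree.
package Demo::Node {
    sub new { my ($class, %fields) = @_; return bless {%fields}, $class }
}

# Hypothetical "impostor" class: same kind of data, but no find_prop method.
package Demo::Foreign {
    sub new { my ($class, %fields) = @_; return bless {%fields}, $class }
}

package main;

# Plain function: treats every node as a hash, whatever class it is blessed into.
sub find_prop {
    my ($node, $prop) = @_;
    return unless ref $node;
    return $node->{$prop} if defined $node->{$prop};
    return find_prop($node->{'Parent'}, $prop);
}

my $impostor = Demo::Foreign->new(Rotate => 180);
my $page     = Demo::Node->new(Parent => $impostor);

# Procedural lookup bubbles up through the impostor parent without complaint.
my $via_function = find_prop($page, 'Rotate');

# Method-call lookup on the same parent dies, because the class lacks the method.
my $method_error = eval { $page->{'Parent'}->find_prop('Rotate'); 1 } ? '' : $@;
```

A real implementation would also want to guard against a Parent chain that loops back on itself, which this sketch does not do.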
Subject: Re: [rt.cpan.org #130722] Crash during ->openpage
Date: Thu, 7 Nov 2019 11:00:06 +0000
To: bug-PDF-API2 [...] rt.cpan.org
From: joe <joe [...] printevolved.co.uk>
> And, yes, I did my lot of in-house patching to deal with in-flow of broken PDF files from same "dear" customers. It's just these patches are not universal.
I can understand that point of view 100%. I've always thought this was a faulty PDF. The only reason we tried to get this in is A) we have patched this on our side so we'd ideally not like to stay on a divergent path, but also B) because acrobat & everything else we tried worked with this PDF and PDF::API2 did not. I know that's fairly arbitrary! In the long term I'm not even sure we'll see another PDF broken this way. Given all this, I'd not be put out at all if you closed my bug and my pull request; I just thought it worth raising in case wiser minds than mine found any value here. I don't think that's the case, except that, as I say, the entropy of the universe can make a PDF which 90% of PDF parsers will read and fix and PDF::API2 crashes on.
> acrobat & everything else we tried worked with this PDF and PDF::API2 did not.
The nub of this is: what is "works with"? If a Reader (like Acrobat) FIXES UP a broken PDF (and usually, asks permission to save it), that indicates that the original PDF is indeed defective. I would not call that "works with" ("tolerates", perhaps). At best, everything but PDF::API2 (and Builder) successfully fix a broken PDF. I will be happy to add detection (with message) and fixup to PDF::Builder if I can find what the problem is (just a bad parent? always blame parenting!).
> make a PDF which 90% of PDF parsers will read and fix and PDF::API2 crashes on.
Again, a broken PDF is a broken PDF. It should not be silently passed or fixed up, but the user should at least be notified that something fishy was found (in case further problems arise from either the flaw or the fix). Vadim's parental fixup is OK, *if* at first the error can be found and reported (perhaps check links to all Kids, and see if all report back to the same Parent?).
Following Vadim's lead, I followed the child/parent links through afhacked2.pdf. It's a mess. Object 4 (Pages) points to its children, Page objects 5 and 6. Object 5 points back to 4 as its Parent, while 6 points back to 9! It should also be 4.

I changed object 6's Parent from 9 to 4, and got rid of objects 9 and 13, which weren't used by anyone. When run with your openpage() code, I got a complaint about an undefined $dat in FlateDecode.pm, so I got rid of /Contents 8 0 R in object 6, and object 8 (a FlateDecode Filter of Length 0). Now it runs without error message.

So, what happened to this PDF? If this was just the result of someone manually editing a PDF file, or a home-brew modifier, I'm reluctant to add any code to check for these link problems, as it's probably a one-off. On the other hand, if a commercial product produced the bad PDF (and no hand editing was done), it might be worth adding checks and fixes for this problem, as it will likely be fairly widespread.

Anyway, could you look inside the original PDF with an editor (such as gVim) and follow the Kids and Parents links, and see where they went off the rails? Is this PDF untouched from the time some commercial product produced it, or has someone messed with it?
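As a rough sketch of what a Kids/Parent consistency check might look like, the structure described above can be modelled with plain hashes keyed by object number. This is an illustration only, not PDF::API2 or PDF::Builder code: %objects, check_page_tree and the message wording are all hypothetical, and the Kids entry on object 9 is an assumption based on the remark that it has the same content as object 4.

```perl
use strict;
use warnings;

# Simplified model of the page tree in afhacked2.pdf, keyed by object number.
my %objects = (
    4 => { Type => 'Pages', Kids => [5, 6] },
    5 => { Type => 'Page',  Parent => 4 },
    6 => { Type => 'Page',  Parent => 9 },       # should point back to 4
    9 => { Type => 'Pages', Kids => [5, 6] },    # assumed copy of object 4, not in the tree
);

# Walk down from $num and record every kid whose Parent does not point back
# at the node that lists it in Kids.
sub check_page_tree {
    my ($objects, $num, $problems, $seen) = @_;
    $problems //= [];
    $seen     //= {};
    return $problems if $seen->{$num}++;         # do not loop on cyclic trees

    my $node = $objects->{$num} or return $problems;
    for my $kid_num (@{ $node->{Kids} || [] }) {
        my $kid = $objects->{$kid_num};
        if (!$kid) {
            push @$problems, "object $kid_num is missing";
            next;
        }
        my $parent = $kid->{Parent} // 'none';
        push @$problems, "object $kid_num: Parent is $parent, expected $num"
            if $parent ne $num;
        check_page_tree($objects, $kid_num, $problems, $seen);
    }
    return $problems;
}

my $problems = check_page_tree(\%objects, 4);
print "$_\n" for @$problems;    # prints: object 6: Parent is 9, expected 4
```

A check like this could back either a warning to the user or a fixup that re-points the Parent, depending on which policy is preferred.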