Bug #106020 for PDF-API2: Bug with recognizing PDF files via open_scalar

Wed Jul 22 09:56:37 2015 dearly [...] scenariolearning.com - Ticket created

Subject:	Bug with recognizing PDF files via open_scalar
Date:	Wed, 22 Jul 2015 09:56:23 -0400
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Douglas Early <dearly [...] scenariolearning.com>

Ran into this bug when attempting to import a large(ish) number of PDFs stored as scalar data in a database, about 72 PDFs in all. Most of them work fine, but a few are not recognized as valid PDFs despite rendering just fine in browsers or with Acrobat. The error message is as follows:

    GLOB(0xd837530) not a PDF file version 1.x at /home/dearly/git-working/document/Document/script/../local/lib/perl5/PDF/API2/Basic/PDF/File.pm at line 241

The head of the file (retrieved in the variable buffer) looks like this:

    %PDF-1.4 ▒P2 0 obj <</Length 3 0 R /Filter /FlateDecode >> stream
    Q0T0BC3c#c3▒▒\▒>y
    endstream endobj 3 0 obj 31 endobj 4 0 obj <</Width 2544 /Height 3300 /BitsPerComponent 1 /Subtype /Image /Type /XObject /ColorSpace/DeviceGray /Lengf32b8e','8867a55c-5513-4bce-b2dd-700950cee8cb'

I noticed that removing the $cr variable from the regex on line 240 that tests for validity allows the file to pass. Perhaps $cr needs to be amended or simply removed from the regex pattern?

Cheers,
--
Doug Early
Software Developer
Scenario Learning
o. 800.434.0154
f. 513.366.4074
ScenarioLearning.com <http://scenariolearning.com/>
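For readers following along, the behaviour described above can be sketched roughly as follows. This is a Python illustration, not the Perl code in File.pm: the `EOL` pattern is only an assumed stand-in for what `$cr` might contain, the names `STRICT_HEADER`, `LENIENT_HEADER`, and `header_version` are hypothetical, and the sample bytes are made up to resemble the header quoted in the report.

```python
import re

# Assumed stand-in for an end-of-line pattern like $cr:
# optional whitespace followed by CR, LF, or CRLF.
EOL = rb"\s*(?:\r\n|\r|\n)"

# Header check that insists on an end-of-line after the version.
STRICT_HEADER = re.compile(rb"^%PDF-1\.(\d+)\s*" + EOL)

# Same check with the end-of-line requirement removed.
LENIENT_HEADER = re.compile(rb"^%PDF-1\.(\d+)")


def header_version(buf, pattern):
    """Return the minor version if the buffer matches, else None."""
    match = pattern.match(buf)
    return int(match.group(1)) if match else None


# Illustrative bytes only: a header followed by a newline, and a header
# followed directly by other data on the same line.
well_formed = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n2 0 obj\n"
no_eol = b"%PDF-1.4 \x92P2 0 obj <</Length 3 0 R"

print(header_version(well_formed, STRICT_HEADER))
print(header_version(no_eol, STRICT_HEADER))
print(header_version(no_eol, LENIENT_HEADER))
```

Under these assumptions, the second buffer is rejected only by the pattern that requires an end-of-line, which is consistent with the observation that dropping `$cr` lets the file through.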

Wed Jul 22 12:26:44 2015 steve [...] deefs.net - Correspondence added

Hi Doug,

Interesting. In this case, the error message is correct -- according to the spec (section 7.5.2), the first line of the file may only contain the header (%PDF-1.#), and the second line needs to be a comment line if the file contains any characters outside the 7-bit ASCII set (I'm pretty sure PDF::API2 doesn't check for that), so the file isn't a valid PDF.

If changing the regex works for you, feel free to keep the change, but I would expect the file to have issues (perhaps not obvious ones) in other readers as well, given what you've shown me.

Steve
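The two header conditions mentioned in this reply could be checked along these lines. This is a rough Python sketch based only on the description in this thread, not on PDF::API2 itself; `inspect_header` is a hypothetical name, and the threshold of four high bytes is taken from the spec wording quoted later in the ticket.

```python
import re


def inspect_header(data):
    """Return (header_alone, binary_comment) for the first two lines.

    header_alone: the first line holds nothing but %PDF-1.<digits>.
    binary_comment: the second line is a comment ('%' prefix) holding
    at least four bytes outside the 7-bit ASCII range.
    """
    # Split on CRLF, CR, or LF; only the first two lines matter here.
    lines = re.split(rb"\r\n|\r|\n", data, maxsplit=2)
    first = lines[0]
    second = lines[1] if len(lines) > 1 else b""

    header_alone = re.fullmatch(rb"%PDF-1\.\d+", first) is not None
    high_bytes = sum(1 for byte in second if byte >= 128)
    binary_comment = second.startswith(b"%") and high_bytes >= 4
    return header_alone, binary_comment
```

A file whose first line carries extra data after the version, as in the reported sample, would fail the first condition regardless of what follows.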

Wed Jul 22 12:26:44 2015 The RT System itself - Status changed from 'new' to 'open'

Wed Jul 22 12:26:48 2015 steve [...] deefs.net - Status changed from 'open' to 'rejected'

Thu Jul 23 09:06:14 2015 dearly [...] scenariolearning.com - Correspondence added

Subject:	[rt.cpan.org #106020]
Date:	Thu, 23 Jul 2015 09:05:47 -0400
To:	bug-pdf-api2 [...] rt.cpan.org
From:	Douglas Early <dearly [...] scenariolearning.com>

Good Morning!

Thanks for the swift reply. I think for now I will simply leave in the modification I mentioned previously. For our use case (managing an inventory of chemical safety data sheets), we are more concerned with being able to stitch and display the PDFs we've been given than with strict conformance.

Would there be any interest in a submitted patch that would allow the PDF::API2 object to be instantiated with a validation setting that controls how strictly the PDF is validated? Perhaps something akin to

    validation => 'strict'

or

    validation => 'compatibility'

I ask because the language in the PDF spec regarding the second line:

    If a PDF file contains binary data, as most do (see Section 3.1, “Lexical Conventions”), it is recommended that the header line be immediately followed by a comment line containing at least four binary characters

makes it sound like the second line is a suggestion rather than a hard requirement (unlike the first line, which is fairly explicit in being non-optional).

I would be willing to code and submit a patch for that if you feel that it would contribute something to the project.

Cheers,
--
Doug Early
Software Developer
Scenario Learning
o. 800.434.0154
f. 513.366.4074
ScenarioLearning.com <http://scenariolearning.com/>
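To make the proposal concrete, a validation setting of this kind might behave roughly as sketched below. This is a Python illustration of the idea only: no such option is known to exist in PDF::API2, and the function name `pdf_minor_version`, the two patterns, and the mode names are assumptions modelled on the wording above.

```python
import re

# Strict: the version must be followed by an end-of-line.
STRICT = re.compile(rb"^%PDF-1\.(\d+)[ \t]*(?:\r\n|\r|\n)")

# Compatibility: only the %PDF-1.x prefix is required.
COMPAT = re.compile(rb"^%PDF-1\.(\d+)")


def pdf_minor_version(data, validation="strict"):
    """Return the minor version, checking the header per the chosen mode."""
    patterns = {"strict": STRICT, "compatibility": COMPAT}
    if validation not in patterns:
        raise ValueError(f"unknown validation mode: {validation!r}")
    match = patterns[validation].match(data)
    if match is None:
        raise ValueError("not a PDF file version 1.x")
    return int(match.group(1))
```

With this shape, strict checking stays the default, and callers who only need to stitch and display files they have been given can opt into the looser check explicitly.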
