Bug #42819 for CAM-PDF: parseStream() fix for unrecognized stream tag

Tue Jan 27 01:51:50 2009 vbmetta [...] hotmail.com - Ticket created

Subject:	parseStream() fix for unrecognized stream tag
Date:	Tue, 27 Jan 2009 06:48:11 +0000
To:	<bug-cam-pdf [...] rt.cpan.org>
From:	Vonne Bannavong <vbmetta [...] hotmail.com>

CAM::PDF->parseStream() matches the beginning of a stream with

    my $begin = shift || qr/ stream\r?\n /xms;

Customer account statements provided by a major stock brokerage as PDF-1.2 files, produced using a commercial package

    /Author (Xenos, inc.)
    /CreationDate ()
    /Creator (M2PD API Version 4.0.01, build$Oct 17 2002, 10:52:03$)
    /Producer (PDFOUT v3.8t by Xenos, inc.)

begins streams with

    qr/ stream\s\n /xms

These files are readily viewed by a variety of PDF viewers, so they are probably compliant with the PDF spec. However CAM::PDF fails to read these files, stopping at the stream tag. This issue is corrected by changing the above line of code in parseStream() to

    my $begin = shift || qr/ stream(\r?|\s)\n /xms;

The whitespace before the newline probably generalizes to \s*

I hope you can apply this patch to CPAN.
I like your package - thanks.

Robert Lacroix
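For readers following along, here is a minimal Perl sketch of the mismatch described in the report. The sample strings are made up for illustration (they are not taken from the brokerage files), and starts_stream() is a hypothetical helper, not part of CAM::PDF.

```perl
use strict;
use warnings;

# Default start-of-stream pattern as quoted in the report.
my $begin = qr/ stream\r?\n /xms;

# Hypothetical helper (not a CAM::PDF function): true if $text contains
# a "stream" keyword that the pattern accepts.
sub starts_stream {
    my ($text) = @_;
    return $text =~ $begin ? 1 : 0;
}

# Made-up inputs; only the bytes right after "stream" differ.
my @samples = (
    [ 'LF only'       => "<< /Length 5 >>\nstream\nhello\nendstream" ],
    [ 'CR LF'         => "<< /Length 5 >>\nstream\r\nhello\nendstream" ],
    [ 'space then LF' => "<< /Length 5 >>\nstream \nhello\nendstream" ],
);

for my $sample (@samples) {
    my ($name, $text) = @{$sample};
    printf "%-14s %s\n", $name,
        starts_stream($text) ? 'matches' : 'does not match';
}
```

Only the third sample, with a space between the keyword and the line feed, is rejected. That is the situation reported for the Xenos-produced files.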


Tue Jan 27 22:22:05 2009 chris+rt [...] chrisdolan.net - Status changed from 'new' to 'rejected'

Tue Jan 27 22:22:11 2009 chris [...] chrisdolan.net - Correspondence added

Subject:	Re: [rt.cpan.org #42819] parseStream() fix for unrecognized stream tag
Date:	Tue, 27 Jan 2009 21:20:38 -0600
To:	bug-CAM-PDF [...] rt.cpan.org
From:	Chris Dolan <chris [...] chrisdolan.net>

Robert,

I thank you for the thoughtful comment, but your patch is not correct. If I applied that patch, then streams that started with a "\n" would be missing their initial character(s). The PDF specification is very clear on this issue:

    "The keyword 'stream' that follows the stream dictionary should be followed by an end-of-line marker consisting of either a carriage return and a line feed or just a line feed, and not by a carriage return alone. The sequence of bytes that make up a stream lie between the 'stream' and 'endstream' keywords; the stream dictionary specifies the exact number of bytes. It is recommended that there be an end-of-line marker after the data and before 'endstream'; this marker is not included in the stream length."

    (PDF 1.7 reference, pp 60-61)

You are obviously welcome to use the patch in your own code, but I recommend that you check the results carefully because, given random binary streams, your patch will work in 255 of every 256 cases.

Chris
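The concern about losing initial bytes can be sketched as follows. This illustration uses the \s* generalization mentioned at the end of the report. The input and the data_after() helper are hypothetical, chosen only to show what happens when the stream data itself begins with a line feed.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything after the first start-of-stream
# match, or undef if the pattern does not match at all.
sub data_after {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? substr($text, $+[0]) : undef;
}

my $strict  = qr/ stream\r?\n /xms;   # pattern quoted in the report
my $relaxed = qr/ stream\s*\n /xms;   # the "\s*" generalization

# Made-up stream whose four data bytes are "\nABC".
my $pdf = "<< /Length 4 >>\nstream\n\nABC\nendstream";

for my $case ([ strict => $strict ], [ relaxed => $relaxed ]) {
    my ($name, $pattern) = @{$case};
    my $data = substr(data_after($pattern, $pdf), 0, 4);
    (my $shown = $data) =~ s/\n/\\n/g;   # make line feeds visible
    print "$name: first four bytes after the keyword are '$shown'\n";
}
```

Because \s also matches "\n", the relaxed pattern backtracks to the last line feed in the run and swallows the first data byte. The strict pattern stops at the first line feed and leaves the data intact.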

Tue Jan 27 22:22:12 2009 The RT System itself - Status changed from 'rejected' to 'open'

Fri Jan 30 04:32:10 2009 vbmetta [...] hotmail.com - Correspondence added

Subject:	RE: [rt.cpan.org #42819] parseStream() fix for unrecognized stream tag
Date:	Fri, 30 Jan 2009 09:27:25 +0000
To:	<bug-cam-pdf [...] rt.cpan.org>
From:	Vonne Bannavong <vbmetta [...] hotmail.com>

Chris,

You're absolutely right - I overlooked that \n is matched by \s. Thanks, I'd rather it works 100% of the time. What we really need to match is "whitespace except \n".

I looked up the PDF 1.7 spec online in Adobe's ISO-approved copy of ISO 32000-1:2008 found at http://www.adobe.com/devnet/pdf/pdf_reference.html

Your referenced passage occurs in section "7.3.8 Stream Objects" ... "7.3.8.1 General". However the wording is slightly different from the version you quoted. In fact it makes the end-of-line before the stream data bytes a requirement. I quote (and add emphasis on notable differences):

    The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes. There should be an end-of-line marker after the data and before endstream; this marker shall not be included in the stream length. There shall not be any extra bytes, other than white space, between endstream and endobj.

    http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf (C) Adobe Systems Incorporated 2008 (retrieved 2009-01-29)

Section 4.46 says that whitespace separates PDF syntactic constructs, and newlines are included in whitespace. What we have with the "stream" keyword is a special case where the first newline that appears after it, in addition to being whitespace in relation to keywords, also acts as the delimiter for the data bytes of the stream. My assertion is that before hitting the newline, PDF syntax is skipping whitespace in keyword context, and also looking for that first newline to start the data stream.
I've had no reason to ever consider this, except that now I've seen one commercial PDF writer running with this interpretation (or maybe it was just a stupid mainframe character conversion... \r into space; who knows?).

Implementations aside, I would bet that the PDF spec designers would agree that PDF parsing returns to keyword scanning mode after the stream's byte count is exhausted, even though the only valid keyword is "endstream". The section quoted above hints at this, saying a newline is ignored if it appears before "endstream", and allowing any whitespace before "endobj". The section's encouragement to add an end-of-line after the stream data and before "endstream" is motivated by the desire to aid in recovery of corrupted PDF files (I saw it somewhere else in the spec but now evince !~ /.*/).

Getting back to the regex we need, section 4.46 lists valid whitespace characters:

    TAB (09h)
    LINE FEED (0Ah)
    FORM FEED (0Ch)
    CARRIAGE RETURN (0Dh)
    and the venerable SPACE (20h)

but we want to exclude LINE FEED, so how about

    my $begin = shift || qr/ stream[\f\r[:blank:]]*\n /xms;

(since [:blank:] is equivalent to [ \t]).

I tried it and it works on a few largish PDFs plus these formerly unreadable statements. Do you have any specially constructed unit test cases?

Robert
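A minimal sketch of the "whitespace except \n" idea, with the character class spelled out explicitly as space, tab, form feed and carriage return. It is an illustration under the assumptions stated in the message, not necessarily the pattern that CAM::PDF ended up shipping. The stream_data() helper and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Optional whitespace other than LINE FEED, then the LINE FEED that
# delimits the stream data. Under /x a space inside [...] stays literal.
my $begin = qr/ stream[ \t\f\r]*\n /xms;

# Hypothetical helper: return $length bytes of stream data, or undef
# if no start-of-stream marker is found.
sub stream_data {
    my ($text, $length) = @_;
    return undef unless $text =~ $begin;
    return substr($text, $+[0], $length);
}

# Made-up inputs: [ label, text, declared /Length ].
my @samples = (
    [ 'LF only',        "stream\nhello\nendstream",   5 ],
    [ 'CR LF',          "stream\r\nhello\nendstream", 5 ],
    [ 'space then LF',  "stream \nhello\nendstream",  5 ],
    [ 'data starts LF', "stream\n\nABC\nendstream",   4 ],
);

for my $sample (@samples) {
    my ($name, $text, $length) = @{$sample};
    my $data = stream_data($text, $length);
    (my $shown = $data) =~ s/\n/\\n/g;   # make line feeds visible
    print "$name: '$shown'\n";
}
```

Since the class cannot match a line feed, the pattern always stops at the first one after the keyword, so a stream whose data begins with "\n" keeps that byte.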


Fri Jan 30 09:16:08 2009 vbmetta [...] hotmail.com - Correspondence added

Subject:	RE: [rt.cpan.org #42819] parseStream() fix for unrecognized stream tag
Date:	Fri, 30 Jan 2009 14:11:51 +0000
To:	<bug-cam-pdf [...] rt.cpan.org>
From:	Vonne Bannavong <vbmetta [...] hotmail.com>

The emphasis I added to the quote in the previous message was lost when it was posted.

Here it is again with text...

    The keyword stream that follows the stream dictionary ==>shall<== be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between ==>the end-of-line marker following the stream keyword<== and the endstream keyword; the stream dictionary specifies the exact number of bytes. ==>There should be<== an end-of-line marker after the data and before endstream; this marker ==>shall not be<== included in the stream length. There shall not be any extra bytes, other than white space, between endstream and endobj.

    http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf (C) Adobe Systems Incorporated 2008 (retrieved 2009-01-29)


Sat Jan 31 03:59:12 2009 chris [...] chrisdolan.net - Correspondence added

Subject:	Re: [rt.cpan.org #42819] parseStream() fix for unrecognized stream tag
Date:	Sat, 31 Jan 2009 02:58:20 -0600
To:	bug-CAM-PDF [...] rt.cpan.org
From:	Chris Dolan <chris [...] chrisdolan.net>

Hmm, yeah, that makes sense. I had not interpreted it that way, but I think it's worth applying your patch after all.

Chris

Sat Jan 31 10:58:10 2009 chris+rt [...] chrisdolan.net - Correspondence added

This will be in the next CAM::PDF release. Thanks!

Sat Jan 31 10:58:11 2009 chris+rt [...] chrisdolan.net - Status changed from 'open' to 'resolved'