Chris,
You're absolutely right - I overlooked that \n is matched by \s.
Thanks, I'd rather it works 100% of the time.
What we really need to match is "whitespace except \n".
I looked up the PDF 1.7 spec online in Adobe's ISO-approved copy of ISO 32000-1:2008
found at
http://www.adobe.com/devnet/pdf/pdf_reference.html
Your referenced passage occurs in section "7.3.8 Stream Objects" ... "7.3.8.1 General".
However the wording is slightly different from the version you quoted. In fact it makes the end-of-line
before the stream data bytes a requirement. I quote (and add emphasis on notable differences):
The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker
consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE
RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the
stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes. There
should be an end-of-line marker after the data and before endstream; this marker shall not be included in the
stream length. There shall not be any extra bytes, other than white space, between endstream and endobj.
http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf (C) Adobe Systems Incorporated 2008 (retrieved 2009-01-29)
Section 4.46 says that whitespace separates PDF syntactic constructs, and newlines are included in whitespace.
What we have with the "stream" keyword is a special case where the first newline that appears after it, in addition to
being whitespace in relation to keywords, also acts as the delimiter for the data bytes of the stream. My assertion is that
before hitting the newline, PDF syntax is skipping whitespace in keyword context, and also looking for that first newline
to start the data stream. I've had no reason to ever consider this, except that now I've seen one commercial PDF writer
running with this interpretation (or maybe it was just a stupid mainframe character conversion... \r into space; who knows?).
Implementations aside, I would bet that the PDF spec designers would agree that PDF parsing returns to keyword scanning
mode after the stream's byte count is exhausted, even though the only valid keyword is "endstream". The section quoted above
hints at this, saying a newline is ignored if it appears before "endstream", and allowing any whitespace before "endobj".
The section's encouragement to add an end-of-line after the stream data and before "endstream" is motivated by the desire to
aid in recovery of corrupted PDF files (I saw it somewhere else in the spec but now evince !~ /.*/).
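A rough sketch of that reading, assuming the text begins at the "stream" keyword and the /Length value has already been resolved to an integer. The name read_stream is hypothetical and is not part of CAM::PDF; this only illustrates the two scanning modes, not a full parser.

```perl
use strict;
use warnings;

# Hypothetical reader: $content starts at the "stream" keyword and
# $length is the already-resolved /Length value (both are assumptions).
sub read_stream {
    my ($content, $length) = @_;

    # Keyword context: skip whitespace other than LF, then require the first LF.
    $content =~ m/ \G stream [\x20\t\f\r]* \n /xmsgc or return;

    # Data context: take exactly $length bytes, whatever they contain.
    my $start = pos $content;
    return if $start + $length > length $content;
    my $data = substr $content, $start, $length;
    pos($content) = $start + $length;

    # Back in keyword context: any whitespace, then "endstream".
    $content =~ m/ \G \s* endstream /xmsgc or return;
    return $data;
}

# The trailing space after "stream" is skipped; the data's own leading LF is kept.
my $data = read_stream("stream \r\n\nabc\nendstream", 4);
print length($data), " bytes read\n";
```

A CARRIAGE RETURN alone after "stream" is rejected here, which matches the "shall" wording in 7.3.8.1.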
Getting back to the regex we need, section 4.46 lists valid whitespace characters:
TAB (09h)
LINE FEED (0Ah)
FORM FEED (0Ch)
CARRIAGE RETURN (0Dh)
and the venerable SPACE (20h)
but we want to ignore LINE FEED, so how about
my $begin = shift || qr/ stream[\f\r[:blank:]]*\n /xms;
(since [:blank:] is equivalent to [ \t], whereas [:space:] would also match \n). I tried it and it works on a few largish PDFs
plus these formerly unreadable statements. Do you have any specially constructed unit test cases?
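As a starting point for such test cases, here is a minimal sketch that compares a [:blank:]-based class with a [:space:]-based one. The sample strings and the helper name data_after are made up for illustration and are not taken from CAM::PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: return whatever follows the matched "stream" marker,
# or undef when the pattern does not match.
sub data_after {
    my ($re, $content) = @_;
    return $content =~ m/$re/ ? substr($content, $+[0]) : undef;
}

# [:blank:] is space and tab only; [:space:] also contains \n.
my $blank = qr/ stream[\f\r[:blank:]]*\n /xms;
my $space = qr/ stream[\f\r[:space:]]*\n /xms;

# Case 1: a stray space between "stream" and the end-of-line marker.
my $trailing_space = "<< /Length 3 >>\nstream \r\nabc\nendstream";

# Case 2: stream data whose first byte is itself a LINE FEED.
my $leading_newline = "<< /Length 4 >>\nstream\n\nabc\nendstream";

for my $case ($trailing_space, $leading_newline) {
    my $b = data_after($blank, $case);
    my $s = data_after($space, $case);
    print $b eq $s ? "same data\n" : "different data\n";
}
```

In the second case the [:space:] version swallows the data's leading LINE FEED, which is the failure Chris described; the [:blank:] version stops at the first LINE FEED.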
Robert
> Subject: Re: [rt.cpan.org #42819] parseStream() fix for unrecogized stream tag
> From: bug-CAM-PDF@rt.cpan.org
> To: vbmetta@hotmail.com
> Date: Tue, 27 Jan 2009 22:22:12 -0500
>
> <URL: https://rt.cpan.org/Ticket/Display.html?id=42819 >
>
> Robert,
>
> I thank you for the thoughtful comment, but your patch is not
> correct. If I applied that patch, then streams that started with a
> "\n" would be missing their initial character(s). The PDF
> specification is very clear on this issue:
>
> "The keyword 'stream' that follows the stream dictionary should be
> followed by an end-of-line marker consisting of either a carriage
> return and a line feed or just a line feed, and not by a carriage
> return alone. The sequence of bytes that make up a stream lie
> between the 'stream' and 'endstream' keywords; the stream dictionary
> specifies the exact number of bytes. It is recommended that there be
> an end-of-line marker after the data and before 'endstream'; this
> marker is not included in the stream length."
>
> (PDF 1.7 reference, pp 60-61)
>
> You are obviously welcome to use the patch in your own code, but I
> recommend that you check the results carefully because, given random
> binary streams, your patch will work in 255 of every 256 cases.
>
> Chris