
This queue is for tickets about the PDF-API2 CPAN distribution.


Subject: Can't call method "outobjdeep" in 2.026
Hi,

I'm using the library in a simple script to update the info structure of PDF files.

References:
Perl version: Strawberry Perl 5.18.2
OS: Windows 7 Enterprise

The PDF input files are produced by a Java program with PDF version 1.4, and I have never had a problem with these. The issue happens sometimes when users add annotations with Acrobat Reader, which changes the PDF version to 1.7 (according to the Acrobat Reader they use).

After this modification we get the known old bug #48683, as we are using PDF-API2-2.021. I've just tried the newly released PDF-API2-2.026 to check whether this has changed, and I get a new error message:
<<
Can't call method "outobjdeep" on an undefined value at D:/tm_programs/perl_portable_pdf/perl/site/lib/PDF/API2/Basic/PDF/Objind.pm line 170.
>>
Below is an extract from my sample script:
<<
my $pdf = PDF::API2->open($source) or die "Can't open PDF file $source: $!";
my $nowDate = strftime( "%Y%m%d%H%M%S", localtime());
my %h = $pdf->info( 'CreationDate' => $nowDate, );
$pdf->saveas($source);
>>
As this is my first time reporting a bug, please excuse any mistakes.
Are you able to attach a PDF that demonstrates this problem? If you'd rather it not be publicly visible, you can instead send one to me privately.
Subject: Re: [rt.cpan.org #112932] Can't call method "outobjdeep" in 2.026
Date: Thu, 17 Mar 2016 16:33:03 +0000
To: bug-PDF-API2 [...] rt.cpan.org
From: Francesco Fiorentino <profires [...] gmail.com>
Attached are sample.pdf, on which the attached Perl script (test.pl) works correctly, and a modified version (sampleMod.pdf) that triggers the error message above. sampleMod.pdf was obtained by adding a highlight with Adobe Reader XI and saving the file.
Download sample.pdf
application/pdf 5.2k

Download sampleMod.pdf
application/pdf 9.3k


Subject: Re: [rt.cpan.org #112932] Can't call method "outobjdeep" in 2.026
Date: Tue, 26 Apr 2016 08:07:25 +0000
To: bug-PDF-API2 [...] rt.cpan.org
From: Francesco Fiorentino <profires [...] gmail.com>
Hi,

With the 2.027 release, the error message no longer appears, but the same input attached previously now produces an unreadable file.

Thanks,
Francesco
Subject: Re: [rt.cpan.org #112932] Can't call method "outobjdeep" in 2.026
Date: Wed, 01 Jun 2016 14:12:03 +0000
To: bug-PDF-API2 [...] rt.cpan.org
From: Francesco Fiorentino <profires [...] gmail.com>
Any feedback about that?
On Wed Jun 01 10:12:24 2016, profires@gmail.com wrote:
> Any feedback about that?
I suspect that it's the same issue as ticket #113293.
On Thu Jun 02 09:55:22 2016, SSIMMS wrote:
> On Wed Jun 01 10:12:24 2016, profires@gmail.com wrote:
> > Any feedback about that?
>
> I suspect that it's the same issue as ticket #113293.
Actually, the issue seems unrelated. End of the modified PDF:

startxref
116
%%EOF
8 0 obj
<< /CreationDate (20160607205416) /Creator (Apache FOP Version 1.1) /ModDate (D:20160317171139+01'00') /PDFVersion (1.4) /Producer (Apache FOP Version 1.1) >>
endobj
xref
0 1
0000000000 65535 f
8 1
0000009549 00000 n
trailer
<< /Type /XRef /DecodeParms << /Columns 4 /Predictor 12 >> /Filter /FlateDecode /ID [ <951086a159fa774291c81f007ad52c0e> <d0fd218e4aa35740b313e56bfd43b2db> ] /Index [ 9 18 ] /Info 8 0 R /Length 60 /Prev 116 /Root 10 0 R /Size 1 /W [ 1 2 1 ] >>
startxref
9723
%%EOF

At first glance, it looks like the code just appends this data and keeps the original PDF verbatim, hence the breakage.
Subject: Simply opening and saving a multipage PDF file corrupts the file
Date: Wed, 24 Aug 2016 11:06:30 +0200
To: bug-PDF-API2 [...] rt.cpan.org
From: Dietrich Streifert <dietrich.streifert [...] googlemail.com>
This is for Perl 5.16 on CentOS 7.2, using a simple four-page test file (filename "test.pdf"). The following code

my $pdf = PDF::API2->open("test.pdf");
$pdf->saveas("test-mod.pdf");
$pdf->end;

generates a corrupt file "test-mod.pdf" which can no longer be read by e.g. Acrobat Reader, which reports that the document cannot be opened (code 14).

This behaviour makes PDF::API2 unusable for even the simplest modifications.

I've attached both the Perl code and the test file (I don't know if this gets through the email bug submission at rt.cpan.org).
Download test.pdf
application/pdf 26.5k


Subject: [rt.cpan.org #117184]
Date: Wed, 24 Aug 2016 10:12:45 -0400
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry <philperry [...] hvc.rr.com>
I see that the original test.pdf is PDF version 1.5. Maybe there's something in there that got corrupted when reading into PDF::API2. Is it possible to create your test.pdf in version 1.4 or even 1.3? Admittedly that's not a great solution -- PDF::API2 needs to be brought into the 21st century and handle up to version 1.7 correctly -- but it may do for the time being.
Subject: Re: [rt.cpan.org #117184]
Date: Wed, 24 Aug 2016 16:27:45 +0200
To: bug-PDF-API2 [...] rt.cpan.org
From: Dietrich Streifert <dietrich.streifert [...] googlemail.com>
You're right! It works if I convert test.pdf to PDF-Version 1.4.
PDF::API2 got support for reading files with cross-reference streams in version 2.026, but it doesn't yet support writing those files. The easiest way to implement this would be to convert the object stream to regular objects and save the file normally. That would eliminate the need to teach PDF::API2 how to write a cross-reference stream, though that's the other option. Doing so will typically produce a file that's a little smaller, but it isn't necessary. As a workaround until someone adds that support, you can use importPageIntoForm to copy each page into a new PDF file, or use other copy methods to get the data from the original file to a new one.
On Thu Jun 02 09:55:22 2016, SSIMMS wrote:
> On Wed Jun 01 10:12:24 2016, profires@gmail.com wrote:
> > Any feedback about that?
>
> I suspect that it's the same issue as ticket #113293.
Ok, not #113293, but it does appear to be the same as #117184. sampleMod.pdf contains a cross-reference stream. PDF::API2 can read them as of version 2.026, but it doesn't know how to write a cross-reference stream yet, nor how to convert from a cross-reference stream to a cross-reference table (which would likely be the easier of the two to implement). A potential solution and a workaround are given in ticket #117184.
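For anyone wanting to check whether a given file is affected, here is a minimal sketch that looks at what the last startxref offset points to: the "xref" keyword (classic table) or an indirect object (which, at that position, would be a cross-reference stream). The helper name xref_kind and the two tiny in-memory samples are made up for illustration; this is not part of PDF::API2 and does no real parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the newest cross-reference section of a PDF
# held in memory as 'table', 'stream', or 'unknown'.
sub xref_kind {
    my ($bytes) = @_;

    # The last "startxref" keyword holds the offset of the newest section.
    my $offset;
    $offset = $1 while $bytes =~ /startxref\s+(\d+)/g;
    return 'unknown' unless defined $offset && $offset < length $bytes;

    my $head = substr $bytes, $offset, 32;
    return 'table'  if $head =~ /\Axref\b/;              # classic table keyword
    return 'stream' if $head =~ /\A\d+\s+\d+\s+obj\b/;   # indirect object, i.e. an XRef stream
    return 'unknown';
}

# Two made-up, heavily truncated samples (not valid PDFs, only the tail matters).
my $header  = "%PDF-1.5\n";
my $at      = length $header;
my $classic = $header . "xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\nstartxref\n$at\n%%EOF\n";
my $modern  = $header . "5 0 obj\n<< /Type /XRef /Length 0 >>\nstream\n\nendstream\nendobj\nstartxref\n$at\n%%EOF\n";

print xref_kind($classic), "\n";
print xref_kind($modern), "\n";
```

To try it on a real file, read the file in binmode into a scalar and pass that scalar to xref_kind.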
Possible solution in ticket 121832.
Ticket 121832 is marked as fixed (resolved), but I don't think Vadim's code was put in, and I don't think the current PDF::API2 (nor PDF::Builder) can deal with writing back out a PDF 1.5 cross-reference stream. I don't know for sure what was "fixed" in that ticket. Perhaps it would be a good time to take another look at either writing out a cross-reference stream or converting it to a classic xref table. In PDF::Builder, the cross-reference stream output would automatically bump the PDF version to 1.5 (simply reading in such a PDF in the first place will also do so). I have no problems with doing that -- on the other hand, is there a strong argument for converting to an xref table, to stay at PDF 1.4? Cross-reference streams, once read in, seem to be causing more and more trouble, so it would be good to deal with them once and for all.
Phil, I have a better alternative than the patch (hack) from #121832. To please Acrobat/Reader, an incremental update can append either a classical XRef table or a compressed XRef stream. The new patch seems to be working. The test PDF file is from this thread.

1) Producing "hybrid files" to ensure "compatibility with older applications" is not implemented (was not even contemplated -- I don't think it's important anymore).
2) No support (with this patch, but would not be difficult in general) for files > ~4 Gb.
3) Somewhat lousy compression (because of no prediction) if someone updates an unusually large number of objects (i.e. generally unlikely).
4) Of course, updated objects are not stuffed into streams, and furthermore this patch does nothing to "use modern compression" when the file is clean-output (IIRC, PDF::API2 can't do it anyway).
5) Important -- this patch also applies changes (2 topmost changes) as per #121911.

In fact, the fixes are very minimal: existing code is mostly re-used to collect updates made to the XRef table (instead of writing them as they come) and then apply them appropriately in either of 2 modes.

One (minor) digression: the documentation could be more clear that after calling "saveas" an instance becomes unusable -- to prevent someone writing scripts such as the one with the commented fragment below.

use warnings;
use strict;
use feature 'say';
use PDF::API2;

my $pdf = PDF::API2-> open( "test.pdf" );
$pdf-> page;
$pdf-> page;
$pdf-> page;
$pdf-> saveas( "test-mod.pdf" );
# $pdf-> page;
# $pdf-> page;
# $pdf-> saveas( "test-mod++.pdf" );
__END__
Subject: File.diff.190403.txt
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017
+++ PDF\API2\Basic\PDF\File.pm Wed Apr 3 04:01:26 2019
@@ -522,6 +522,7 @@
         if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) {
             bless $result, $types{$result->{'Type'}->val};
+            $result-> {' outto'} = [ $self ];
         }
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
@@ -540,7 +541,7 @@
         }
         $result->{' parent'} = $self;
         weaken $result->{' parent'};
-        $result->{' realised'} = 0;
+#??     $result->{' realised'} = 0;
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
     }
@@ -1282,7 +1283,7 @@
     $tdict->{'Size'} = PDFNum($self->{' maxobj'});
     my $tloc = $fh->tell();
-    $fh->print("xref\n");
+    my @out;
     my @xreflist = sort { $self->{' objects'}{$a->uid}[0] <=> $self->{' objects'}{$b->uid}[0] } (@{$self->{' printed'} || []}, @{$self->{' free'} || []});
@@ -1314,25 +1315,25 @@
         # $fh->printf("0 1\n%010d 65535 f \n", $ff);
         # }
         if ($i > $#xreflist || $self->{' objects'}{$xreflist[$i]->uid}[0] != $j + 1) {
-            $fh->print(($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n");
+            push @out, ($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n";
             if ($first == -1) {
-                $fh->printf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
+                push @out, sprintf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
                 $first = 0;
             }
             for ($j = $first; $j < $i; $j++) {
                 my $xref = $xreflist[$j];
                 if (defined $freelist[$k] && defined $xref && "$freelist[$k]" eq "$xref") {
                     $k++;
-                    $fh->print(pack("A10AA5A4",
+                    push @out, pack("A10AA5A4",
                         sprintf("%010d", (defined $freelist[$k] ?
                             $self->{' objects'}{$freelist[$k]->uid}[0] : 0)),
                         " ",
                         sprintf("%05d", $self->{' objects'}{$xref->uid}[1] + 1),
-                        " f \n"));
+                        " f \n");
                 }
                 else {
-                    $fh->print(pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
+                    push @out, pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
                         sprintf("%05d", $self->{' objects'}{$xref->uid}[1]),
-                        " n \n"));
+                        " n \n");
                 }
             }
             $first = $i;
@@ -1342,9 +1343,48 @@
             $j++;
         }
     }
-    $fh->print("trailer\n");
-    $tdict->outobjdeep($fh, $self);
-    $fh->print("\nstartxref\n$tloc\n%%EOF\n");
+    if ( exists $tdict-> { Type } and $tdict-> { Type }-> val eq 'XRef' ) {
+
+        my ( @index, @stream );
+        my $len = 2; # 2 or 4 will do
+        for ( @out ) {
+            $_ = [ split ];
+            die if $_-> [ 0 ] >= 0xFFFFFFFF; # extremely unlikely, but better (any?) message would help
+            $len = 4 if $_-> [ 0 ] >= 0xFFFF;
+            @$_ == 2 ? push @index, @$_ : push @stream, $_
+        }
+        my $c = $len == 2 ? 'n' : 'N';
+        my $stream = join '', map {
+            pack "C${c}C", $_-> [ 2 ] eq 'n' ? 1 : 0, @{ $_ }[ 0 .. 1 ]
+        } @stream;
+
+        $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 );
+        $self-> out_obj( $tdict );
+
+        push @index, $i, 1;
+        $stream .= pack "C${c}C", 1, $tloc, 0;
+
+        $tdict-> { Size } = PDFNum( ++ $i );
+        $tdict-> { Index } = PDFArray( map PDFNum( $_ ), @index );
+        $tdict-> { W } = PDFArray( map PDFNum( $_ ), 1, $len, 1 );
+        $tdict-> { Filter } = PDFName( 'FlateDecode' );
+
+        delete $tdict-> { DecodeParms }; # For such streams, prediction improves compression hugely,
+                                         # but "outfilt" just can't do it, alas.
+
+        $stream = PDF::API2::Basic::PDF::Filter::FlateDecode-> new-> outfilt( $stream, 1 );
+        $tdict-> { ' stream' } = $stream;
+        $tdict-> { ' nofilt' } = 1;
+        delete $tdict-> { Length };
+        $self-> ship_out;
+    }
+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }
Should have chosen offset length (2 or 4 bytes) based on $tloc only. Fixed. Also, added filtering to XRef stream. Raw (uncompressed) stream length will grow up to 25% (as with file being tested) because of prepended byte per "row", but for any substantial changes to PDF file, compression ratio will improve significantly. E.g., if, in example script, 6 instead of 3 pages are appended, compressed stream length already becomes 42 vs. 44 bytes for filtered/unfiltered data. One concern may be that gennum is limited to 1 byte, but, in reality, they haven't been used (and objnums re-used) for a long time. In test file, and all "modern" (with XRef stream) files I've seen, 1st XRef Table entry is "0 0 f". IIRC PDF 2.0 says gennum is always 0.
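To make the row layout and the 25% figure concrete, here is a minimal standalone sketch of the "PNG Up" prediction step, under my own assumptions: rows are given as [ type, offset, generation ] (the patch keeps them in a different order), the helper name encode_xref_rows is made up, and the offsets are example values only. The Flate compression that would follow is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: pack cross-reference stream rows for /W [ 1 len 1 ]
# and apply the PNG "Up" predictor (one tag byte, value 2, per row).
sub encode_xref_rows {
    my ($rows, $max_offset) = @_;

    my $len  = $max_offset > 0xFFFF ? 4 : 2;   # width of the offset field
    my $tpl  = $len == 4 ? 'CNC' : 'CnC';      # type, big-endian offset, generation
    my @prev = (0) x ($len + 2);               # the row "above" the first row is all zeros
    my $out  = '';

    for my $row (@$rows) {
        my @line = unpack 'C*', pack $tpl, @$row;
        # Tag byte, then each byte minus the byte above it, modulo 256.
        $out .= pack 'C*', 2, map { ($line[$_] - $prev[$_]) % 256 } 0 .. $#line;
        @prev = @line;
    }
    return ($out, $len);
}

# Example values only: one free entry and two in-use entries.
my @rows = ( [ 0, 0, 0 ], [ 1, 27173, 0 ], [ 1, 27300, 0 ] );
my ($encoded, $len) = encode_xref_rows(\@rows, 27300);

printf "W = [ 1 %d 1 ], raw %d bytes, with predictor tags %d bytes\n",
    $len, scalar(@rows) * ($len + 2), length($encoded);
print join(' ', unpack('C*', $encoded)), "\n";
```

With 2-byte offsets each raw row is 4 bytes and each predicted row is 5, which is where the "up to 25%" growth before compression comes from. The matching decode parameters would be Predictor 12 and Columns equal to len + 2.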
Subject: File.diff.190403.txt
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017
+++ PDF\API2\Basic\PDF\File.pm Wed Apr 3 19:27:37 2019
@@ -522,6 +522,7 @@
         if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) {
             bless $result, $types{$result->{'Type'}->val};
+            $result-> {' outto'} = [ $self ];
         }
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
@@ -540,7 +541,7 @@
         }
         $result->{' parent'} = $self;
         weaken $result->{' parent'};
-        $result->{' realised'} = 0;
+#??     $result->{' realised'} = 0;
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
     }
@@ -1282,7 +1283,7 @@
     $tdict->{'Size'} = PDFNum($self->{' maxobj'});
     my $tloc = $fh->tell();
-    $fh->print("xref\n");
+    my @out;
     my @xreflist = sort { $self->{' objects'}{$a->uid}[0] <=> $self->{' objects'}{$b->uid}[0] } (@{$self->{' printed'} || []}, @{$self->{' free'} || []});
@@ -1314,25 +1315,25 @@
         # $fh->printf("0 1\n%010d 65535 f \n", $ff);
         # }
         if ($i > $#xreflist || $self->{' objects'}{$xreflist[$i]->uid}[0] != $j + 1) {
-            $fh->print(($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n");
+            push @out, ($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n";
             if ($first == -1) {
-                $fh->printf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
+                push @out, sprintf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
                 $first = 0;
             }
             for ($j = $first; $j < $i; $j++) {
                 my $xref = $xreflist[$j];
                 if (defined $freelist[$k] && defined $xref && "$freelist[$k]" eq "$xref") {
                     $k++;
-                    $fh->print(pack("A10AA5A4",
+                    push @out, pack("A10AA5A4",
                         sprintf("%010d", (defined $freelist[$k] ?
                             $self->{' objects'}{$freelist[$k]->uid}[0] : 0)),
                         " ",
                         sprintf("%05d", $self->{' objects'}{$xref->uid}[1] + 1),
-                        " f \n"));
+                        " f \n");
                 }
                 else {
-                    $fh->print(pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
+                    push @out, pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
                         sprintf("%05d", $self->{' objects'}{$xref->uid}[1]),
-                        " n \n"));
+                        " n \n");
                 }
             }
             $first = $i;
@@ -1342,9 +1343,53 @@
             $j++;
         }
     }
-    $fh->print("trailer\n");
-    $tdict->outobjdeep($fh, $self);
-    $fh->print("\nstartxref\n$tloc\n%%EOF\n");
+    if ( exists $tdict-> { Type } and $tdict-> { Type }-> val eq 'XRef' ) {
+
+        my ( @index, @stream );
+        for ( @out ) {
+            my @a = split;
+            @a == 2 ? push @index, @a : push @stream, \@a
+        }
+        $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 );
+        $self-> out_obj( $tdict );
+
+        push @index, $i, 1;
+        push @stream, [ $i, 0, 'n' ];
+
+        my $len = $tloc > 0xFFFF ? 4 : 2; # don't expect files > 4 Gb
+        my $tpl = $tloc > 0xFFFF ? 'CNC' : 'CnC'; # don't expect gennum > 255, it's absurd.
+                                                  # Adobe doesn't use them anymore anyway
+        my $stream = '';
+        my @prev = ( 0 ) x ( $len + 2 );
+        for ( @stream ) {
+            my @line = unpack 'C*', pack $tpl, $_-> [ 2 ] eq 'n' ? 1 : 0, @{ $_ }[ 0 .. 1 ];
+
+            $stream .= pack 'C*', 2, # prepend filtering method, "PNG Up"
+                map {( $line[ $_ ] - $prev[ $_ ] + 256 ) % 256 } 0 .. $#line;
+            @prev = @line;
+        }
+        $tdict-> { Size } = PDFNum( $i + 1 );
+        $tdict-> { Index } = PDFArray( map PDFNum( $_ ), @index );
+        $tdict-> { W } = PDFArray( map PDFNum( $_ ), 1, $len, 1 );
+        $tdict-> { Filter } = PDFName( 'FlateDecode' );
+
+        $tdict-> { DecodeParms } = PDFDict;
+        $tdict-> { DecodeParms }-> val-> { Predictor } = PDFNum( 12 );
+        $tdict-> { DecodeParms }-> val-> { Columns } = PDFNum( $len + 2 );
+
+        $stream = PDF::API2::Basic::PDF::Filter::FlateDecode-> new-> outfilt( $stream, 1 );
+        $tdict-> { ' stream' } = $stream;
+        $tdict-> { ' nofilt' } = 1;
+        delete $tdict-> { Length };
+        $self-> ship_out;
+    }
+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }
Wow! That's quite a bit of work you've put in -- thank you. It's complicated enough that I want to go over it very carefully (and of course, test it thoroughly) before putting it in PDF::Builder. I can't even yet ask any questions about it! I hope to get it in for release 3.014, unless there are complications, in which case it may slide to 3.015 this summer.
Found minor issues: though harmless, they'd better be fixed. I hope that's final version, sorry for the mess.
Subject: File.diff.190403.txt
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017
+++ PDF\API2\Basic\PDF\File.pm Tue Apr 9 00:46:42 2019
@@ -522,6 +522,8 @@
         if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) {
             bless $result, $types{$result->{'Type'}->val};
+            $result-> {' outto'} = [ $self ];
+            weaken $_ for @{$result->{' outto'}};
         }
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
@@ -540,7 +542,7 @@
         }
         $result->{' parent'} = $self;
         weaken $result->{' parent'};
-        $result->{' realised'} = 0;
+#??     $result->{' realised'} = 0;
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
     }
@@ -1282,7 +1284,7 @@
     $tdict->{'Size'} = PDFNum($self->{' maxobj'});
     my $tloc = $fh->tell();
-    $fh->print("xref\n");
+    my @out;
     my @xreflist = sort { $self->{' objects'}{$a->uid}[0] <=> $self->{' objects'}{$b->uid}[0] } (@{$self->{' printed'} || []}, @{$self->{' free'} || []});
@@ -1314,25 +1316,25 @@
         # $fh->printf("0 1\n%010d 65535 f \n", $ff);
         # }
         if ($i > $#xreflist || $self->{' objects'}{$xreflist[$i]->uid}[0] != $j + 1) {
-            $fh->print(($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n");
+            push @out, ($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n";
             if ($first == -1) {
-                $fh->printf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
+                push @out, sprintf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
                 $first = 0;
             }
             for ($j = $first; $j < $i; $j++) {
                 my $xref = $xreflist[$j];
                 if (defined $freelist[$k] && defined $xref && "$freelist[$k]" eq "$xref") {
                     $k++;
-                    $fh->print(pack("A10AA5A4",
+                    push @out, pack("A10AA5A4",
                         sprintf("%010d", (defined $freelist[$k] ?
                             $self->{' objects'}{$freelist[$k]->uid}[0] : 0)),
                         " ",
                         sprintf("%05d", $self->{' objects'}{$xref->uid}[1] + 1),
-                        " f \n"));
+                        " f \n");
                 }
                 else {
-                    $fh->print(pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
+                    push @out, pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
                         sprintf("%05d", $self->{' objects'}{$xref->uid}[1]),
-                        " n \n"));
+                        " n \n");
                 }
             }
             $first = $i;
@@ -1342,9 +1344,53 @@
             $j++;
         }
     }
-    $fh->print("trailer\n");
-    $tdict->outobjdeep($fh, $self);
-    $fh->print("\nstartxref\n$tloc\n%%EOF\n");
+    if ( exists $tdict-> { Type } and $tdict-> { Type }-> val eq 'XRef' ) {
+
+        my ( @index, @stream );
+        for ( @out ) {
+            my @a = split;
+            @a == 2 ? push @index, @a : push @stream, \@a
+        }
+        my $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 );
+        $self-> out_obj( $tdict );
+
+        push @index, $i, 1;
+        push @stream, [ $tloc, 0, 'n' ];
+
+        my $len = $tloc > 0xFFFF ? 4 : 2; # don't expect files > 4 Gb
+        my $tpl = $tloc > 0xFFFF ? 'CNC' : 'CnC'; # don't expect gennum > 255, it's absurd.
+                                                  # Adobe doesn't use them anymore anyway
+        my $stream = '';
+        my @prev = ( 0 ) x ( $len + 2 );
+        for ( @stream ) {
+            my @line = unpack 'C*', pack $tpl, $_-> [ 2 ] eq 'n' ? 1 : 0, @{ $_ }[ 0 .. 1 ];
+
+            $stream .= pack 'C*', 2, # prepend filtering method, "PNG Up"
+                map {( $line[ $_ ] - $prev[ $_ ] + 256 ) % 256 } 0 .. $#line;
+            @prev = @line;
+        }
+        $tdict-> { Size } = PDFNum( $i + 1 );
+        $tdict-> { Index } = PDFArray( map PDFNum( $_ ), @index );
+        $tdict-> { W } = PDFArray( map PDFNum( $_ ), 1, $len, 1 );
+        $tdict-> { Filter } = PDFName( 'FlateDecode' );
+
+        $tdict-> { DecodeParms } = PDFDict;
+        $tdict-> { DecodeParms }-> val-> { Predictor } = PDFNum( 12 );
+        $tdict-> { DecodeParms }-> val-> { Columns } = PDFNum( $len + 2 );
+
+        $stream = PDF::API2::Basic::PDF::Filter::FlateDecode-> new-> outfilt( $stream, 1 );
+        $tdict-> { ' stream' } = $stream;
+        $tdict-> { ' nofilt' } = 1;
+        delete $tdict-> { Length };
+        $self-> ship_out;
+    }
+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }
Hi Vadim,

It looks like it's almost there. I did encounter one error message while running your test.pl code:

Character in 'C' format wrapped in pack at .../File.pm line 1507.

That line is after for (@stream) { :

my @line = unpack 'C*', pack $tpl, $_->[2] eq 'n'...

Any ideas? I'm running PDF::Builder on Perl 5.26, if you tested at an earlier version. I applied only the changes in your last posting of the diffs to File.pm (they appeared to be cumulative).

The old code produced an unloadable corrupted PDF, but the new File.pm code produced a working PDF. It's the original PDF 1.5 input with some new stuff, including a cross reference stream, added after the %%EOF, if that's the correct result. I note that some objects have the same number as those found earlier in the input -- I take it they override (replace) the earlier object of the same number?

Should this cross reference stream output be seen ONLY if the file read in was PDF 1.5 or higher, with a cross reference stream? That is, nothing at PDF 1.4 or lower in PDF::Builder should cause a cross reference stream to be output? If something can cause it, I will need to add a line of code to force a minimum of PDF 1.5 output level.

Finally, I looked at your complaint about 'saveas()' not permitting further updates. Indeed, after saveas(), $pdf is still defined and is still a hash, but $pdf->page() blows up (can't call method new_obj on an undefined value). Could you look at RT 81530 and see if it sounds related? Possibly $pdf should be marked as unusable, or even be undefined, once save(), saveas(), or stringify() is called? At the least, I can expand the documentation to warn about this.
I think I may have solved the first problem ('wrapped' warning), but I'd like your opinion on it. After the line

for (@stream) {

and before

my @line = unpack 'C*', pack $tpl, $_->[ 2 ] eq 'n'? 1 : 0, @{ $_ }[ 0 .. 1 ];

I added

$_->[1] &= 0x00FF;

to ensure that the value was in the range 0..255. Apparently, packing with C will do the same thing, but now issues a "wrapped" warning. Anyway, it seems to work. Was this value, which was 65535 in @stream, the one you referred to as "don't expect gennum > 255, it's absurd." or was that the other value?

$tloc was 27173, $len was 2, $tpl was 'CnC'. The first @stream was 0000000000 65535 f (I assume the 0's collapse to integer 0) and the result was 0 0 0 255 (xFFFF trimmed to xFF). The second @stream was 27173 0 n (x6A25) which gave a result of 1 106 37 0 (x6A x25). It's the same result as the original code, without the nasty warning.

By the way, I was concerned about both $stream and @stream being used together, so I renamed $stream to $sstream to eliminate any possibility of one being used for the other.
My test script didn't have shebang line with a "-w", that's why I didn't see this warning!

As you already found, the issue is with generation number 65535 of 0th object, i.e. what would be "0000000000 65535 f" line in classic table. I'd put the following at exact location where you suggested:

$_-> [ 1 ] = 0 if $_-> [ 2 ] eq 'f' and $_-> [ 1 ] == 65535;

Examples in the Reference show, that gennum 65535 can be used to mark objects as "not to be re-used", i.e. in theory, objects other than 0th can have it, too. In practice, apart from 65535 for object 0, my "absurd" comment was that probably no PDF file has ever had such long and twisted history of incremental updates, that gennum of any object is more than a dozen at most!

Further, as implementation note 16 in the Reference says, "Acrobat 6.0 and later do not use the free list to recycle object numbers; new objects are assigned new numbers." That's in accordance with my observation that 0th object (1st entry) in XRef stream has gennum of 0 in PDF files I've seen.
> original PDF 1.5 input with some new stuff, including a cross reference stream, added after the %%EOF, if that's the correct result.
That was the whole point -- to allow incremental update for files that are using Xref stream, so that Acrobat is OK with result. As I said earlier elsewhere, other viewers don't mind if update with classic XRef table is appended to file with Xref stream.
> I note that some objects have the same number as those found earlier in the input -- I take it they override (replace) the earlier object of the same number?
That's correct, it's how incremental update mechanism is described in the Reference.
> Should this cross reference stream output be seen ONLY if the file read in was PDF 1.5 or higher, with a cross reference stream?
Correct -- in my patch, the presence of "Type" entry in $tdict and its value being "XRef" are checked. Based on that, either classic or stream Xref information is written. I believe it's robust enough (maybe some other check would be better? A flag set when file was read?) I think we should only expect well-formed PDFs, which have correct version if they use Xref stream. But, it's possible that file's version would be 1.5 and above, but, for any reason, it has classic table. Then, my patch will append classic table.
> Could you look at RT 81530 and see if it sounds related?
Ah, so it's old issue, and better documentation is on its way :)
On Wed Apr 24 18:25:51 2019, vadimr wrote:
> My test script didn't have shebang line with a "-w", that's why I didn't see this warning!
That'll learn ya! :) I make it a habit to start each .pl with use strict; and use warnings;.
> $_-> [ 1 ] = 0 if $_-> [ 2 ] eq 'f' and $_-> [ 1 ] == 65535;
>
> Examples in the Reference show, that gennum 65535 can be used to mark objects as "not to be re-used", i.e. in theory, objects other than 0th can have it, too. In practice, apart from 65535 for object 0, my "absurd" comment was that probably no PDF file has ever had such long and twisted history of incremental updates, that gennum of any object is more than a dozen at most!
>
> Further, as implementation note 16 in the Reference says, "Acrobat 6.0 and later do not use the free list to recycle object numbers; new objects are assigned new numbers." That's in accordance with my observation that 0th object (1st entry) in XRef stream has gennum of 0 in PDF files I've seen.
OK, you explain it's OK to 0 out this particular 16-bit value (xFFFF), as no Reader that handles cross reference streams pays attention to the value anyway. Is that the only place that a value greater than xFF is ever going to show up? If it's documented that Readers don't care what the gennum value is in this case, that's fine, but I'm leery of "observations" that it always works that way -- there might be oddball Readers out there that /do/ care about this value.

I could imagine someone constantly updating a PDF file for some reason, perhaps "saving" instead of "quitting" a Reader. I saw this back in the early days of JPEG file usage -- a co-worker was complaining that his JPEG images were slowly rotting away. I had to explain to him that he was saving the image each time he wanted to quit the viewer, so the image was losing more high frequency data each time! Anyway, it's not impossible that the gennum could end up > 255 in some strange situations.

Since $_->[1] is going to be packed with 'C', I think it would be a good idea to stay with my fix of clearing high bits to ensure that it's in the 0..255 range. If someone /does/ get a gennum > 255, cycling back to 0 might cause problems, but that's life. At least they won't get a "wrapped" warning. If a cross reference stream is always going to handle it as a single byte, it can't be allowed to exceed 255, whatever its purpose. Should we consider treating it as 16-bits, if the standard permits?
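For comparison, a minimal sketch of how the options discussed for the generation number behave when the value is 65535. The helper names gen_masked and gen_zeroed are made up; the third variant (a 2-byte generation field, i.e. /W [ 1 2 2 ]) is only an assumption about how the "16-bits" question could look, not something taken from the patch.

```perl
use strict;
use warnings;

# Collect warnings so the "wrapped" diagnostic can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Variant A: mask to the low byte, whatever the entry is.
sub gen_masked { my ($gen) = @_; return $gen & 0xFF }

# Variant B: zero only the 65535 marker on free entries, leave the rest alone.
sub gen_zeroed {
    my ($gen, $flag) = @_;
    return ($flag eq 'f' && $gen == 65535) ? 0 : $gen;
}

my $gen = 65535;                                  # as in "0000000000 65535 f"

my $raw    = pack 'C', $gen;                      # no fix: wraps and warns
my $masked = pack 'C', gen_masked($gen);          # variant A
my $zeroed = pack 'C', gen_zeroed($gen, 'f');     # variant B

# Variant C (assumption): a 2-byte generation field keeps 65535 unchanged.
my $wide = pack 'Cnn', 0, 0, $gen;

printf "raw=%d masked=%d zeroed=%d wide_len=%d warnings=%d\n",
    ord($raw), ord($masked), ord($zeroed), length($wide), scalar(@warnings);
```

Masking and the unfixed code write the same byte; the only difference is the warning. Zeroing writes a different byte for this one entry, and the wider field avoids the question at the cost of one more byte per row.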
> > original PDF 1.5 input with some new stuff, including a cross reference stream, added after the %%EOF, if that's the correct result.
>
> That was the whole point
Let me rephrase my question, then -- is the correct result to output the original, unchanged (almost) PDF, and then tack on these new and replacement objects, and maybe a cross reference stream, after the original %%EOF? You seem to have said "yes".
> > Could you look at RT 81530 and see if it sounds related?
> > Ah, so it's old issue, and better documentation is on its way :)
I promise, I /will/ put something in, at least in the POD for save, saveas, and stringify! Unless you have any further updates or strong objections to something, I think I will put out PDF::Builder 3.014 this weekend, with this ticket closed.
>use warnings;
But I always do, I swear! It's File.pm that doesn't! And a lexically scoped "use warnings;" in my .pl can't help. OTOH, the "-w" switch on the command (shebang) line sets the global $^W. A bit further off topic, CAM::PDF does "use warnings;", and when I modified it some years ago, in quite a similar way and for internal use, to write XRef streams, the case with 65535 was caught. "Transferring" that patch to PDF::API2, I just forgot to zero the gennum of the 0th object. Sorry. If only I hadn't forgotten, I wouldn't have had to write all of the following :)
>Let me rephrase my question, then -- is the correct result to output the original, unchanged (almost) PDF, and then tack on these new and replacement objects, and maybe a cross reference stream, after the original %%EOF? You seem to have said "yes".
Yes. I'll try to clarify further, sorry if it sounds primitive. For performance reasons, many file formats allow incremental updates, with changes appended to the intact original. (Not "almost", but 100% intact.) In GUIs, it's usually "Save" for an incremental update, and "SaveAs" for a clean re-write, not necessarily to another filename. It's the same difference as between CAM::PDF's methods "output" and "cleanoutput". With "cleanoutput", all objects are re-numbered consecutively, getting a fresh gennum of "0", and "holes" of ranges of free objects are eliminated.

PDF::API2 simply can't do "cleanoutput". If a file is opened and then saved, it's always an incremental update, however confusing the method name "saveas" is. Even if all objects were changed and the original content becomes useless, it's stored intact as it was, and the new content is appended.

My patch in #121832 was an attempt to teach PDF::API2 to "cleanoutput". Now I think it's not worth it, since it can only output the old-fashioned 1.4 classical XRef table, it tries complex and possibly fragile, insufficiently tested things (as opposed to the simple patch discussed in this thread), and nobody seems interested. The new patch serves only one purpose: users can now modify "modern" PDF files without having to downgrade them to 1.4 as a preliminary step, and without worrying why Reader can't read their files and whether PDF::API2 works at all. But it's the same old incremental update.
>Is that the only place that a value greater than xFF is ever going to show up?
OK, consider this. For gennum to become > 255, a PDF file has to be updated at least 510 times. This minimum of 510 is possible if the pattern of updates is strictly that each odd save removes an object (objnum marked free on save), and some info (a completely new indirect object) is added on each even save (objnum re-used, gennum increased). If this pattern is not strict, gennum only gets to > 255 after more, possibly many more, updates. Multiply the probability of such a scenario by the chance that the people torturing this file never get worried about file size bloat, and so never reset the progression by issuing a "SaveAs" command somewhere in the middle.

Note, all of the above still happens in the 1.4 era, with the classical XRef table. If the file gets to PDF::API2 in this state, with any gennum > 255 -- fine, it's not an issue for the patch discussed. If this file is updated to 1.5 before getting to PDF::API2 -- fine too, it was a clean "SaveAs", all gennums were reset and never touched again by Acrobat/Reader. That's why my suggestion was to set gennum to 0 instead of 255 -- it would be the same "0" as in files saved with Reader. But in the end it's probably not so important.
>Should we consider treating it as 16-bits, if the standard permits
Is there any software that still tracks free list and re-uses objnums and increases gennums, even in Xref streams, and regardless of Adobe's own stance? I don't know! Have never heard of. I'd solve problems when (and if) they come: if anyone files a bug report and we see that it's because we (wrongly, it will appear) assumed gennums always zero (well, less than 255) in XRef streams -- fine, we'll know how to fix -- i.e. to use 'Cnn' ('CNn') template.
Both set special case '0 65535 f' to '0 0 f', and added warning and reduction of any generation number in excess of 255 (because it is packed with 'C' code). Closing RT 117184 for PDF::Builder, and fix will appear in today's 3.014 release. Again, thank you Vadim for your work on this. Please consider issuing a Pull Request for PDF::API2.
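For readers following the 'C' versus 'n' discussion, here is a minimal sketch of what the two choices mean for one cross-reference stream entry. The helper names and the entry layout (type, second field, generation number) are assumptions made for illustration; this is not the code that went into PDF::Builder or PDF::API2.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one cross-reference stream entry with
# one byte for the generation number (pack template 'CNC').
sub pack_xref_entry {
    my ($type, $field2, $gennum) = @_;

    # Special case from this thread: '0 65535 f' is written as '0 0 f'.
    $gennum = 0 if $type == 0 && $gennum == 65535;

    # 'C' holds a single byte, so anything larger is reduced, with a warning.
    if ($gennum > 255) {
        warn "generation number $gennum reduced to fit in one byte\n";
        $gennum &= 0xFF;
    }
    return pack('CNC', $type, $field2, $gennum);
}

# Hypothetical alternative: two bytes for the generation number ('CNn').
sub pack_xref_entry_wide {
    my ($type, $field2, $gennum) = @_;
    return pack('CNn', $type, $field2, $gennum);
}

printf "%d bytes per entry with 'CNC'\n", length pack_xref_entry(1, 1234, 0);
printf "%d bytes per entry with 'CNn'\n", length pack_xref_entry_wide(1, 1234, 0);
```

If the wider template were ever adopted, the field widths declared in the stream dictionary would have to change to match, so it is not a drop-in replacement for the one-byte version.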
I've attached a hybrid PDF, since they've been mentioned. These *AREN'T* breaking under update and don't need special handling. - david On Tue Apr 02 22:13:42 2019, vadimr wrote:
> Phil, > > I have better alternative than patch (hack) from #121832. To please > Acrobat/Reader, incremental update can append either classical Xref > Table or compressed Xref Stream. The new patch seems to be working. > The test PDF file is from this thread. > > 1) Producing "hybrid files" to ensure "compatibility with older > applications" is not implemented (was not even contemplated -- I don't > think it's important anymore). > > 2) No support (with this patch, but would not be difficult in general) > for files > ~4 Gb. > > 3) Somewhat lousy compression (because of no prediction) if someone > updates unusually large number of objects -- i.e. generally unlikely). > > 4) Of course, updated objects are not stuffed into streams, and > furthermore this patch does nothing to "use modern compression" when > file is clean-output (IIRC, PDF::API2 can't do it anyway). > > 5) Important -- this patch also applies changes (2 topmost changes) as > per #121911. > > In fact, fixes are very minimal, existing code is mostly re-used to > collect updates made to XRef Table (instead of writing them as they > come) and then apply them appropriately in either of 2 modes. > > + One (minor) digression: documentation could be more clear that after > calling "saveas" an instance becomes unusable -- to prevent someone > writing scripts e.g. such as with commented fragment below. > > > use warnings; > use strict; > use feature 'say'; > > use PDF::API2; > > my $pdf = PDF::API2-> open( "test.pdf" ); > $pdf-> page; > $pdf-> page; > $pdf-> page; > > $pdf-> saveas( "test-mod.pdf" ); > > # $pdf-> page; > # $pdf-> page; > # $pdf-> saveas( "test-mod++.pdf" ); > > __END__
Subject: hybrid.pdf
Download hybrid.pdf
application/pdf 79k

Thanks for testing, David. To handle hybrid files, this patch needs yet another tweak :-(.

    my $pdf = PDF::API2-> open( 'hybrid.pdf' );
    #delete $pdf-> { pdf }{ XRefStm };
    $pdf-> openpage( 1 )-> rotate( 180 );
    $pdf-> saveas( 'hybrid+.pdf' );

    $pdf = PDF::API2-> open( 'hybrid+.pdf' );
    $pdf-> page;
    $pdf-> saveas( 'hybrid++.pdf' );

The 1st page reverts to its unrotated state, because, according to the Reference, the "XRefStm" must be consulted first, before descending the chain of "Prev" entries (alas, Chrome is broken). So this entry should be deleted in the trailers of appended sections, i.e.

    delete $tdict-> { XRefStm };

inserted into the "else" clause of the above patch.

Further (NOT related to the patch discussed, but revealed because of "hybrid.pdf"), PDF::API2 appends new content quite literally: "%%EOF3 0 obj << /Type /Page ..." etc. Though the offset for object 3 is correct and no applications seem to complain, it's ugly and I doubt it's really valid syntax, so it would be better fixed, i.e., ensure a newline before appending. These changes aren't urgent; they are documented for the future.
Hi Vadim, Yeah, I caught the run-on %%EOF problem and fixed it yesterday (in PDF::Builder) by ensuring that an opened PDF ends with an EOL beyond the original final %%EOF (since new code will be appended). As for the rest of this stuff, I'm a bit confused. Do you anticipate having to patch PDF::API2 & Builder to do something with XRefStm in new trailers? How critical is this -- should I delay 3.015 release until the new patch? Is this only something that affects the Chrome PDF reader, or does it affect Acrobat Reader (and many other readers) too?
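As an illustration of the run-on %%EOF fix described above, here is a minimal sketch that operates on plain strings. The helper name and the sample content are hypothetical; the actual PDF::Builder change works on the opened file and may be implemented differently.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes to append so that the new section
# starts on its own line, even if the original ends right after %%EOF.
sub with_leading_eol {
    my ($original, $update) = @_;
    my $sep = $original =~ /[\r\n]\z/ ? '' : "\n";
    return $sep . $update;
}

my $original = "%PDF-1.5\nstartxref\n1234\n%%EOF";          # no EOL after %%EOF
my $update   = "3 0 obj\n<< /Type /Page >>\nendobj\n";
my $result   = $original . with_leading_eol($original, $update);

print $result =~ /%%EOF\n3 0 obj/ ? "separated\n" : "run-on\n";   # prints "separated"
```

When the original already ends with an EOL, nothing extra is added, so offsets computed for a well-formed file are unaffected.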
Subject: [rt.cpan.org #117184]
Date: Tue, 14 May 2019 13:06:05 +0300
To: bug-pdf-api2 [...] rt.cpan.org
From: vadim repin <futuramedium [...] yandex.ru>
Phil, actually it's new and unrelated issue that can affect hybrid files. Perhaps not very critical, as it always was there. As example shows, it is easy to modify a file so that XRefStm points to outdated information. This key simply must not be preserved. To fix, single line of code can be added immediately after "else" line. I mentioned Chrome just as a fun fact, it does not follow specification strictly, which appears to "cancel out" the issue and possibly adds to confusion.
OK, so

+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }

should become

+    else {
+        delete $tdict->{'XRefStm'};
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }

? Does this assume there is already a XRefStm entry in the existing PDF (that we want to use)? Should there be a check added that there is, before deleting the new one, or is it safe to assume there always is an existing one?
The code change is correct. *Maybe* (I now think) it would be better to move this line into sub's caller, where $tdict is created by copying the existing trailer dictionary, and just get rid of XRefStm unconditionally. Of course not every PDF contains "XRefStm"! Deleting non-existent hash elements is safe and a no-op. If (from maintenance POV?) you'd prefer "delete something{foo} if exists something{foo}" -- OK, write it so. For me it's more effort to read and tautology, in a sense.
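The point about deleting a missing key can be checked with plain hashes. This is only a sketch: the keys and values below are made up, and the real $tdict is a trailer dictionary object rather than a bare hash.

```perl
use strict;
use warnings;

# Two stand-in trailers with hypothetical values: one with XRefStm, one without.
my %with    = (Size => 10, Prev => 116, XRefStm => 5000);
my %without = (Size => 10, Prev => 116);

delete $with{XRefStm};      # removes the entry
delete $without{XRefStm};   # key absent: no error, no warning, nothing changes

print join(',', sort keys %with), "\n";      # prints "Prev,Size"
print join(',', sort keys %without), "\n";   # prints "Prev,Size"
```

Both trailers end up with the same keys, which is why the unconditional form and the "if exists" form behave identically.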
On Tue May 14 18:15:28 2019, vadimr wrote:
> The code change is correct. *Maybe* (I now think) it would be better > to move this line into sub's caller, where $tdict is created by > copying the existing trailer dictionary, and just get rid of XRefStm > unconditionally.
Code efficiency improvement, or change of behavior?
> Of course not every PDF contains "XRefStm"! Deleting non-existent hash > elements is safe and a no-op. If (from maintenance POV?) you'd prefer > "delete something{foo} if exists something{foo}" -- OK, write it so. > For me it's more effort to read and tautology, in a sense.
My point (which I guess I didn't make clearly enough) was that if the existing PDF we're appending to did NOT already have an XRefStm, what is lost (or gained) by unconditionally NOT putting in a new one? If the existing PDF did not have one, and we add a cross reference stream but no XRefStm, what are the consequences? If it DID have one, what are the consequences of adding a second one? I just want to be clear in my mind on all these angles before I make your code change.
Subject: [rt.cpan.org #117184]
Date: Thu, 16 May 2019 15:05:50 +0300
To: bug-pdf-api2 <bug-pdf-api2 [...] rt.cpan.org>
From: vadim repin <futuramedium [...] yandex.ru>
> Code efficiency improvement, or change of behavior?

Neither, just keeping related things close together, for coherence/ease of maintenance. Decisions about content of trailer of new section, including what entries from existing trailer to keep, are made very near $tdict creation at line 341 (https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Basic/PDF/File.pm#L341). No reason to delete XRefStm 1000+ LOCs away. But it's not very important.

Reference:

"Note: Table 3.17 defines an additional entry, XRefStm, that appears _only_ in the trailer of hybrid-reference files, described in “Compatibility with Applications That Do Not Support PDF 1.5” on page 109."

1.4-compatible consumer doesn't know about "only" (nor, of course, what to do with XRefStm), but:

"The added trailer contains _all_ the entries (perhaps modified) from the previous trailer, as well as a Prev entry giving the location of the previous cross-reference section..."

PDF::API2 is a 1.5 consumer, _and_ it can't save hybrid files; therefore the "old" XRefStm must not be kept in appended sections (see the rotated page example), and no "new" one is added. That "hybrid" was a bad design idea on Adobe's side anyway, but that's too much off topic. I hope the above quotes answer your other questions. Just consider how a PDF reader follows the chain of xref sections, looking for /Prev in trailers and (since it's 1.5-compatible) for /XRefStm, too -- /XRefStm is checked first before descending further, if the object wasn't found yet.
 
The original section remains "hybrid". What we are appending is "classic". If the original was not "hybrid" but a pure "1.5 xref stream" file, then we are appending a pure "xref stream" section, and XRefStm is not required.
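The lookup order described above can be sketched with a simplified model. Everything here is hypothetical: each section is a hash with a classic table, an optional XRefStm table and a Prev link, and the original hybrid file is assumed to list the page object in its cross-reference stream. Real readers work on byte offsets and parsed trailers, not on structures like these.

```perl
use strict;
use warnings;

# Simplified model of a 1.5-aware lookup: in each section, check the classic
# table, then the XRefStm (if any), and only then follow Prev.
sub lookup {
    my ($section, $objnum) = @_;
    while ($section) {
        return $section->{table}{$objnum}
            if exists $section->{table}{$objnum};
        return $section->{xrefstm}{$objnum}
            if $section->{xrefstm} && exists $section->{xrefstm}{$objnum};
        $section = $section->{prev};
    }
    return;
}

my %stream   = (3 => 'page 1, unrotated');   # the original cross-reference stream
my $original = { table => { 1 => 'catalog' },         xrefstm => \%stream, prev => undef };
my $first    = { table => { 3 => 'page 1, rotated' }, xrefstm => \%stream, prev => $original };
my $second   = { table => { 9 => 'new page' },        xrefstm => \%stream, prev => $first };

print lookup($second, 3), "\n";   # prints "page 1, unrotated": the stale XRefStm wins

# Dropping XRefStm from the appended sections lets the Prev chain do its job.
delete $_->{xrefstm} for $first, $second;
print lookup($second, 3), "\n";   # prints "page 1, rotated"
```

In this model the first save still looks fine, because the rotated page sits in the newest classic table; the stale pointer only bites after a second update, which matches the hybrid++.pdf example.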