Skip Menu |

This queue is for tickets about the PDF-API2 CPAN distribution.

Report information
The Basics
Id: 121911
Status: resolved
Priority: 0/
Queue: PDF-API2

People
Owner: Nobody in particular
Requestors: futuramedium [...] yandex.ru
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 2.031
Fixed in: 2.034



Subject: Adding pages to existing documents doesn't work if page tree is anything but very simple (+ maybe fixed)
use strict; use warnings; use feature 'say'; use PDF::API2; my $pdf = PDF::API2-> open( 'sample.pdf' ); $pdf-> page; $pdf-> saveas( 'sample+.pdf' ); Can't call method "new_obj" on an undefined value at C:/PDL24/perl/site/lib/PDF/API2/Basic/PDF/Pages.pm line 95. Sample.pdf is attached. Its 'Pages' root has a single intermediate node, which further has 11 leaves i.e. pages. 'Pages' are blessed in line 521 of PDF::API2::Basic::PDF::File, i.e. not in constructor of PDF::API2::Basic::PDF::Pages. In line 61 of latter module there's assignment to ' outto' property. If file is opened, this property is never assigned. But in line 207 it is used to create intermediate node. Hence, the above error message. I suggest adding after line 521 in PDF::API2::Basic::PDF::File: $result-> {' outto'} = [ $self ]; Then script completes, but blank page is prepended (??). It looks like this problem is in line 213 of PDF::API2::Basic::PDF::Pages. Condition is always false, it should be changed to $pindex = -1 if ($pindex == $ppnum - 1); Then document is modified as expected. I didn't test these fixes extensively, because I don't yet understand fully internal mechanics of this module.
Subject: sample.pdf
Download sample.pdf
application/pdf 120.1k

Message body not shown because it is not plain text.

From: futuramedium [...] yandex.ru
No, still not good enough. Adding another page in above script, i.e. duplicating "$pdf-> page;" results in broken PDF file. Looks like increment statement in "for" loop in line 192 of PDF::API2::Basic::PDF::Pages reads from file and thus previously made changes to object in memory are lost. I think if object already was ' realised', then it should not be read again. Is that what this flag is for? Then the whole "realise" method of PDF::API2::Basic::PDF::Objind -- line 157 -- should be replaced with: sub realise { my ( $self ) = @_; return $self if $self-> { ' realised' } or !$self-> { ' objnum' }; $self-> { ' realised' } = 1; $self-> { ' parent' }-> read_obj( $self ) } Further, line 539 of PDF::API2::Basic::PDF::File should be deleted (commented out), because it just makes no sense (??). With these changes it seems to work, but it requires more tests from someone who understands internal logic of this distribution. And, change to line 213 in report above should be duplicated, for the same reason, in line 188 of PDF::API2::Basic::PDF::Pages.
From: futuramedium [...] yandex.ru
Sorry, I forgot about related issue: the PDF::API2::Basic::PDF::Pages::rebuild_tree is a no-op, and its return contradicts to what POD says. + Speaking about strange code fragments (unrelated issue). What does line 815 of PDF::API2::Basic::PDF::File was supposed to do? + Another unrelated issue: lines 163-181 of PDF::API2::Basic::PDF::Dict are meant to replace LZWDecode with FlateDecode, preventing second FlateDecode in Filter array. But this fragment will never work. (1) It assumes Filter is always an array. (2) Splicing is performed on array that is just not used further. I think @filters array was intended to be spliced. (3) If LZWDecode happens before FlateDecode, it's just replaced with FlateDecode. If there was another FlateDecode further in array, there'll be two of them. (4) If LZWDecode happens after FlateDecode, it's not pushed to @filters. And what is supposedly spliced later, is some other different filter. Ehm, maybe it's better be replaced with: @filters = map { $_-> val } ref $self-> { Filter }{ ' val' } eq 'ARRRAY' ? @{ $self-> { Filter }{ ' val' }} : ( $self-> { Filter }); my $hasflate = grep { /^FlateDecode$/ } @filters; @filters = map { s/^LZWDecode$/FlateDecode/ && $hasflate ? () : "PDF::API2::Basic::PDF::Filter::$_"-> new } @filters; But of course there hardly exists a PDF file with both LZWDecode and FlateDecode in Filter array.
This one is difficult to figure out, because 1) it folds many different issues into one report, and some may duplicate or have been fixed in other bug reports (almost two years since the last update by OP), and 2) reference only to line numbers are difficult to locate when the underlying code has been changed (especially a problem in PDF::Builder, which has had many updates). It would help if the OP could a) clearly split this into separate bugs (perhaps even closing this report and opening multiple others), b) confirm that the problems still exist in the latest releases of API2 and Builder, and c) give a few lines of code rather than just a line number, to assist coders in figuring out what's going on. Then it would have a chance of being resolved. Thank you!
Hi, Phil. You are right, I shouldn't have piled everything up in single thread, thus messing everything and making life difficult for maintainers. Don't know what I was thinking about, sorry. Using line numbers seemed good idea, sorry about it, too. Please disregard previous message, i.e. #1730162 completely, they are bunch of small unrelated issues. Maybe I'll address them later elsewhere. The problem in original post, however, is still present. Can I re-phrase it more clearly here, without creating a fresh ticket? I.e. can I save this thread? Comparing code (line numbers) between 2.031 and current 2.033, I'm suggesting same patches, actually. Please find test file in OP. use strict; use warnings; use feature 'say'; use PDF::API2; say $^V; say $PDF::API2::VERSION; my $pdf = PDF::API2-> open( 'sample.pdf' ); $pdf-> page; $pdf-> page; $pdf-> page; $pdf-> saveas( 'sample+.pdf' ); __END__ v5.26.0 2.033 Can't call method "new_obj" on an undefined value at C:/berrybrew/5.26.0_64_PDL/perl/site/lib/PDF/API2/Basic/PDF/Pages.pm line 98. After applying 3 patches, the program above completes successfully, and PDF file is modified as expected. On Fri Mar 22 11:49:26 2019, PMPERRY wrote: Show quoted text
> This one is difficult to figure out, because 1) it folds many > different issues into one report, and some may duplicate or have been > fixed in other bug reports (almost two years since the last update by > OP), and 2) reference only to line numbers are difficult to locate > when the underlying code has been changed (especially a problem in > PDF::Builder, which has had many updates). It would help if the OP > could a) clearly split this into separate bugs (perhaps even closing > this report and opening multiple others), b) confirm that the problems > still exist in the latest releases of API2 and Builder, and c) give a > few lines of code rather than just a line number, to assist coders in > figuring out what's going on. Then it would have a chance of being > resolved. Thank you!
Subject: File.diff.txt
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017 +++ PDF\API2\Basic\PDF\File.pm Tue Mar 26 12:20:06 2019 @@ -522,6 +522,7 @@ if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) { bless $result, $types{$result->{'Type'}->val}; + $result-> {' outto'} = [ $self ]; } # gdj: FIXME: if any of the ws chars were crs, then the whole # string might not have been read. @@ -540,7 +541,7 @@ } $result->{' parent'} = $self; weaken $result->{' parent'}; - $result->{' realised'} = 0; +#?? $result->{' realised'} = 0; # gdj: FIXME: if any of the ws chars were crs, then the whole # string might not have been read. }
Subject: Objind.diff.txt
--- PDF\API2\Basic\PDF\Objind.old Fri Jul 7 04:53:59 2017 +++ PDF\API2\Basic\PDF\Objind.pm Tue Mar 26 12:11:42 2019 @@ -154,7 +154,13 @@ =cut sub realise { - $_[0]->{' realised'} ? $_[0] : $_[0]->{' objnum'} ? $_[0]->{' parent'}->read_obj(@_) : $_[0]; + my ( $self ) = @_; + + return $self if $self-> { ' realised' } or + !$self-> { ' objnum' }; + + $self-> { ' realised' } = 1; + $self-> { ' parent' }-> read_obj( $self ) } =head2 $r->outobjdeep($fh, $pdf)
Subject: Pages.diff.txt
--- PDF\API2\Basic\PDF\Pages.old Fri Jul 7 04:53:59 2017 +++ PDF\API2\Basic\PDF\Pages.pm Tue Mar 26 12:24:25 2019 @@ -191,7 +191,7 @@ { for ($pindex = 0; $pindex < $ppnum; $pindex++) { last if ($ppages->{'Kids'}{' val'}[$pindex] eq $ppage); } - $pindex = -1 if ($pindex == $ppnum); + $pindex = -1 if ($pindex == $ppnum - 1); } $ppages->add_page_recurse($page->realise, $pindex); @@ -216,7 +216,7 @@ $ppnum = scalar $ppages->{'Kids'}->realise->elementsof; for ($pindex = 0; $pindex < $ppnum; $pindex++) { last if ($ppages->{'Kids'}{' val'}[$pindex] eq $self); } - $pindex = -1 if ($pindex == $ppnum); + $pindex = -1 if ($pindex == $ppnum - 1); $ppages->add_page_recurse($newpages, $pindex); } }
Thank you for the update to the problem and the suggested patches. I can replicate the problem in PDF::Builder. I want to carefully investigate your suggested patches and make sure I fully understand them and am happy with them before I commit them (to PDF::Builder's repository -- it's up to Steve to verify and patch on PDF::API2). It's possible I will be back here with further questions or issues, unless you want to shift the discussion over to GitHub (https://github.com/PhilterPaper/Perl-PDF-Builder/issues/66). "Can I re-phrase it more clearly here, without creating a fresh ticket? I.e. can I save this thread?" This appears to be adequate information for a fix to be made. Will this fix all the issues reported in the 121911 ticket, or should the rest of them be ignored for now (i.e., are not problems, or are quite minor)? I know that at least one of them (the line 815) was fixed long ago. Perhaps you should open one or more new tickets to deal with them, just to keep issues cleanly separated.
Hi Vadim, I looked over your changes and have some questions. I needed to omit one of them to get the example to work (to get three new pages added instead of just two). File.pm I have no idea what $result->{' outto'} = [ $self ]; does (Dictionary entry), but it seems to be necessary. Any idea what it does? $result->{' realised'} = 0; (Indirect Object entry) I don't know what it does, but it seems to be necessary (to keep it). Any idea what it does? I HAD TO LEAVE THIS STATEMENT IN, OR I ONLY GOT TWO BLANK PAGES INSTEAD OF THREE. Acrobat Reader also seemed to be quite slow (with two new pages), and I suspect it was repairing damage in the PDF. Objind.pm $self->{' realised'} = 1; looks plausible... it would make sense to set this flag in the "realise" routine. Other than adding realised=1; to realise(), your changes are the same as the original code. One thing I'm wondering about is the call to read_obj(@_). Normally $self as $_[0] is removed (via shift), but it's left in here. $self gets added again as part of the call, so it's more or less ($self, $self, any_options) passed to read_obj(). The second usage is $objind, which is a hash pointer, so it seems to work. It just looks odd. Does it seem correct to you? You just let realise() "run off the end" (as did the original code). I modified the code to explicitly return either $self or the read_obj() value. It works, anyway, and perlcritic is happy. Pages.pm Both your changes ($pindex == $ppnum -1) seem to be necessary, but I'm wondering why they work. It appears that $pnum = -1 means "add new page at end" (insert before virtual end page, not physical last page). The loop (3 lines before) goes through all the children and tries to match the given page number ($pindex). If it matched the last one, $pindex should be $ppnum - 1. If it failed to match any, $pindex should be $ppnum, shouldn't it? Your change is to set it to -1 (add after all existing pages) if it matched the LAST page? Yet... it seems to work. Any idea what the author was trying to do? Is a match on one of the existing pages guaranteed? Should the test be >= instead of ==? A thousand curses on the head of the author(s) of this code, who couldn't be bothered to add a few comment notes on what they were doing. I guess they're of the "if it was hard to write, it should be hard to read" school of programming. I see in your earlier posts you repeatedly throw up your hands and are hoping that an expert will come along and explain things. Me too. HOLD THE PHONE -- t/circular-references.t fails with these changes. I ran all the t-tests and I think this is the one failure. I haven't tried all the examples yet. perlcritic does run clean (level 5 clean, no unexpected level 4). ============================ "Line 815": according to my notes, removed from PDF::API2 and PDF::Builder July 2017. That may be after the last PDF::API2 release. "rebuild_tree" dummy stub and not matching POD: I fixed the POD (warned coders not to use it) in PDF::Builder. I guess PDF::API2 isn't fixed yet. Does this leave only the possible LZWDecode/FlateDecode issue to be dealt with? That should probably be a new ticket so this one can be closed.
By the way, my testing was with PDF::Builder, not PDF::API2, so your results may differ (hopefully not).
Your results are strange, because patches work for me, both for PDF::API2 and PDF::Buider. You either did something wrong (then check again, please) or, perhaps, you are working with some internal modified version. As to questions: 1) The issue with "$result->{' outto'} = [ $self ];" was described in the OP. File is either created anew or read from disk. In former case, the ' outto' property is initialized i.e. assigned. In latter case, it is not (maybe, rather, it is what's to be fixed?). All nodes in page tree seem to require this property to output themselves. 2) The "$result->{' realised'} = 0;" is useless because in all conditionals where this key is tested it is not compared to zero, but simply checked for truth. So it's OK to be non-existent. If you don't like it, then lift that line a few lines higher into "unless" block (as original author did in "copy" subroutine, please find). What I think is going on: Suppose some object is referenced in 2 different parts of PDF, e.g. as "6 0 R". When 1st fragment is parsed, Objind is created, but object #6 itself is not "realized" i.e. found through xref table and parsed from source. Then my program does something (call it X) that _does_ require this #6 to be "realized". OK, and it was. Later, for any reason, 2nd fragment of "6 0 R" was parsed. New Objind wasn't created, but the ' realized' flag was unset!!! Oops. Now, what if between realizing #6 and reading 2nd fragment I made any changes to #6's representation in memory? Very well. But now I do something similar (or not similar, doesn't matter) to (X) which resulted in checking if the #6 was "realized" and its subsequent "realization". Remember, flag has been unset, old version is _now_ parsed from source again, previous changes are lost. 3) The "sub realize" -- I think you misread it, the invocant of "read_obj" is not $self, so nothing strange is happening. Explicit return is right thing to do, but, though not your fault, it can lead to ugly and impossible-to-be-correct code, e.g. subroutine "val" in same module. Looks suspicious in original, too. (As you see, I can't help but digress.) 4) Pages.pm -- About this one I was absolutely wrong, sorry I have mislead anyone, my patch was invalid, new patch is attached. Thank you that you made me try to understand that code yet again. Two separate lines of code in question (which were both modified in invalid patch) look deceptively similar -- the same! But they are very different. As you see, 1st one is now left alone, because it does as described in POD: if position to insert new page is too high, then insert it last. OTOH, in 2nd fragment (the only present in suggested new patch) the loop directly above is guaranteed to trigger its "last", and it is only reachable if (1) we are adding to intermediate (not root) node, (2) it's being over-filled (so, new container is added to parent), (3) new container is, strictly, either prepended ($index == 0) or appended ($index == -1). So, the only thing I need to do here is to check $index and increase position to append (or otherwise it would be "prepending"). 5) As to issues with circular refs -- can't say anything for now. My "in-production" long-running script (though it uses modified CAM::PDF, I suspect it also leaks) starts asynchronously a new perl process to process each PDF file because I gave up on tracking memory, so it's reclaimed by OS, and with just thousands PDF files per day this setup OK. 6) Issue with LZWDecode/FlateDecode I'd rather revoke, its probability in real world being almost zero. Can't agree on "curses on authors' :)
Subject: Pages.diff.new.txt
--- PDF\API2\Basic\PDF\Pages.old Fri Jul 7 04:53:59 2017 +++ PDF\API2\Basic\PDF\Pages.pm Sat Mar 30 01:14:48 2019 @@ -216,7 +216,7 @@ $ppnum = scalar $ppages->{'Kids'}->realise->elementsof; for ($pindex = 0; $pindex < $ppnum; $pindex++) { last if ($ppages->{'Kids'}{' val'}[$pindex] eq $self); } - $pindex = -1 if ($pindex == $ppnum); + $pindex ++ if $index == -1; $ppages->add_page_recurse($newpages, $pindex); } }
Thank you for the response. It may take me a few days to digest it and try out the changes -- and I also have to finish my income tax returns! Re #3 (sub realise), my point was that a method call adds a reference to its object (?) to the beginning of the parameter list, and usually it's immediately pulled out as "$self". Leaving the original @_ alone and passing it in, you now have two references (plus options stuff) in the call to read_obj. The first I think is the object containing method read_obj, and the second is $self, the object containing the caller to read_obj. I really wish the author had been more straightforward and done something like: my $self = shift; # just option stuff left in @_ ... use $self instead of $_[0] in calls ... ... ->read_obj($self, @_) ... That should do the same thing as the current code, but be a lot clearer (at least to me). I really despise coders who want to show off how smart they are by collapsing an entire function into one impossible-to-read line! Re #6 (compression list), OK by me to drop it for the time being. It's still archived if someone wants to revisit it in the future (in the futuramedium?). So, once the primary fixes in this bug are completed, that's everything and the bug can be closed? I will try out your latest change soon, and see what's going on. I'm still worried about having to drop the realised=0 line in order to get three new pages rather than two -- that may point to other changes that are interfering. Are you testing on PDF::Builder 3.013 (CPAN release) or 3.014 (GitHub current release)? Are you running the t-test suite on your changes?
Yes, once patches to 3 files are tested and accepted, I'm for closing this bug. I downloaded the release from GitHub, and, again, the "$result->{' realised'} = 0;" line should be removed, or otherwise invalid PDF file is produced. I'm attaching 3 modified files (they are for PDF::Builder distribution), if they are not working for you, then it's some mysticism. The issue with failing test for circular references is easily fixed, see new patch attached (and that's for PDF::API2 distribution). The new line is copied from Pages constructor -- this constructor is not called if PDF is "opened" from file.
Subject: File.diff.new.txt
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017 +++ PDF\API2\Basic\PDF\File.pm Sun Mar 31 15:41:05 2019 @@ -522,6 +522,8 @@ if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) { bless $result, $types{$result->{'Type'}->val}; + $result-> {' outto'} = [ $self ]; + weaken $_ for @{$result->{' outto'}}; } # gdj: FIXME: if any of the ws chars were crs, then the whole # string might not have been read. @@ -540,7 +542,7 @@ } $result->{' parent'} = $self; weaken $result->{' parent'}; - $result->{' realised'} = 0; +#?? $result->{' realised'} = 0; # gdj: FIXME: if any of the ws chars were crs, then the whole # string might not have been read. }
Subject: PDF.zip
Download PDF.zip
application/x-zip-compressed 21k

Message body not shown because it is not plain text.

Vadim, Your latest changes (including adding a "weaken" statement) seem to be working. Thank you! Not only does your test case work, but it also passes all the t-tests, examples, contributions, and perlcritic. Are you still doing any testing? I will wait a couple days or so, and if you have not made any further changes, commit these three files to GitHub and close the ticket (for PDF::Builder). It's up to Steve what to do for PDF::API2 (consider making a Pull Request?). I will probably have PDF::Builder 3.014 out on CPAN by the end of April.
FYI, this ticket's (121911) changes have been pushed to GitHub and will be in PDF::Builder 3.014 later this month. Thanks again, Vadim, for the hard work. Please consider making a Pull Request to Steve for PDF::Builder.
On Fri Apr 05 09:07:40 2019, PMPERRY wrote: Show quoted text
> FYI, this ticket's (121911) changes have been pushed to GitHub and > will be in PDF::Builder 3.014 later this month. Thanks again, Vadim, > for the hard work. Please consider making a Pull Request to Steve for > PDF::Builder.
...a Pull Request to Steve for PDF::API2.
This was a long discussion, not all of which related to PDF::API2, so for reference I'm only replying to the first two comments, though I've read the rest. Thanks for your test files and debugging efforts. I just spent most of the day learning, updating, and rewriting the relevant code to improve its clarity, remove unnecessary bits, and subsequently fix the issues you raised. Everything should be working at this point at HEAD. If it isn't, you should at least have more descriptive variable names to work with. :-) In addition, some of the more complicated bits (like ' outto') have been eliminated altogether.