
This queue is for tickets about the PDF-API2 CPAN distribution.

Report information
The Basics
Id: 122962
Status: resolved
Priority: 0/
Queue: PDF-API2

People
Owner: Nobody in particular
Requestors: andy [...] andybev.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Reusing PDF::API2 objects for different PDFs
Date: Tue, 5 Sep 2017 12:55:47 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: Andrew Beverley <andy [...] andybev.com>
Firstly, thanks for a great module.

I am using it to generate a PDF with many pages. Producing the whole PDF as one object in one go uses huge amounts of memory, so I now produce each page one-by-one and then concatenate them afterwards using CAM::PDF.

This works well, in that significantly less memory is used, but it is slow, as I am creating a new PDF::API2 object each time.

From the small amount of profiling I have done, a lot of time seems to be spent adding the TTF fonts. I wondered, is there some way to reuse the PDF::API2 object (or just the fonts) and create a fresh page each time?

I have tried various hacks (I won't detail them all here), such as reusing the ttfont object in multiple PDFs, deleting the pages from the object, and so on, but I couldn't get any to work.

Do you have any suggestions please? If you do, and it involves some coding, I would be happy to investigate providing a patch.

Thanks,

Andy
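For readers following along, the page-by-page workaround described above might look roughly like the sketch below. It is not the requestor's actual code: the helper names `write_single_page` and `build_document`, the file names, and the fallback to a core font when no TTF path is given are all assumptions made for illustration. It shows why the approach is slow: the font is loaded again for every single-page document.

```perl
use strict;
use warnings;
use PDF::API2;
use CAM::PDF;
use File::Temp qw(tempdir);

# Hypothetical helper: build one single-page PDF and return its path.
sub write_single_page {
    my ($dir, $n, $ttf_path) = @_;

    my $pdf = PDF::API2->new;

    # The font is loaded afresh for every page; with a TTF file this
    # is the step the profiling above points at. Falling back to a
    # core font (an assumption) keeps the sketch runnable without one.
    my $font = defined $ttf_path
        ? $pdf->ttfont($ttf_path)
        : $pdf->corefont('Helvetica');

    my $page = $pdf->page;
    my $text = $page->text;
    $text->font($font, 12);
    $text->translate(100, 700);
    $text->text("Page $n");

    my $path = "$dir/page-$n.pdf";
    $pdf->saveas($path);
    return $path;
}

# Hypothetical helper: generate $pages single-page PDFs, then
# concatenate them into $out with CAM::PDF.
sub build_document {
    my ($out, $pages, $ttf_path) = @_;

    my $dir   = tempdir(CLEANUP => 1);
    my @parts = map { write_single_page($dir, $_, $ttf_path) } 1 .. $pages;

    my $combined = CAM::PDF->new(shift @parts)
        or die "Cannot open first part: $CAM::PDF::errstr";
    for my $part (@parts) {
        my $next = CAM::PDF->new($part)
            or die "Cannot open $part: $CAM::PDF::errstr";
        $combined->appendPDF($next);
    }
    $combined->cleanoutput($out);
    return $out;
}
```

Calling `build_document('combined.pdf', 100, '/path/to/font.ttf')` (the font path is a placeholder) would produce one output file while only ever holding a single page's PDF::API2 object in memory at a time.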
On Tue Sep 05 08:04:05 2017, abeverley wrote:
> I am using it to generate a PDF with many pages. Producing the whole
> PDF as one object in one go uses huge amounts of memory, so I now
> produce each page one-by-one and then concatenate them afterwards using
> CAM::PDF.
>
> This works well, in that significantly less memory is used, but it is
> slow, as I am creating a new PDF::API2 object each time.
>
> From the small amount of profiling I have done, a lot of time seems to
> be spent adding the TTF fonts. I wondered, is there some way to reuse
> the PDF::API2 object (or just the fonts) and create a fresh page each
> time?
>
> I have tried various hacks (I won't detail them all here), such as
> reusing the ttfont object in multiple PDFs, deleting the pages from the
> object, and so on, but I couldn't get any to work.
>
> Do you have any suggestions please? If you do, and it involves some
> coding, I would be happy to investigate providing a patch.
There are probably some ways to speed up that operation, but depending on what kind of coding you're up for trying, it might be possible to solve your original problem instead.

Take a look at my comments on ticket 113516. Currently, when PDF::API2 opens a file, it reads the whole thing into memory, but that wasn't always the case, and the code that PDF::API2 is built on top of doesn't require that everything be loaded in memory either.

It's theoretically possible for you to create a number of pages, write those out to disk, free up the memory, and repeat, without closing and reopening the file. If you want to start down that trail, look at PDF::API2->finishobjects() and follow the path for details about writing out a file in chunks.

Freeing the memory without closing the file may be trickier (I haven't looked into that yet). I'm guessing it'll involve the release_obj() call in PDF::API2::Basic::PDF::File -- if I'm reading the code correctly, that will remove it from the various caches, but without actually removing it from the PDF. The release() call will almost definitely free the memory, but I think that's only supposed to be called when you're done with the file.

As an aside, several comments in the code mention circular references. As of a release or two ago, those should no longer exist (if you find any, please give me a test case), so that should simplify things.

If you get to a point where you can call finishobjects() more than once and get a working file, but are still running out of memory, let me know (preferably with sample code) and we can dive into that problem more deeply.

If that ends up being too complicated and you'd rather keep trying to speed up the ttfont calls, it should be possible to reuse the time-consuming part of that object's creation. It may be as simple as calling $new_pdf->{'pdf'}->new_obj($font_object_from_old_pdf) instead of $new_pdf->ttfont(...). That definitely wouldn't qualify as intended/supported behavior, but it might work.
-- Steve
Subject: Re: [rt.cpan.org #122962] Reusing PDF::API2 objects for different PDFs
Date: Tue, 12 Sep 2017 11:52:24 +0100
To: bug-PDF-API2 [...] rt.cpan.org
From: Andrew Beverley <andy [...] andybev.com>
Hi Steve, thanks for the quick and comprehensive reply.

I've spent a while trying your suggestions (comments below), but am unfortunately no further forward. At this point I should say that this is more of a nice to have than an essential requirement, so if there are no quick-wins for either of us then I will be happy for you to close the ticket.

Have a look at the below if you get the time anyway, and let me know what you think.
> Take a look at my comments on ticket 113516. Currently, when
> PDF::API2 opens a file, it reads the whole thing into memory, but
> that wasn't always the case, and the code that PDF::API2 is built on
> top of doesn't require that everything be loaded in memory either.
Thanks. I don't *think* this particular information helps, as I am writing out, not reading.
> It's theoretically possible for you to create a number of pages,
> write those out to disk, free up the memory, and repeat, without
> closing and reopening the file. If you want to start down that
> trail, look at PDF::API2->finishobjects() and follow the path for
> details about writing out a file in chunks.
>
> Freeing the memory without closing the file may be trickier (I
> haven't looked into that yet). I'm guessing it'll involve the
> release_obj() call in PDF::API2::Basic::PDF::File -- if I'm reading
> the code correctly, that will remove it from the various caches,
> but without actually removing it from the PDF. The release() call
> will almost definitely free the memory, but I think that's only
> supposed to be called when you're done with the file.
I've spent a while playing around with the above. I seem to be able to write out a PDF in chunks, but whenever I try to do so along with calls to free the memory, I run into problems. The finishobjects() in itself doesn't seem to make any difference to memory use, and whenever I try it with something like a save or release_obj then I get:

    Can't call method "new_obj" on an undefined value at /usr/share/perl5/PDF/API2/Basic/PDF/Pages.pm line 92
> If you get to a point where you can call finishobjects() more than
> once and get a working file, but are still running out of memory,
> let me know (preferably with sample code) and we can dive into that
> problem more deeply.
I should have said before that I am using PDF::TextBlock. I don't think this affects the principle though, as I run into similar problems if I remove it and write lots of text using raw calls. Anyway, FWIW, here is a MWE:

    use PDF::API2;
    use PDF::TextBlock;

    my $pdf = PDF::API2->new(-file => 'mypdf.pdf');

    for my $count (1..100) {
        my $page = $pdf->page;
        my $tb = PDF::TextBlock->new({
            pdf  => $pdf,
            page => $page,
            x    => 100,
            y    => 100,
        });
        for my $count2 (1..20) {
            $tb->text("Text $count2");
            $tb->apply;
        }
        # Write out the objects created so far
        $pdf->finishobjects;
    }

    $pdf->save;
> If that ends up being too complicated and you'd rather keep trying
> to speed up the ttfont calls, it should be possible to reuse the
> time-consuming part of that object's creation. It may be as simple
> as calling $new_pdf->{'pdf'}->new_obj($font_object_from_old_pdf)
> instead of $new_pdf->ttfont(...). That definitely wouldn't qualify
> as intended/supported behavior, but it might work.
Given the relatively modest potential gains, I've decided this is probably best avoided!

Thanks again, and please do feel free to close this ticket if it all looks like too much hassle.

Andy