Bug #105579 for PDF-API2: Given same input, different (byte- and sizewise) PDF files are created

Tue Jun 30 14:03:54 2015 DMITRI [...] cpan.org - Ticket created

Subject:

Given same input, different (byte- and sizewise) PDF files are created

I am using PDF::API2 to create very simple, text-only PDF files. I noticed that when I use the same text as input, PDF::API2 produces different output -- the files have different sizes. I am attaching a program to demonstrate. Run it three times: chances are, you will end up with three PDF files of different sizes. I used ImageMagick's convert utility to convert these PDFs to GIFs: the GIFs are identical. This is good, but I think the source of randomness should be removed, just for one's sanity's sake.

Subject:

random-size-pl.txt

use strict;
use warnings;

use Getopt::Long;
use PDF::API2;

GetOptions(
    "n-pages=i" => \(my $n_pages = 100),
);

my $text = <<'TEXT';
Messages consist of lines of text. No special provisions are
made for encoding drawings, facsimile, speech, or structured
text. No significant consideration has been given to questions
of data compression or to transmission and storage efficiency,
and the standard tends to be free with the number of bits con-
sumed. For example, field names are specified as free text,
rather than special terse codes.
A general "memo" framework is used. That is, a message con-
sists of some information in a rigid format, followed by the main
part of the message, with a format that is not specified in this
document. The syntax of several fields of the rigidly-formated
("headers") section is defined in this specification; some of
these fields must be included in all messages.
The syntax that distinguishes between header fields is
specified separately from the internal syntax for particular
fields. This separation is intended to allow simple parsers to
operate on the general structure of messages, without concern
for the detailed structure of individual header fields. Appendix
B is provided to facilitate construction of these parsers.
In addition to the fields specified in this document, it is
expected that other fields will gain common use. As necessary,
the specifications for these "extension-fields" will be published
through the same mechanism used to publish this document. Users
may also wish to extend the set of fields that they use privately.
Such "user-defined fields" are permitted.
The framework severely constrains document tone and appear-
ance and is primarily useful for most intra-organization communi-
cations and well-structured inter-organization communication.
It also can be used for some types of inter-process communica-
tion, such as simple file transfer and remote job entry.
A more robust framework might allow for multi-font, multi-color, multi-
dimension encoding of information. A less robust one, as is
present in most single-machine message systems, would more
severely constrain the ability to add fields and the decision to
include specific fields. In contrast with paper-based communica-
tion, it is interesting to note that the RECEIVER of a message can
exercise an extraordinary amount of control over the message's
appearance. The amount of actual control available to message
receivers is contingent upon the capabilities of their individual
message systems.
TEXT

my $pdf = PDF::API2->new;
my $font = $pdf->corefont('Courier');

for (my $n = 0; $n < $n_pages; ++$n) {
    my @lines = split /\n/, $text;

    # Change the text up a little bit (move a line to the first
    # position), so that I can tell that there is more than one
    # page in a GIF when I convert it. (I use ImageMagick's
    # convert utility to convert PDFs to GIF to do pixel-by-pixel
    # comparison).
    my $pick_a_line = splice @lines, $n % @lines, 1;
    my $page_text = join "\n", $pick_a_line, @lines;

    my $page = $pdf->page;
    $page->mediabox(612, 792);

    my $content = $page->text;
    $content->translate(0, 780);
    $content->font($font, 12);
    $content->lead(12);
    $content->section($page_text, 612, 780);
}

$pdf->saveas($ARGV[0]);

Tue Jun 30 14:41:31 2015 steve [...] deefs.net - Correspondence added

This is normal. Dictionaries (hashes) won't always output in the same order, and PDF::API2 uses timestamps to generate IDs. Both of these can impact compression, resulting in PDFs with different sizes even though they're generated by the same script.
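
As a rough illustration of the first effect described above, the sketch below serializes one small dictionary in two key orders and compresses each result. The `%dict` contents and the `serialize` helper are made up for this example and are not PDF::API2's internals; Compress::Zlib stands in for whatever compression is applied to the real output. The two serializations always hold the same number of bytes before compression, but the compressed lengths are free to differ from one ordering to another.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);
use Compress::Zlib qw(compress);

# Hypothetical stand-in for a PDF dictionary (not PDF::API2's real structure).
my %dict = (
    Type      => '/Page',
    MediaBox  => '[0 0 612 792]',
    Parent    => '2 0 R',
    Contents  => '5 0 R',
    Resources => '4 0 R',
);

# Serialize the dictionary, emitting keys in exactly the order given.
sub serialize {
    my ($dict, @keys) = @_;
    return '<< ' . join(' ', map { "/$_ $dict->{$_}" } @keys) . ' >>';
}

# Arbitrary key order: the bytes can change from run to run.
my $shuffled = serialize(\%dict, shuffle keys %dict);

# Sorted key order: the same bytes on every run.
my $sorted = serialize(\%dict, sort keys %dict);

printf "shuffled: %d bytes raw, %d bytes compressed\n",
    length($shuffled), length(compress($shuffled));
printf "sorted:   %d bytes raw, %d bytes compressed\n",
    length($sorted), length(compress($sorted));
```

Sorting the keys before writing is one way an order-dependent difference like this could be removed; whether that fits PDF::API2's design is a separate question.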

Tue Jun 30 14:41:31 2015 The RT System itself - Status changed from 'new' to 'open'

Tue Jun 30 14:41:32 2015 steve [...] deefs.net - Status changed from 'open' to 'rejected'

Tue Jun 30 15:39:44 2015 DMITRI [...] cpan.org - Correspondence added

This is interesting. I see a benefit to having deterministic behavior in this regard: one could use size and contents (minus the timestamp and other fixed-length stuff in PDF header and footer) to check for regression.
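
A minimal sketch of such a regression check, assuming the only run-to-run differences are the date strings and the trailer `/ID` pair. The `normalized_digest` helper and the two sample fragments are hypothetical; real PDF::API2 output may lay these fields out differently, and the check cannot succeed while dictionary order still varies.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical helper: blank out fields expected to vary between runs,
# then hash what is left. The result is for comparison only; the
# modified bytes are no longer a valid PDF.
sub normalized_digest {
    my ($bytes) = @_;

    # Date strings such as /CreationDate (D:20150630140354).
    $bytes =~ s{/(CreationDate|ModDate)\s*\(D:[^)]*\)}{/$1 ()}g;

    # The file identifier pair, written as two hex strings.
    $bytes =~ s{/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]}{/ID [<> <>]}g;

    return sha256_hex($bytes);
}

# Two made-up fragments that differ only in timestamp and ID.
my $run_a = "<< /CreationDate (D:20150630140354) >> trailer << /ID [<AB12> <AB12>] >>";
my $run_b = "<< /CreationDate (D:20150630153944) >> trailer << /ID [<CD34> <CD34>] >>";

print normalized_digest($run_a) eq normalized_digest($run_b)
    ? "same after normalization\n"
    : "different after normalization\n";
```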

Wed Jul 01 11:42:35 2015 philperry [...] hvc.rr.com - Correspondence added

Subject:	[rt.cpan.org #105579]
Date:	Wed, 1 Jul 2015 11:43:05 -0400
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Phil M Perry <philperry [...] hvc.rr.com>

While it's not a critical problem, I agree that it would be nice for output to be more deterministic, so that integrity checks such as DMITRI proposes could be easily made. Let's think about /why/ IDs are generated with timestamps rather than some deterministic counter, and whether switching to a counter would be sufficient to make multiple PDF document runs (from the same source) essentially the same (except for header timestamps). It is possible that the original author was simply lazy, and picked a timestamp (hopefully microsecond precision) for a unique ID, rather than going through the effort of tracking some global counter instead. What is "normal" practice for PDF generation in Acrobat and other packages?
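
To make the comparison concrete, here is a small sketch of the two approaches. Both `timestamp_id` and `make_counter_id` are hypothetical names invented for this example; neither is taken from PDF::API2, and the ID formats are assumptions.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical: an ID derived from the clock (microsecond resolution).
# Unique in practice, but different on every run.
sub timestamp_id {
    my ($prefix) = @_;
    return sprintf '%s%.0f', $prefix, time() * 1_000_000;
}

# Hypothetical: a counter-based generator. Each generator keeps one
# counter per prefix, so the same calls always yield the same IDs.
sub make_counter_id {
    my %next;
    return sub {
        my ($prefix) = @_;
        return $prefix . ++$next{$prefix};
    };
}

my $next_id = make_counter_id();
print join(' ', map { $next_id->('F') } 1 .. 3), "\n";    # F1 F2 F3
print timestamp_id('F'), "\n";                            # varies per run
```

Creating one generator per document, rather than one global counter, would keep IDs reproducible even when several documents are built in the same process.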

Wed Mar 16 11:02:07 2016 philperry [...] hvc.rr.com - Correspondence added

Subject:	[rt.cpan.org #105579]
Date:	Wed, 16 Mar 2016 11:02:09 -0400
To:	bug-PDF-API2 [...] rt.cpan.org
From:	Phil M Perry <philperry [...] hvc.rr.com>

See also #113084. It sounds like the same issue.