Bug #35555 for CAM-PDF: patch to improve node traversal

Wed Apr 30 17:36:27 2008 rt.cpan.org [...] darkart.com - Ticket created

Subject:	patch to improve node traversal
Date:	Wed, 30 Apr 2008 21:36:00 +0000
To:	bug-CAM-PDF [...] rt.cpan.org
From:	Eric Hall <rt.cpan.org [...] darkart.com>

Hello- Attached is a patch against 1.13 which improved PDF node traversal for us. -eric

Message body is not shown because sender requested not to inline it.

Wed Apr 30 21:02:59 2008 chris [...] chrisdolan.net - Correspondence added

Subject:	Re: [rt.cpan.org #35555] patch to improve node traversal
Date:	Wed, 30 Apr 2008 20:00:51 -0500
To:	RT <bug-CAM-PDF [...] rt.cpan.org>
From:	Chris Dolan <chris [...] chrisdolan.net>

Hi Eric, There are several problems with this patch as written. I've included criticisms below, which I intend constructively. Clearly you've put some work into this, but I cannot accept it as is. Furthermore, it would help if you explained your patch more thoroughly. In particular, you say that the patch improved traversal, but you don't say whether it improves performance or correctness. Is there a problem with the existing traverse routine? Chris On Apr 30, 2008, at 4:36 PM, Eric Hall via RT wrote: Show quoted text

> > Wed Apr 30 17:36:27 2008: Request 35555 was acted upon. > Transaction: Ticket created by rt.cpan.org@darkart.com > Queue: CAM-PDF > Subject: patch to improve node traversal > Broken in: (no value) > Severity: (no value) > Owner: Nobody > Requestors: rt.cpan.org@darkart.com > Status: new > Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > > > > Hello- > Attached is a patch against 1.13 which improved PDF node > traversal for us. > > > -eric > > > > --- lib/CAM/PDF.pm.orig 2007-11-28 21:42:46.000000000 -0800 > +++ lib/CAM/PDF.pm 2008-03-27 20:06:54.000000000 -0700 > @@ -566,6 +566,9 @@ > $CAM::PDF::errstr = "Could not decipher xref row:\n" . > $self->trimstr($row); > return; > } > + if ((0 == $1) && (0 == $2)) { > + next; > + } > if ($type eq 'n') > { > $index->{$objnum} = $indexnum;

This addition is dead code. The case of a zero indexnum is already handled in the conditional above this. Show quoted text

> @@ -3785,6 +3788,7 @@ > my $otherdoc = shift; > my $otherkey = shift; > my $follow = shift; > + my %traversedNodes = (); > > # careful! 'undef' means something different from '0' here! > if (!defined $follow) > @@ -3838,10 +3842,10 @@ > my $newkey = $self->appendObject($otherdoc, $oldrefkey, 0); > $newrefkeys{$oldrefkey} = $newkey; > } > - $self->changeRefKeys($objnode, \%newrefkeys); > + $self->changeRefKeys($objnode, \%newrefkeys, \%traversedNodes); > for my $newkey (values %newrefkeys) > { > - $self->changeRefKeys($self->dereference($newkey), \% > newrefkeys); > + $self->changeRefKeys($self->dereference($newkey), \% > newrefkeys, \%traversedNodes); > } > } > return (%newrefkeys);

Is the point of this to remember what you saw across calls to changeRefKeys? If yes, than I can see where this would probably be a performance boost. Show quoted text

> @@ -5040,7 +5044,7 @@ > if ($stream) > { > $stream = $self->{crypt}->encrypt($self, $stream, $objnode-> > {objnum}, $objnode->{gennum}); > - $str .= "\nstream\n" . $stream . 'endstream'; > + $str .= "\nstream\n" . $stream . '\nendstream'; > } > return "obj\n$str\nendobj\n"; > }

This is wrong. '\n' is equivalent to 'n' and I really don't think you want to append an 'n' to each stream! Show quoted text

> @@ -5072,6 +5076,7 @@ > my $objnode = shift; > my $func = shift; > my $funcdata = shift; > + my $funcResult = undef; > > my $traversed = {}; > my @stack = ($objnode); > @@ -5080,7 +5085,8 @@ > while ($i < @stack) > { > my $objnode = $stack[$i++]; > - $self->$func($objnode, $funcdata); > + $funcResult = undef; > + $funcResult = $self->$func($objnode, $funcdata); > > my $type = $objnode->{type}; > my $val = $objnode->{value};

The above change does nothing. You aren't even using the value you store in $funcResult. Show quoted text

> @@ -5108,6 +5114,62 @@ > return; > } > > +sub recurseTraverse { > + my ($self, $deref, $objnode, $traversedRef, $func, $funcdata, > $objnum) = @_; > + > + my $type; > + my $val; > + my @nodes = (); > + my $node = undef; > + my $newObjNum = undef; > + > + if ((!defined($objnode)) || (! ref $objnode )) { > + return; > + } > + > + if (defined($objnum)) { > + $newObjNum = $objnum; > + } > + > + $type = $objnode->{type}; > + $val = $objnode->{value}; > + > + if (exists $objnode->{objnum}) { > + $newObjNum = $objnode->{objnum}; > + } > + > + if ((defined($newObjNum)) && > + (exists ($traversedRef->{$newObjNum}))) { > + return; > + } else { > + $self->$func($objnode, $funcdata); > + } > + > + if ($type eq 'dictionary') { > + push (@nodes, values %{$val}); > + } elsif ($type eq 'array') { > + push (@nodes, @{$val}); > + } elsif ($type eq 'object') { > + push (@nodes, $val); > + } elsif ($type eq 'reference') { > + if ($deref) { > + push (@nodes, $self->dereference($val)); > + } else { > + return; > + } > + } > + > + for $node (@nodes) { > + recurseTraverse($self, $deref, $node, $traversedRef, $func, > $funcdata, $newObjNum); > + } > + > + if (($type eq 'object')) { > + $traversedRef->{$newObjNum} = 1; > + } > + > + return; > +} > + > # decodeObject and decodeAll differ from each other like this: > # > # decodeObject JUST decodes a single stream directly below the > object

You neglected to document recurseTraverse, so I'm having some trouble understanding the point of this method. It looks like a slower version of my traverse() method, which deliberately uses an explicit stack instead of the program stack for a big performance boost. I don't understand why you check for under or scalar $objnode. Those cases should not even be possible, I believe. Show quoted text

> @@ -5538,10 +5600,11 @@ > my $self = shift; > my $objnode = shift; > my $newrefkeys = shift; > + my $traversedRef = shift; > > my $follow = shift || 0; # almost always false > > - $self->traverse($follow, $objnode, \&_changeRefKeysCB, > $newrefkeys); > + $self->recurseTraverse($follow, $objnode, $traversedRef, > \&_changeRefKeysCB, $newrefkeys, 0); > return; > } >

Show quoted text

> > @@ -5558,9 +5621,10 @@ > if (exists $newrefkeys->{$objnode->{value}}) > { > $objnode->{value} = $newrefkeys->{$objnode->{value}}; > + return 1; > } > } > - return; > + return 0; > } > > =item $doc->abbrevInlineImage($object)

The return value of this function is never used. Why is it returning 1 or 0?

Wed Apr 30 21:03:01 2008 The RT System itself - Status changed from 'new' to 'open'

Wed Apr 30 21:15:15 2008 rt.cpan.org [...] darkart.com - Correspondence added

Subject:	Re: [rt.cpan.org #35555] patch to improve node traversal
Date:	Thu, 1 May 2008 01:14:59 +0000
To:	Chris Dolan via RT <bug-CAM-PDF [...] rt.cpan.org>
From:	Eric Hall <rt.cpan.org [...] darkart.com>

On Wed, Apr 30, 2008 at 09:03:01PM -0400, Chris Dolan via RT wrote: Show quoted text

> > <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > > > Hi Eric, > > There are several problems with this patch as written. I've included > criticisms below, which I intend constructively. Clearly you've put > some work into this, but I cannot accept it as is.

Sure, it was done by one of the people who works for me, and (unfortunately after I sent it out) he did express that he wanted to fix it up. Show quoted text

> > Furthermore, it would help if you explained your patch more > thoroughly. In particular, you say that the patch improved > traversal, but you don't say whether it improves performance or > correctness. Is there a problem with the existing traverse routine?

Yes, there was/is a problem, which I'll have to ask what the particulars were as I'm not remembering right now. Sorry, I should have gotten that info before sending it (I wanted to send it out to you while I was working on integrating it with our stuff today, lest I forget). I'll forward your comments and we'll get back to you. I have one right off, about the '\n' added around each stream, that adds a newline, not a 'n'. -eric

Wed Apr 30 22:05:10 2008 chris [...] chrisdolan.net - Correspondence added

Subject:	Re: [rt.cpan.org #35555] patch to improve node traversal
Date:	Wed, 30 Apr 2008 21:04:54 -0500
To:	bug-CAM-PDF [...] rt.cpan.org
From:	Chris Dolan <chris [...] chrisdolan.net>

Thanks for the update. On Apr 30, 2008, at 8:15 PM, Eric Hall via RT wrote: Show quoted text

> I have one right off, about the '\n' added around each stream, > that adds a newline, not a 'n'.

Sorry to disagree, but you're mistaken. The code we're discussing used single quotes and newlines don't interpolate in single quotes. Furthermore, \n is different on Windows vs. Unix. Streams have particularly finicky syntax, so sticking a \n in there would only work in the most lenient PDF parsers. Chris

Thu May 01 15:29:48 2008 rt.cpan.org [...] darkart.com - Correspondence added

Subject:	Re: [rt.cpan.org #35555] patch to improve node traversal
Date:	Thu, 1 May 2008 19:29:22 +0000
To:	Chris Dolan via RT <bug-CAM-PDF [...] rt.cpan.org>
From:	Eric Hall <rt.cpan.org [...] darkart.com>

On Wed, Apr 30, 2008 at 10:05:13PM -0400, Chris Dolan via RT wrote: Show quoted text

> > <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > > > Thanks for the update. > > On Apr 30, 2008, at 8:15 PM, Eric Hall via RT wrote:

> > I have one right off, about the '\n' added around each stream, > > that adds a newline, not a 'n'.

> > Sorry to disagree, but you're mistaken. The code we're discussing > used single quotes and newlines don't interpolate in single quotes. > Furthermore, \n is different on Windows vs. Unix. Streams have > particularly finicky syntax, so sticking a \n in there would only > work in the most lenient PDF parsers. >

Chris- Yes, inside of the single quotes it will be left as \n, however empirical testing reveals that it is interpolated elsewhere as we get a newline in the final output. As well, every PDF we've looked at (generated from many different systems) uses 0x0a for newlines, and the PDF references I've read indicate that both CR, LF, and CRLF are all considered as EOL markers. As well, the references indicate that the PDF data should be ...consumed directly, without translation between native character sets, end-of-line representations, or other conventions used on various platforms. (PDFReference 1.3, ISBN 0-201-61588-6) Worded slightly differently in the PDF 1.7 reference: Such a file is not portable to environments that impose reserved character codes, maximum line lengths, end-of-line conventions, or other restrictions. (both are in section 3.1) It doesn't seem that the EOL marker is required in all places, but the 'Lexical Conventions' indicate that the EOL marker before tokens is the recommended way to format the PDF file. -eric

Thu May 01 15:35:57 2008 rt.cpan.org [...] darkart.com - Correspondence added

Subject:	Re: [rt.cpan.org #35555] patch to improve node traversal
Date:	Thu, 1 May 2008 19:35:32 +0000
To:	Chris Dolan via RT <bug-CAM-PDF [...] rt.cpan.org>
From:	Eric Hall <rt.cpan.org [...] darkart.com>

On Wed, Apr 30, 2008 at 09:03:01PM -0400, Chris Dolan via RT wrote: Show quoted text

> > <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > > > Hi Eric, > > There are several problems with this patch as written. I've included > criticisms below, which I intend constructively. Clearly you've put > some work into this, but I cannot accept it as is. > > Furthermore, it would help if you explained your patch more > thoroughly. In particular, you say that the patch improved > traversal, but you don't say whether it improves performance or > correctness. Is there a problem with the existing traverse routine?

Comments from the actual developer in-line below, wrapped in <dev> </dev> tags. For the 'dead' and/or 'useless' code segments, those are entirely my fault, I should have made sure we had a good/final patch to send upstream before doing so, sorry for that. -eric Show quoted text

> > Chris > > On Apr 30, 2008, at 4:36 PM, Eric Hall via RT wrote:

> > > > Wed Apr 30 17:36:27 2008: Request 35555 was acted upon. > > Transaction: Ticket created by rt.cpan.org@darkart.com > > Queue: CAM-PDF > > Subject: patch to improve node traversal > > Broken in: (no value) > > Severity: (no value) > > Owner: Nobody > > Requestors: rt.cpan.org@darkart.com > > Status: new > > Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > > > > > > > Hello- > > Attached is a patch against 1.13 which improved PDF node > > traversal for us. > > > > > > -eric > > > > > > > > --- lib/CAM/PDF.pm.orig 2007-11-28 21:42:46.000000000 -0800 > > +++ lib/CAM/PDF.pm 2008-03-27 20:06:54.000000000 -0700 > > @@ -566,6 +566,9 @@ > > $CAM::PDF::errstr = "Could not decipher xref row:\n" . > > $self->trimstr($row); > > return; > > } > > + if ((0 == $1) && (0 == $2)) { > > + next; > > + } > > if ($type eq 'n') > > { > > $index->{$objnum} = $indexnum;

> > > This addition is dead code. The case of a zero indexnum is already > handled in the conditional above this.

<dev> That's something that changed between 1.07 and 1.12...so it is dead code now. In v1.07 it certainly was NOT dead code. However, I don't believe we should quit the function when we find a zero index number...we just need to skip that row. Killing the whole process for that doesn't seem necessary for an otherwise completely parseable PDF. </dev> Show quoted text

> >

> > @@ -3785,6 +3788,7 @@ > > my $otherdoc = shift; > > my $otherkey = shift; > > my $follow = shift; > > + my %traversedNodes = (); > > > > # careful! 'undef' means something different from '0' here! > > if (!defined $follow) > > @@ -3838,10 +3842,10 @@ > > my $newkey = $self->appendObject($otherdoc, $oldrefkey, 0); > > $newrefkeys{$oldrefkey} = $newkey; > > } > > - $self->changeRefKeys($objnode, \%newrefkeys); > > + $self->changeRefKeys($objnode, \%newrefkeys, \%traversedNodes); > > for my $newkey (values %newrefkeys) > > { > > - $self->changeRefKeys($self->dereference($newkey), \% > > newrefkeys); > > + $self->changeRefKeys($self->dereference($newkey), \% > > newrefkeys, \%traversedNodes); > > } > > } > > return (%newrefkeys);

> > Is the point of this to remember what you saw across calls to > changeRefKeys? If yes, than I can see where this would probably be a > performance boost.

<dev> Of all the changes in the patch this one is the most vital. When appending PDF files you need to renumber all the indices, and if you descend into the hierarchy without keeping track of where you've been, you'll end up with misnumbered indices and your resultant PDF is corrupt. "traverse" does keeps track of traversedRefs, but the way it's called the vital traversal data is lost every time the function exits, so nodes are getting traversed multiple times. This was not meant as a performance boost but rather making it work correctly, however it does prevent multiple traversal of nodes which is a perf gain. To roll that into traverse, just pass a reference to %traversed in rather than defining it in the function. </dev> Show quoted text

>

> > @@ -5040,7 +5044,7 @@ > > if ($stream) > > { > > $stream = $self->{crypt}->encrypt($self, $stream, $objnode-> > > {objnum}, $objnode->{gennum}); > > - $str .= "\nstream\n" . $stream . 'endstream'; > > + $str .= "\nstream\n" . $stream . '\nendstream'; > > } > > return "obj\n$str\nendobj\n"; > > }

> > > This is wrong. '\n' is equivalent to 'n' and I really don't think > you want to append an 'n' to each stream!

<dev> Two comments: 1) that's correct...the single quoted \n will come through as a literal string \n. However, it does work, and I believe that's because in other places the $stream that gets returned ends up embedding inside of double quoted strings so it gets turned into a newline there. 2) in every PDF I've ever opened, there is a newline between the end of the stream and the "endstream" marker, so I do think that there needs to be a newline there. The single quotes should be changed to double quotes. </dev> Show quoted text

> >

> > @@ -5072,6 +5076,7 @@ > > my $objnode = shift; > > my $func = shift; > > my $funcdata = shift; > > + my $funcResult = undef; > > > > my $traversed = {}; > > my @stack = ($objnode); > > @@ -5080,7 +5085,8 @@ > > while ($i < @stack) > > { > > my $objnode = $stack[$i++]; > > - $self->$func($objnode, $funcdata); > > + $funcResult = undef; > > + $funcResult = $self->$func($objnode, $funcdata); > > > > my $type = $objnode->{type}; > > my $val = $objnode->{value};

> > The above change does nothing. You aren't even using the value you > store in $funcResult.

<dev> Yeah, that's a bad merge. There was code that used funcResult, but it was commented out and therefore didn't show up in the diff. Before writing recurseTraverse I was trying to get traverse to work. </dev> Show quoted text

>

> > @@ -5108,6 +5114,62 @@ > > return; > > } > > > > +sub recurseTraverse { > > + my ($self, $deref, $objnode, $traversedRef, $func, $funcdata, > > $objnum) = @_; > > + > > + my $type; > > + my $val; > > + my @nodes = (); > > + my $node = undef; > > + my $newObjNum = undef; > > + > > + if ((!defined($objnode)) || (! ref $objnode )) { > > + return; > > + } > > + > > + if (defined($objnum)) { > > + $newObjNum = $objnum; > > + } > > + > > + $type = $objnode->{type}; > > + $val = $objnode->{value}; > > + > > + if (exists $objnode->{objnum}) { > > + $newObjNum = $objnode->{objnum}; > > + } > > + > > + if ((defined($newObjNum)) && > > + (exists ($traversedRef->{$newObjNum}))) { > > + return; > > + } else { > > + $self->$func($objnode, $funcdata); > > + } > > + > > + if ($type eq 'dictionary') { > > + push (@nodes, values %{$val}); > > + } elsif ($type eq 'array') { > > + push (@nodes, @{$val}); > > + } elsif ($type eq 'object') { > > + push (@nodes, $val); > > + } elsif ($type eq 'reference') { > > + if ($deref) { > > + push (@nodes, $self->dereference($val)); > > + } else { > > + return; > > + } > > + } > > + > > + for $node (@nodes) { > > + recurseTraverse($self, $deref, $node, $traversedRef, $func, > > $funcdata, $newObjNum); > > + } > > + > > + if (($type eq 'object')) { > > + $traversedRef->{$newObjNum} = 1; > > + } > > + > > + return; > > +} > > + > > # decodeObject and decodeAll differ from each other like this: > > # > > # decodeObject JUST decodes a single stream directly below the > > object

> > > You neglected to document recurseTraverse, so I'm having some trouble > understanding the point of this method. It looks like a slower > version of my traverse() method, which deliberately uses an explicit > stack instead of the program stack for a big performance boost. I > don't understand why you check for under or scalar $objnode. Those > cases should not even be possible, I believe.

<dev> I didn't get a chance to document it...I didn't know this was being submitted to CPAN. :) The story is that I was having trouble with traverse hitting nodes multiple times, and the whole stack thing was a bit confusing. I wrote a recursive traversal just for sanity's sake so I could easily see the hierarchy being traversed. After I figured out that the main issue was just that the traversedRef was getting reset, I didn't get a chance to go back and make the change to the traverse function and test it. It may be slower, but it doesn't seem to make that much difference; I've used it to stitch together 30 parts of a PDF and it performed fine. </dev> Show quoted text

> >

> > @@ -5538,10 +5600,11 @@ > > my $self = shift; > > my $objnode = shift; > > my $newrefkeys = shift; > > + my $traversedRef = shift; > > > > my $follow = shift || 0; # almost always false > > > > - $self->traverse($follow, $objnode, \&_changeRefKeysCB, > > $newrefkeys); > > + $self->recurseTraverse($follow, $objnode, $traversedRef, > > \&_changeRefKeysCB, $newrefkeys, 0); > > return; > > } > >

> > >

> > > > @@ -5558,9 +5621,10 @@ > > if (exists $newrefkeys->{$objnode->{value}}) > > { > > $objnode->{value} = $newrefkeys->{$objnode->{value}}; > > + return 1; > > } > > } > > - return; > > + return 0; > > } > > > > =item $doc->abbrevInlineImage($object)

> > The return value of this function is never used. Why is it returning > 1 or 0? >

<dev> This change goes with the other unused change above: + $funcResult = $self->$func($objnode, $funcdata); It's also an unnecessary merge. You're right, it does nothing. </dev>

Thu May 01 21:40:34 2008 chris [...] chrisdolan.net - Correspondence added

Subject:	Re: [rt.cpan.org #35555] patch to improve node traversal
Date:	Thu, 1 May 2008 20:40:05 -0500
To:	bug-CAM-PDF [...] rt.cpan.org
From:	Chris Dolan <chris [...] chrisdolan.net>

Eric, Thanks for the detailed response, <dev>! The explanation makes sense, but I'm surprised I never encountered the bug before. I rewrote the patch to be much less intrusive, and attached it to this message. I haven't tested it except to ensure that the CAM::PDF unit tests still pass. Could you please try it against 1.13 and see if it solves your problems? Even better, I would be THRILLED if you were to write a simple test case that exposed the bug in 1.13. Thanks, Chris

Message body is not shown because sender requested not to inline it.

On May 1, 2008, at 2:35 PM, Eric Hall via RT wrote: Show quoted text

> > Queue: CAM-PDF > Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > > > On Wed, Apr 30, 2008 at 09:03:01PM -0400, Chris Dolan via RT wrote:

>> >> <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > >> >> Hi Eric, >> >> There are several problems with this patch as written. I've included >> criticisms below, which I intend constructively. Clearly you've put >> some work into this, but I cannot accept it as is. >> >> Furthermore, it would help if you explained your patch more >> thoroughly. In particular, you say that the patch improved >> traversal, but you don't say whether it improves performance or >> correctness. Is there a problem with the existing traverse routine?

> > Comments from the actual developer in-line below, wrapped in > <dev> </dev> tags. > > For the 'dead' and/or 'useless' code segments, those are > entirely my fault, I should have made sure we had a good/final patch > to send upstream before doing so, sorry for that. > > > -eric > >

>> >> Chris >> >> On Apr 30, 2008, at 4:36 PM, Eric Hall via RT wrote:

>>> >>> Wed Apr 30 17:36:27 2008: Request 35555 was acted upon. >>> Transaction: Ticket created by rt.cpan.org@darkart.com >>> Queue: CAM-PDF >>> Subject: patch to improve node traversal >>> Broken in: (no value) >>> Severity: (no value) >>> Owner: Nobody >>> Requestors: rt.cpan.org@darkart.com >>> Status: new >>> Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=35555 > >>> >>> >>> Hello- >>> Attached is a patch against 1.13 which improved PDF node >>> traversal for us. >>> >>> >>> -eric >>> >>> >>> >>> --- lib/CAM/PDF.pm.orig 2007-11-28 21:42:46.000000000 -0800 >>> +++ lib/CAM/PDF.pm 2008-03-27 20:06:54.000000000 -0700 >>> @@ -566,6 +566,9 @@ >>> $CAM::PDF::errstr = "Could not decipher xref row:\n" . >>> $self->trimstr($row); >>> return; >>> } >>> + if ((0 == $1) && (0 == $2)) { >>> + next; >>> + } >>> if ($type eq 'n') >>> { >>> $index->{$objnum} = $indexnum;

>> >> >> This addition is dead code. The case of a zero indexnum is already >> handled in the conditional above this.

> > <dev> > That's something that changed between 1.07 and 1.12...so it is dead > code now. In v1.07 it certainly was NOT dead code. However, I > don't believe we should quit the function when we find a zero index > number...we just need to skip that row. Killing the whole process > for that doesn't seem necessary for an otherwise completely > parseable PDF. > </dev> > >

>> >>

>>> @@ -3785,6 +3788,7 @@ >>> my $otherdoc = shift; >>> my $otherkey = shift; >>> my $follow = shift; >>> + my %traversedNodes = (); >>> >>> # careful! 'undef' means something different from '0' here! >>> if (!defined $follow) >>> @@ -3838,10 +3842,10 @@ >>> my $newkey = $self->appendObject($otherdoc, $oldrefkey, >>> 0); >>> $newrefkeys{$oldrefkey} = $newkey; >>> } >>> - $self->changeRefKeys($objnode, \%newrefkeys); >>> + $self->changeRefKeys($objnode, \%newrefkeys, \% >>> traversedNodes); >>> for my $newkey (values %newrefkeys) >>> { >>> - $self->changeRefKeys($self->dereference($newkey), \% >>> newrefkeys); >>> + $self->changeRefKeys($self->dereference($newkey), \% >>> newrefkeys, \%traversedNodes); >>> } >>> } >>> return (%newrefkeys);

>> >> Is the point of this to remember what you saw across calls to >> changeRefKeys? If yes, than I can see where this would probably be a >> performance boost.

> > <dev> > Of all the changes in the patch this one is the most vital. When > appending PDF files you need to renumber all the indices, and if > you descend into the hierarchy without keeping track of where > you've been, you'll end up with misnumbered indices and your > resultant PDF is corrupt. "traverse" does keeps track of > traversedRefs, but the way it's called the vital traversal data is > lost every time the function exits, so nodes are getting traversed > multiple times. This was not meant as a performance boost but > rather making it work correctly, however it does prevent multiple > traversal of nodes which is a perf gain. > > To roll that into traverse, just pass a reference to %traversed in > rather than defining it in the function. > </dev> > >

>>

>>> @@ -5040,7 +5044,7 @@ >>> if ($stream) >>> { >>> $stream = $self->{crypt}->encrypt($self, $stream, $objnode-> >>> {objnum}, $objnode->{gennum}); >>> - $str .= "\nstream\n" . $stream . 'endstream'; >>> + $str .= "\nstream\n" . $stream . '\nendstream'; >>> } >>> return "obj\n$str\nendobj\n"; >>> }

>> >> >> This is wrong. '\n' is equivalent to 'n' and I really don't think >> you want to append an 'n' to each stream!

> > <dev> > Two comments: > 1) that's correct...the single quoted \n will come through as a > literal string \n. However, it does work, and I believe that's > because in other places the $stream that gets returned ends up > embedding inside of double quoted strings so it gets turned into a > newline there. > > 2) in every PDF I've ever opened, there is a newline between the > end of the stream and the "endstream" marker, so I do think that > there needs to be a newline there. The single quotes should be > changed to double quotes. > </dev> > >

>> >>

>>> @@ -5072,6 +5076,7 @@ >>> my $objnode = shift; >>> my $func = shift; >>> my $funcdata = shift; >>> + my $funcResult = undef; >>> >>> my $traversed = {}; >>> my @stack = ($objnode); >>> @@ -5080,7 +5085,8 @@ >>> while ($i < @stack) >>> { >>> my $objnode = $stack[$i++]; >>> - $self->$func($objnode, $funcdata); >>> + $funcResult = undef; >>> + $funcResult = $self->$func($objnode, $funcdata); >>> >>> my $type = $objnode->{type}; >>> my $val = $objnode->{value};

>> >> The above change does nothing. You aren't even using the value you >> store in $funcResult.

> > <dev> > Yeah, that's a bad merge. There was code that used funcResult, but > it was commented out and therefore didn't show up in the diff. > Before writing recurseTraverse I was trying to get traverse to work. > </dev> >

>>

>>> @@ -5108,6 +5114,62 @@ >>> return; >>> } >>> >>> +sub recurseTraverse { >>> + my ($self, $deref, $objnode, $traversedRef, $func, $funcdata, >>> $objnum) = @_; >>> + >>> + my $type; >>> + my $val; >>> + my @nodes = (); >>> + my $node = undef; >>> + my $newObjNum = undef; >>> + >>> + if ((!defined($objnode)) || (! ref $objnode )) { >>> + return; >>> + } >>> + >>> + if (defined($objnum)) { >>> + $newObjNum = $objnum; >>> + } >>> + >>> + $type = $objnode->{type}; >>> + $val = $objnode->{value}; >>> + >>> + if (exists $objnode->{objnum}) { >>> + $newObjNum = $objnode->{objnum}; >>> + } >>> + >>> + if ((defined($newObjNum)) && >>> + (exists ($traversedRef->{$newObjNum}))) { >>> + return; >>> + } else { >>> + $self->$func($objnode, $funcdata); >>> + } >>> + >>> + if ($type eq 'dictionary') { >>> + push (@nodes, values %{$val}); >>> + } elsif ($type eq 'array') { >>> + push (@nodes, @{$val}); >>> + } elsif ($type eq 'object') { >>> + push (@nodes, $val); >>> + } elsif ($type eq 'reference') { >>> + if ($deref) { >>> + push (@nodes, $self->dereference($val)); >>> + } else { >>> + return; >>> + } >>> + } >>> + >>> + for $node (@nodes) { >>> + recurseTraverse($self, $deref, $node, $traversedRef, $func, >>> $funcdata, $newObjNum); >>> + } >>> + >>> + if (($type eq 'object')) { >>> + $traversedRef->{$newObjNum} = 1; >>> + } >>> + >>> + return; >>> +} >>> + >>> # decodeObject and decodeAll differ from each other like this: >>> # >>> # decodeObject JUST decodes a single stream directly below the >>> object

>> >> >> You neglected to document recurseTraverse, so I'm having some trouble >> understanding the point of this method. It looks like a slower >> version of my traverse() method, which deliberately uses an explicit >> stack instead of the program stack for a big performance boost. I >> don't understand why you check for under or scalar $objnode. Those >> cases should not even be possible, I believe.

> > <dev> > I didn't get a chance to document it...I didn't know this was being > submitted to CPAN. :) The story is that I was having trouble with > traverse hitting nodes multiple times, and the whole stack thing > was a bit confusing. I wrote a recursive traversal just for > sanity's sake so I could easily see the hierarchy being traversed. > > After I figured out that the main issue was just that the > traversedRef was getting reset, I didn't get a chance to go back > and make the change to the traverse function and test it. > > It may be slower, but it doesn't seem to make that much difference; > I've used it to stitch together 30 parts of a PDF and it performed > fine. > </dev> >

>> >>

>>> @@ -5538,10 +5600,11 @@ >>> my $self = shift; >>> my $objnode = shift; >>> my $newrefkeys = shift; >>> + my $traversedRef = shift; >>> >>> my $follow = shift || 0; # almost always false >>> >>> - $self->traverse($follow, $objnode, \&_changeRefKeysCB, >>> $newrefkeys); >>> + $self->recurseTraverse($follow, $objnode, $traversedRef, >>> \&_changeRefKeysCB, $newrefkeys, 0); >>> return; >>> } >>>

>> >> >>

>>> >>> @@ -5558,9 +5621,10 @@ >>> if (exists $newrefkeys->{$objnode->{value}}) >>> { >>> $objnode->{value} = $newrefkeys->{$objnode->{value}}; >>> + return 1; >>> } >>> } >>> - return; >>> + return 0; >>> } >>> >>> =item $doc->abbrevInlineImage($object)

>> >> The return value of this function is never used. Why is it returning >> 1 or 0? >>

> > <dev> > This change goes with the other unused change above: > > + $funcResult = $self->$func($objnode, $funcdata); > > It's also an unnecessary merge. You're right, it does nothing. > </dev> > > >

Thu Jul 31 04:12:43 2008 chris+rt [...] chrisdolan.net - Status changed from 'open' to 'resolved'