Subject: | A patch to "downgrade" the q - Q pair of operators from "block forming" status |
The current CAM::PDF page content parser assumes that operators which come in pairs are always properly nested, so that they form so-called "blocks". The PDF Reference explicitly prescribes this for, e.g., BT - ET and BDC - EMC pairs.
However, nowhere does the Reference say that q - Q pairs and BDC - EMC pairs must be properly nested with respect to each other. True, one possible reading is that content bracketed between BDC - EMC is somehow self-sufficient ("as a group to be processed as a single unit"), but that probably does not imply that the two kinds of pairs have to nest.
Many (in fact most) of the PDF files I deal with have q - Q and BDC - EMC pairs that are interleaved rather than nested. Ironically or not, these files are produced by Adobe software, and in my bubble they make up 99% of PDF files. This probably affects other users as well.
As a result, text extraction with CAM::PDF, for example, fails:
getpdftext -v pdf_17_errata.pdf 1
Wrong block ending (expected 'Q', got 'EMC') at C:/Strawberry/perl/site/lib/CAM/PDF.pm line 2508.
Parse failed at C:/Strawberry/perl/site/lib/CAM/PDF.pm line 2508.
(The file is https://wwwimages2.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/pdf_17_errata.pdf, a more or less random Google search result.)
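With the proposed patch, this problem is gone. The patch stops treating q - Q as a block-forming pair and instead saves and restores the graphics state for q and Q on an explicit stack.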
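To illustrate the failure mode outside of CAM::PDF, here is a minimal standalone sketch. It is not the module's actual parser: check_nesting and the six-operator sequence are hypothetical, made up for this example, and operands are omitted. It only shows that a single block stack rejects interleaved q - Q and BDC - EMC pairs, while the same check passes once q is no longer block forming.

```perl
use strict;
use warnings;

# Hypothetical operator sequence (operands omitted):
# q ... Q is interleaved with BDC ... EMC instead of nested inside it.
my @ops = qw(BDC q BT ET EMC Q);

# Check pair nesting with a single stack of expected closing operators.
# Returns undef on success, or an error string on the first mismatch.
sub check_nesting {
    my ($endings, @ops) = @_;
    my %closers = map { $_ => 1 } values %$endings;
    my @stack;
    for my $op (@ops) {
        if (exists $endings->{$op}) {
            # Opening operator: remember which closer we expect.
            push @stack, $endings->{$op};
        }
        elsif ($closers{$op}) {
            # Closing operator: it must match the innermost open block.
            my $expected = pop @stack;
            return "unexpected '$op'" if !defined $expected;
            return "expected '$expected', got '$op'" if $expected ne $op;
        }
        # Any other operator (including q/Q when not block forming)
        # is ignored by the nesting check.
    }
    return @stack ? "unclosed block, expected '$stack[-1]'" : undef;
}

# Table with q as a block-forming operator, and the same table without it.
my %with_q    = (q => 'Q', BT => 'ET', BDC => 'EMC', BMC => 'EMC');
my %without_q = (BT => 'ET', BDC => 'EMC', BMC => 'EMC');

my $err_with    = check_nesting(\%with_q,    @ops);
my $err_without = check_nesting(\%without_q, @ops);

print "q block forming:     ", $err_with    // 'ok', "\n";
print "q not block forming: ", $err_without // 'ok', "\n";
```

With q in the table, the check stops at EMC with the same kind of "expected 'Q', got 'EMC'" mismatch as in the error above. Without it, the sequence is accepted, which is why the graphics state then has to be tracked separately for q and Q.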
Subject: | diff2.txt |
--- Content.pm.old Thu Aug 15 06:08:26 2013
+++ Content.pm Wed May 31 17:45:46 2017
@@ -47,7 +47,7 @@
my %loaded; # keep track of eval'd renderers
my %endings = (
- q => 'Q',
+# q => 'Q',
BT => 'ET',
BDC => 'EMC',
BMC => 'EMC',
@@ -173,6 +173,7 @@
content => $content,
blocks => [],
verbose => $verbose,
+ GS_stack => [],
}, $pkg;
return $self->parse(\$content);
}
@@ -524,6 +525,13 @@
}
$gs = $newgs;
+ }
+
+ if ( $block-> { type } eq 'op' and $block-> { name } eq 'q' ) {
+ push @{ $self-> { GS_stack }}, $gs-> clone;
+ }
+ if ( $block-> { type } eq 'op' and $block-> { name } eq 'Q' ) {
+ $gs = $block-> { gs } = pop @{ $self-> { GS_stack }}
}
if ($block->{type} eq 'block')