
This queue is for tickets about the CAM-PDF CPAN distribution.

Report information
The Basics
Id: 121949
Status: open
Priority: 0/
Queue: CAM-PDF

People
Owner: Nobody in particular
Requestors: futuramedium [...] yandex.ru
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: A patch to "downgrade" the q - Q pair of operators from "block forming" status
The current CAM::PDF page content parser assumes that operators which come in pairs are always properly nested, so that they form so-called "blocks". E.g. the PDF Reference explicitly prescribes this for BT - ET and BDC - EMC pairs.

However, nowhere in the Reference is it said that q - Q and BDC - EMC pairs must be properly nested with respect to each other. True, one possible reading might be that content bracketed between BDC - EMC is somehow self-sufficient -- "as a group to be processed as a single unit". But that probably does not imply that the pairs nest.

Some (most) PDF files I'm dealing with do not have q - Q and BDC - EMC pairs nested. Ironically or not, they are files produced by Adobe software, and in my bubble they are 99% of PDF files. Maybe it affects other users as well.

As a result, e.g. text extraction with CAM::PDF is impossible:

getpdftext -v pdf_17_errata.pdf 1
Wrong block ending (expected 'Q', got 'EMC') at C:/Strawberry/perl/site/lib/CAM/PDF.pm line 2508.
Parse failed at C:/Strawberry/perl/site/lib/CAM/PDF.pm line 2508.

(The file is https://wwwimages2.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/pdf_17_errata.pdf -- a more or less random Google search result.)

With the proposed patch, this problem is gone.
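The failure can be reproduced outside CAM::PDF with a minimal sketch of a strict pair-nesting check. The helper check_nesting and the four-operator sample stream are hypothetical and only mimic the behaviour described above; they are not the module's actual code. The %strict table copies the pairs visible in the patch context, and %relaxed is the same table without q => 'Q'.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF: walks a list of operator
# names and treats every key of %$endings as a block opener.
sub check_nesting {
    my ($ops, $endings) = @_;
    my %closers = map { $_ => 1 } values %$endings;
    my @expected;    # closing operators still awaited, innermost last
    for my $op (@$ops) {
        if (exists $endings->{$op}) {
            push @expected, $endings->{$op};
        }
        elsif ($closers{$op}) {
            my $want = pop @expected;
            return "Wrong block ending (expected '" . ($want // 'nothing') . "', got '$op')"
                if !defined $want || $want ne $op;
        }
    }
    return;    # undef in scalar context: no nesting error found
}

# Interleaved, not nested: q opens inside BDC but closes after EMC.
my @ops = qw(BDC q EMC Q);

my %strict  = (q => 'Q', BT => 'ET', BDC => 'EMC', BMC => 'EMC');
my %relaxed = (BT => 'ET', BDC => 'EMC', BMC => 'EMC');

my $strict_result  = check_nesting(\@ops, \%strict);
my $relaxed_result = check_nesting(\@ops, \%relaxed);

print defined $strict_result  ? "$strict_result\n"  : "ok\n";
print defined $relaxed_result ? "$relaxed_result\n" : "ok\n";
```

With q treated as a block opener the check stops at EMC; once q and Q are ordinary operators the same stream passes.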
Subject: diff2.txt
--- Content.pm.old Thu Aug 15 06:08:26 2013
+++ Content.pm Wed May 31 17:45:46 2017
@@ -47,7 +47,7 @@
 my %loaded; # keep track of eval'd renderers

 my %endings = (
-   q => 'Q',
+#   q => 'Q',
    BT => 'ET',
    BDC => 'EMC',
    BMC => 'EMC',
@@ -173,6 +173,7 @@
       content => $content,
       blocks => [],
       verbose => $verbose,
+      GS_stack => [],
    }, $pkg;
    return $self->parse(\$content);
 }
@@ -524,6 +525,13 @@
          }
          $gs = $newgs;
+      }
+
+      if ( $block-> { type } eq 'op' and $block-> { name } eq 'q' ) {
+         push @{ $self-> { GS_stack }}, $gs-> clone;
+      }
+      if ( $block-> { type } eq 'op' and $block-> { name } eq 'Q' ) {
+         $gs = $block-> { gs } = pop @{ $self-> { GS_stack }}
       }
       if ($block->{type} eq 'block')
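The idea behind the third hunk, saving the graphics state on q and restoring it on Q with an explicit stack instead of relying on block nesting, can be sketched in isolation. Everything here is an assumption made for illustration: run_ops is a hypothetical stand-in for the parser loop, the graphics state is reduced to a single linewidth field, and a shallow hash copy replaces the $gs->clone call used in the patch. Unlike the patch, the sketch also ignores a Q that has no matching q.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser loop: each op is [name, @args].
sub run_ops {
    my (@ops) = @_;
    my $gs = { linewidth => 1 };   # simplified graphics state
    my @gs_stack;                  # plays the role of GS_stack in the patch
    my @seen;                      # line width in effect after each operator
    for my $op (@ops) {
        my ($name, @args) = @$op;
        if ($name eq 'q') {
            push @gs_stack, { %$gs };           # save a copy of the state
        }
        elsif ($name eq 'Q') {
            $gs = pop @gs_stack if @gs_stack;   # restore; skip unbalanced Q
        }
        elsif ($name eq 'w') {
            $gs->{linewidth} = $args[0];        # 'w' sets the line width
        }
        push @seen, $gs->{linewidth};
    }
    return @seen;
}

# q/Q interleaved with BDC/EMC, as in the failing file.
my @widths = run_ops(['BDC'], ['q'], ['w', 5], ['EMC'], ['Q']);
print "@widths\n";
```

The width set inside q ... Q stays in effect across the EMC and is reverted only at Q, so the save/restore semantics no longer depend on how the marked-content pair is placed.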
From: raherh [...] gmail.com
In reply to the report by vadimr (Wed May 31 10:59:00 2017):
I was about to report a bug that returns "Wrong block ending (expected 'Q', got 'EMC')". Fortunately I noticed your fix, which works excellently. Thank you very much.