Subject: CAM::PDF Error Extracting Text From a PDF Page
Date: Tue, 11 Jun 2013 23:41:21 -0400
To: bug-CAM-PDF [...] rt.cpan.org
From: Hal Weitzman <haroldweitzman [...] gmail.com>
Hi Chris,
I am using CAM::PDF to loop page by page through a PDF document so that I
can exclude the PDF if it contains any one of a set of specific phrases.
Many PDFs give no trouble. However, recently I encountered several PDFs
where the following error occurs and halts my script:
"DecodeParms must be a dictionary"
I used the eval function to trap the error and allow the script to
continue. Here is the PDF scan section of my code:
my $pdf = CAM::PDF->new($path_to_temp . $title);    # Init an object
my $pages = $pdf->numPages;                         # get pages to search
$ii = 1;                                            # init the index to PDF pages
while ($ii < $pages) {                              # loop while more pages
    print "\n Get page $ii text ";                  # for testing
    eval {                                          # catch the error
        $PageText = $pdf->getPageText($ii);         # get the current page text
    };                                              # end of eval block
    if ($@) {                                       # check for error
        print "\n Get page $ii text failed -> $@ "; # inform the log
        next;                                       # skip to the next page
    }                                               # end of error check
The rest of the code searches the current page text for an exclude
phrase and performs the required action.
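For reference, the same page loop can be written as a small self-contained sketch. The helper name scan_pages and its arguments are hypothetical and not part of CAM::PDF. The text extractor is passed in as a code reference so the loop can be exercised without a PDF. It iterates over 1 .. $num_pages, so the last page is included and "next" always advances to the following page.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CAM::PDF.
#   $num_pages : page count (pages are numbered from 1)
#   $get_text  : code ref returning the text of a given page; may die
#   $phrases   : array ref of exclude phrases
# Returns a hash ref with:
#   failed               => array ref of pages whose extraction died
#   hit_page, hit_phrase => first match found, if any
sub scan_pages {
    my ($num_pages, $get_text, $phrases) = @_;
    my %result = (failed => []);

    for my $page (1 .. $num_pages) {
        my $text;
        # eval returns 1 only if the extractor did not die
        my $ok = eval { $text = $get_text->($page); 1 };
        if (!$ok) {
            push @{ $result{failed} }, $page;   # record and move on
            next;
        }
        next unless defined $text;              # page without text
        for my $phrase (@$phrases) {
            if (index($text, $phrase) >= 0) {
                @result{qw(hit_page hit_phrase)} = ($page, $phrase);
                return \%result;                # stop at the first match
            }
        }
    }
    return \%result;
}

# Intended use with CAM::PDF (assumes $pdf and @phrases already exist):
#   my $r = scan_pages($pdf->numPages,
#                      sub { $pdf->getPageText($_[0]) },
#                      \@phrases);
```

Collecting the failed page numbers instead of only printing them makes it possible to decide afterwards whether a PDF with unreadable pages should be treated as excluded or flagged for manual review.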
Here is the log output:
Row 1 SPE4A713Q5923.PDF
Get page 1 text
Get page 1 text failed -> DecodeParms must be a dictionary.
Get page 2 text
Get page 3 text
Get page 4 text
Get page 5 text
Get page 6 text
Get page 7 text
Get page 8 text
Get page 9 text
Get page 10 text
Get page 11 text
Get page 12 text
Get page 13 text
Get page 14 text
Get page 15 text
Get page 16 text found INSPECTION POINT: ORIGIN --> Skipped
All these PDFs fail on page 1.
I have attached the PDF that generated this log. I hope it is not too large.
I am using Padre 0.98, Perl 5.14.2 and CAM::PDF 1.59. My OS is Win7
Ultimate.
The PDF is downloaded from the web using WWW::Mechanize::Firefox
(version 0.74) to a temp directory then loaded into a new CAM::PDF object.
(It would be nice to be able to download the PDF directly into CAM::PDF.)
I don't know if this is a PDF version issue (is there a way to get the
version of a PDF?), a bug, or maybe a preferences setting.
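On the version question: a PDF file begins with a header of the form %PDF-x.y, so the declared version can be read straight from the file without any PDF library. The following is a minimal sketch under that assumption. The helper name read_pdf_version is hypothetical and not part of CAM::PDF. Documents from PDF 1.4 onward may also override the header with a Version entry in the document catalog, which this sketch does not inspect.

```perl
use strict;
use warnings;

# Hypothetical helper: return the version string from the "%PDF-x.y"
# header, or undef if no header is found in the first 1024 bytes.
sub read_pdf_version {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $head, 1024;   # the header sits at the file start
    close $fh;
    die "Cannot read $path: $!" unless defined $got;
    return $head =~ /%PDF-(\d+\.\d+)/ ? $1 : undef;
}

# Example call (the path is a placeholder):
#   my $version = read_pdf_version('some_file.pdf');
```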
Thank you for making this module available and for your continued support.
--
Regards
Hal Weitzman
haroldweitzman@gmail.com
Cell: 609-217-0088