This queue is for tickets about the CAM-PDF CPAN distribution.

Report information
The Basics
Id: 65150
Status: rejected
Priority: 0/
Queue: CAM-PDF

People
Owner: Nobody in particular
Requestors: andreas.gaertner [...] mni.fh-giessen.de
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Report: strings not found in a few PDFs
Date: Wed, 26 Jan 2011 12:30:57 +0100
To: bug-CAM-PDF [...] rt.cpan.org
From: "Andreas Gärtner" <andreas.gaertner [...] mni.fh-giessen.de>
Dear CAM-PDF Team,

I am contacting you regarding your Perl module CAM::PDF. I use this module to find specific strings in PDF files. After checking the results against Mendeley Desktop, I found a few differences which I would like to report.

1. PDF "balkwill-2001-lancet-infection-cancer"
   Searched string: posttranscriptional
2. PDF "berinstein-2007-vacc-toll-cancer"
   Searched string: rituximab

Neither string could be found with my CAM::PDF script. I have attached the PDFs and my Perl script to this email. The script is invoked as follows:

perl PDF-APPROX-MATCH.PL search=string metaphone=0 editdist=0 file=filename.pdf

Yours truly,
Andreas Gärtner

Hi Andreas,

The getPageText() method is intentionally not robust. As stated here http://search.cpan.org/dist/CAM-PDF/lib/CAM/PDF/PageText.pm there are hundreds of ways the code can be fooled by layout variations in a PDF doc.

There are much better tools than CAM::PDF for this problem, but I assert that none of them will work on ALL PDFs, not even Acrobat's text searching.

Chris
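
For anyone running into the same limitation, here is a minimal sketch of a search built on getPageText() that tolerates two common layout artifacts: words hyphenated across a line break and irregular whitespace. It is only an illustration. Whether either artifact is what hides "posttranscriptional" or "rituximab" in the attached PDFs is not known, and the helper names normalize_for_search and find_in_pdf are hypothetical, not part of CAM::PDF. Only new(), numPages(), getPageText() and $CAM::PDF::errstr come from the module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): undo two common extraction
# artifacts before matching. It joins words that were hyphenated across a
# line break, collapses whitespace runs to one space, and lowercases.
# Caveat: it also joins genuine compounds such as "well-\nknown".
sub normalize_for_search {
    my ($text) = @_;
    $text =~ s/(\w)-\s*\n\s*(\w)/$1$2/g;   # "post-\ntranscriptional" -> "posttranscriptional"
    $text =~ s/\s+/ /g;                    # newlines, tabs, repeated spaces -> single space
    return lc $text;
}

# Hypothetical helper: return the page numbers whose extracted text
# contains $needle (case-insensitive, after normalization).
sub find_in_pdf {
    my ($file, $needle) = @_;
    require CAM::PDF;                      # loaded lazily; install from CPAN
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    my @hits;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;         # extraction can fail for a page
        push @hits, $page
            if index(normalize_for_search($text), lc $needle) >= 0;
    }
    return @hits;
}

# Usage: perl search.pl filename.pdf string
if (@ARGV == 2) {
    my ($file, $needle) = @ARGV;
    my @hits = find_in_pdf($file, $needle);
    print @hits ? "Found on page(s): @hits\n" : "Not found\n";
}
```

This only smooths over artifacts that survive into the extracted text. If a PDF uses custom font encodings or positions each glyph individually, getPageText() may not return the word in any recognizable form, and a different extraction tool would be needed.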