
This queue is for tickets about the PDF-OCR2 CPAN distribution.

Report information
The Basics
Id: 47129
Status: resolved
Worked: 4 hours (240 min)
Priority: 0/
Queue: PDF-OCR2

People
Owner: Nobody in particular
Requestors: robert.waters [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: Important
Broken in:
  • 1.09
  • 1.10
  • 1.11
  • 1.12
  • 1.13
Fixed in: 1.19



Subject: ->text() fails in xpdf, entire script dies
Date: Thu, 18 Jun 2009 18:30:44 -0400
To: bug-pdf-ocr2 [...] rt.cpan.org
From: R M Waters <robert.waters [...] gmail.com>
Is there a way to check for xpdf errors, rather than have the program die because of them?

I am currently iterating through a directory of PDFs, OCRing each. I receive the following errors on a call to ->text() for a certain PDF:

    "CCIT somethingsomething" x 2-10 (this only happens sometimes)
    "invalid stream" x 2
    "bad args for pdf images" x 1

and then my loop stops and my script dies.

I have enabled debugging in PDF::OCR2::Page and PDF::GetImages, but the problem is occurring in the xpdf library (so it seems to me). I have also set the PDF::OCR2::CHECK_PDF symbol; it doesn't help.

I would love to be able to use a construct like this:

    foreach $file (@files) {
        my $ocr = $pdf->text($file) or next;
        # OR #
        if (!defined $ocr) { next; }
        # OR #
        if (xpdf_test($file)) { my $ocr ... }
    }

I am currently implementing a blacklist array (a jail for known offenders), but have several thousand PDFs to run (hopefully there are only a few problem documents).

Thank you for the awesome libraries.

Robert Waters
Subject: Re: [rt.cpan.org #47129] ->text() fails in xpdf, entire script dies
Date: Thu, 18 Jun 2009 19:07:49 -0400
To: bug-PDF-OCR2 [...] rt.cpan.org
From: leo charre <leocharre [...] gmail.com>
Yes. The way to check is to open the file via PDF::API2; the failure may have to be caught by an eval...

http://search.cpan.org/~leocharre/PDF-OCR2-1.13/lib/PDF/OCR2.pm#ERRORS

Also see:
http://cpansearch.perl.org/src/LEOCHARRE/PDF-OCR2-1.13/bin/pdfocrtest

The code is basically:

    my $instance;
    eval { $instance = PDF::API2->open($abs_pdf_path) } ? 'ok' : 'bad'

If you have a script, you can test the PDF before you try to do something else to it.

I need to write a little more on this matter. As it turns out, PDF::API2 may be finicky about reading some PDFs it deems to have bad xref tables. And pdftk can "fix" the tables, but then you are altering the PDF, which I want to stay away from.

I hope this helps for now.
-- Leo Charre
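The pre-check described in this reply can be sketched as a small filter over a list of files. This is only a sketch: the helper name pdf_opens_cleanly is hypothetical and not part of PDF::OCR2, and it assumes PDF::API2 is installed (if it is not, every file is simply reported as unusable). A file that PDF::API2 opens may still fail later inside xpdf, so this narrows the problem rather than ruling it out.

```perl
use strict;
use warnings;

# Returns 1 if PDF::API2 can open the file, 0 otherwise.
# pdf_opens_cleanly is a hypothetical helper name used for illustration.
sub pdf_opens_cleanly {
    my ($abs_pdf_path) = @_;
    return eval {
        require PDF::API2;                 # dies here if the module is missing
        PDF::API2->open($abs_pdf_path);    # dies on files it cannot read
        1;
    } ? 1 : 0;
}

# Usage: keep only the files named on the command line that pass the check.
my @usable = grep { pdf_opens_cleanly($_) } @ARGV;
print "usable: $_\n" for @usable;
```

Because the open happens inside eval, a bad file costs one failed check instead of ending the whole run.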
Subject: Re: [rt.cpan.org #47129] ->text() fails in xpdf, entire script dies
Date: Thu, 18 Jun 2009 20:08:08 -0400
To: bug-PDF-OCR2 [...] rt.cpan.org
From: R M Waters <robert.waters [...] gmail.com>
Thank you so much.

I've been able to get it "working" by wrapping the call to ->text() in an eval block. It still spits out errors to stderr, but it is generating text.

Just as an FYI, I am getting the following information logged to the shell:

    glibc detected free(): invalid next size (normal) 0xsomethingsomething

I wouldn't even mention it, but I am redirecting stderr to a file, and so am surprised to see it.

Thanks for all your help.

-Robert Waters
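The eval workaround mentioned above, combined with the "jail for known offenders" idea from the original report, might look roughly like the following. It is a sketch under assumptions: ocr_all and $fake_extract are hypothetical names, and the extractor is a stand-in coderef so the example runs without PDF::OCR2 installed. In real use the coderef would wrap the actual ->text() call.

```perl
use strict;
use warnings;

# Run $extract on every file. A die inside $extract skips only that file.
# Returns two hash refs: extracted text by file, and error message by file.
sub ocr_all {
    my ($extract, @files) = @_;
    my (%text_for, %error_for);
    for my $file (@files) {
        my $text = eval { $extract->($file) };
        if (defined $text) {
            $text_for{$file} = $text;
        }
        else {
            # $@ is empty when $extract returned undef without dying.
            chomp(my $err = $@ || 'no text returned');
            $error_for{$file} = $err;
        }
    }
    return (\%text_for, \%error_for);
}

# Stand-in extractor (hypothetical): dies on any file with "bad" in its name.
my $fake_extract = sub {
    my ($file) = @_;
    die "invalid stream\n" if $file =~ /bad/;
    return "text of $file";
};

my ($ok, $failed) = ocr_all($fake_extract, 'a.pdf', 'bad.pdf', 'c.pdf');
print "ok: $_\n" for sort keys %$ok;
print "failed: $_ ($failed->{$_})\n" for sort keys %$failed;
```

The error hash doubles as the blacklist: its keys are the files to skip or inspect on the next run.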
On Thu Jun 18 20:08:29 2009, robert.waters@gmail.com wrote:
Alright, after much thought and deliberation, I released PDF::OCR2 1.19. This version checks a PDF for this problem *by default*. Please see
http://search.cpan.org/~leocharre/PDF-OCR2-1.19/lib/PDF/OCR2.pod#$PDF::OCR2::CHECK_PDF

There is also a class flag/parameter to allow PDF::OCR2 to make a copy of the file and repair the xref (in that copy, not the original).

Basically, calling text() will no longer crash your program; it should just return undef, and you'll get warnings on STDERR about why.
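With text() returning undef instead of dying, the caller's loop can test the return value directly. The sketch below shows one way to do that and to collect the accompanying warnings per file. The names text_with_warnings and $fake_ocr are hypothetical, and the extractor is a stand-in coderef rather than the real module. It also assumes the warnings are raised with Perl's warn; anything an external program writes straight to STDERR would not be captured this way.

```perl
use strict;
use warnings;

# Call $extract on one file and return (text_or_undef, \@warnings).
# Perl-level warnings raised during the call are collected, not printed.
sub text_with_warnings {
    my ($extract, $file) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $text = $extract->($file);
    return ($text, \@warnings);
}

# Stand-in extractor (hypothetical): warns and returns undef for "bad" files.
my $fake_ocr = sub {
    my ($file) = @_;
    if ($file =~ /bad/) {
        warn "$file: xref table looks damaged\n";
        return;
    }
    return "text of $file";
};

for my $file ('a.pdf', 'bad.pdf') {
    my ($text, $why) = text_with_warnings($fake_ocr, $file);
    if (!defined $text) {
        print "skipping $file: @$why";
        next;
    }
    print "$file: $text\n";
}
```

This is close to the `if (!defined $ocr) { next; }` construct asked for in the original report, with the reason for each skip kept alongside the file name.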