Bug #47129 for PDF-OCR2: ->text() fails in xpdf, entire script dies

Thu Jun 18 18:31:17 2009 robert.waters [...] gmail.com - Ticket created

Subject:	->text() fails in xpdf, entire script dies
Date:	Thu, 18 Jun 2009 18:30:44 -0400
To:	bug-pdf-ocr2 [...] rt.cpan.org
From:	R M Waters <robert.waters [...] gmail.com>

Is there a way to check for xpdf errors, rather than have the program die because of them?

I am currently iterating through a directory of PDFs, OCRing each. I receive the following errors on a call to ->text for a certain PDF:

"CCIT somethingsomething" x 2-10 (this only happens sometimes)
"invalid stream" x 2
"bad args for pdf images" x 1

and then my loop stops and my script dies.

I have enabled debugging in PDF::OCR2::Page and PDF::GetImages, but the problem is occurring in the xpdf library (so it seems to me). I have also set the PDF::OCR2::CHECK_PDF symbol; it doesn't help.

I would love to be able to use a construct like this:

    foreach $file (@files){
        my $ocr = $pdf->text($file) or next;
        # OR #
        if (!defined $ocr) {next;}
        # OR #
        if(xpdf_test($file)) {my $ocr ... }
    }

I am currently implementing a blacklist array (a jail for known offenders), but have several thousand PDFs to run (hopefully there are only a few problem documents).

Thank you for the awesome libraries.
Robert Waters

Thu Jun 18 19:08:08 2009 leocharre [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #47129] ->text() fails in xpdf, entire script dies
Date:	Thu, 18 Jun 2009 19:07:49 -0400
To:	bug-PDF-OCR2 [...] rt.cpan.org
From:	leo charre <leocharre [...] gmail.com>

Yes. The check is to open the PDF via PDF::API2; the failure might have to be caught by an eval...
http://search.cpan.org/~leocharre/PDF-OCR2-1.13/lib/PDF/OCR2.pm#ERRORS

Also see:
http://cpansearch.perl.org/src/LEOCHARRE/PDF-OCR2-1.13/bin/pdfocrtest

The code is basically:

    my $instance;
    eval { $instance = PDF::API2->open($abs_pdf_path) } ? 'ok' : 'bad'

If you have a script, you can test the PDF before you try to do something else to it.

I need to write a little more on this matter. As it turns out, PDF::API2 may be finicky about reading some PDFs it deems to have bad xref tables. And pdftk can "fix" the tables, but then you are altering the PDF, which I want to stay away from.

I hope this helps for now.


-- Leo Charre
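The pre-check described in this reply could be wrapped in a small helper along these lines. This is only a sketch: the name pdf_opens_ok and the optional $opener argument are hypothetical (they are not part of PDF::OCR2 or PDF::API2), and the default opener assumes PDF::API2 is installed. The only library call used is PDF::API2->open, as in the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the PDF can be opened, 0 otherwise.
# $opener is an optional code ref (an assumption made here so the check
# can be swapped out); by default it opens the file with PDF::API2.
sub pdf_opens_ok {
    my ($path, $opener) = @_;
    $opener ||= sub { require PDF::API2; PDF::API2->open($_[0]) };

    # The trailing "1" makes the eval return true only when the opener
    # did not die, regardless of what the opener itself returned.
    my $ok = eval { $opener->($path); 1 };
    warn "cannot open $path: $@" unless $ok;
    return $ok ? 1 : 0;
}
```

A directory run could then filter its input first, for example with grep { pdf_opens_ok($_) } @files, and leave the files on disk untouched.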

Thu Jun 18 19:08:09 2009 The RT System itself - Status changed from 'new' to 'open'

Thu Jun 18 20:08:29 2009 robert.waters [...] gmail.com - Correspondence added

Subject:	Re: [rt.cpan.org #47129] ->text() fails in xpdf, entire script dies
Date:	Thu, 18 Jun 2009 20:08:08 -0400
To:	bug-PDF-OCR2 [...] rt.cpan.org
From:	R M Waters <robert.waters [...] gmail.com>

Thank you so much.

I've been able to get it "working" by wrapping the call to ->text() in an eval block. It still spits out errors to stderr, but it is generating text.

Just as an FYI, I am getting the following information logged to the shell:

    glibc detected free(): invalid next size (normal) 0xsomethingsomething

I wouldn't even mention it, but I am redirecting stderr to a file, and so am surprised to see it.

Thanks for all your help.

-Robert Waters
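For readers following along, the eval workaround mentioned in this message might look roughly like the sketch below. The helper name ocr_all and the $extract code ref are hypothetical; in a real script $extract would wrap whatever call produces the text (here, the ->text() call on a PDF::OCR2 object), which is left out because its exact invocation is not shown in this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $extract->($file) for every file and traps
# die() with eval, so a single bad PDF does not end the whole loop.
# Returns (\%text_by_file, \@failed_files).
sub ocr_all {
    my ($extract, @files) = @_;
    my (%text, @failed);

    for my $file (@files) {
        my $out = eval { $extract->($file) };
        if (defined $out) {
            $text{$file} = $out;
        }
        else {
            # $@ is set when the call died; otherwise it returned undef.
            warn "OCR failed for $file: " . ($@ || "no text returned\n");
            push @failed, $file;
        }
    }
    return (\%text, \@failed);
}
```

Note that eval only traps Perl-level exceptions; messages that external tools print themselves will still appear.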


Wed Jun 24 14:45:43 2009 leocharre [...] cpan.org - TimeWorked changed from (no value) to '240'

Wed Jun 24 14:45:43 2009 leocharre [...] cpan.org - Status changed from 'open' to 'resolved'

Wed Jun 24 14:45:46 2009 leocharre [...] cpan.org - Fixed in 1.19 added

Wed Jun 24 14:45:46 2009 leocharre [...] cpan.org - Severity Important added

Wed Jun 24 14:45:46 2009 leocharre [...] cpan.org - Broken in 1.09 added

Wed Jun 24 14:45:46 2009 leocharre [...] cpan.org - Broken in 1.10 added

Wed Jun 24 14:45:46 2009 leocharre [...] cpan.org - Broken in 1.11 added

Wed Jun 24 14:45:47 2009 leocharre [...] cpan.org - Broken in 1.12 added

Wed Jun 24 14:45:47 2009 leocharre [...] cpan.org - Broken in 1.13 added

Wed Jun 24 14:51:29 2009 leocharre [...] cpan.org - Correspondence added


Alright, after much thought and deliberation, I released PDF::OCR2 1.19. This version checks a PDF for this problem *by default*. Please see
http://search.cpan.org/~leocharre/PDF-OCR2-1.19/lib/PDF/OCR2.pod#$PDF::OCR2::CHECK_PDF

There is also a class flag/parameter to allow PDF::OCR2 to make a copy of the file and repair the xref (in that copy, not the original).

Basically, calling text() will now no longer crash your program; it should just return undef, and you'll get warnings on STDERR explaining why.
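With text() returning undef on failure, the loop requested at the top of this ticket could be sketched as follows. The helper collect_text and the $get_text code ref are hypothetical stand-ins; the real call would be whatever produces the text from a PDF::OCR2 object. The sketch tests with defined rather than "or next", on the assumption that an empty string (a page with no recognizable text) should still count as success.

```perl
use strict;
use warnings;

# Hypothetical helper: $get_text stands in for the real text() call and
# is expected to return undef for a PDF that could not be processed.
# Returns a list of [ $file, $text ] pairs for the files that worked.
sub collect_text {
    my ($get_text, @files) = @_;
    my @results;

    for my $file (@files) {
        my $ocr = $get_text->($file);
        next unless defined $ocr;    # skip problem PDFs, keep empty strings
        push @results, [ $file, $ocr ];
    }
    return @results;
}
```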

Wed Jun 24 14:51:30 2009 The RT System itself - Status changed from 'resolved' to 'open'

Wed Jun 24 14:51:30 2009 leocharre [...] cpan.org - Status changed from 'open' to 'resolved'