Bug #79690 for Helios: "Can't use string ("") as an ARRAY ref while "strict refs" in use" error

Mon Sep 17 09:13:21 2012 LAJANDY [...] cpan.org - Ticket created

Sometimes when a worker process picks up a job, it fails with the following error:

"Can't use string ("") as an ARRAY ref while "strict refs" in use at /usr/lib/perl5/site_perl/5.8.8/Helios/Job.pm line 128."

This error is recorded in the ERROR table, but no success or failure messages are reported in the job history. The job is picked up later by another process and completes successfully.
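The message is Perl's standard fatal error for dereferencing a plain string as an array while `use strict` is in effect. The minimal sketch below reproduces it without Helios. It assumes that the value arg() returned was an empty string, which is what the later analysis in this ticket found. `$arg` is a hypothetical stand-in for that value.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what TheSchwartz::Job->arg() returned for a
# corrupted job: an empty string where an arrayref was expected.
my $arg = "";

# Dereferencing a plain string as an array is fatal under "strict refs",
# so trap the exception with eval and keep the message for inspection.
my $err;
eval {
    my @args = @{$arg};
    1;
} or $err = $@;

print "caught: $err" if defined $err;
```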

Tue Dec 04 17:47:23 2012 LAJANDY [...] cpan.org - Taken

Tue Dec 04 17:47:29 2012 LAJANDY [...] cpan.org - Status changed from 'new' to 'open'

Tue Dec 04 17:48:11 2012 LAJANDY [...] cpan.org - Correspondence added

Subject:

"Can't use string ("") as an ARRAY ref while "strict refs" in use" error

Tue Dec 04 17:48:39 2012 LAJANDY [...] cpan.org - Subject changed from (no value) to 'Can't use string ("") as an ARRAY ref while "strict refs" in use"'

Tue Dec 04 17:48:55 2012 LAJANDY [...] cpan.org - Subject changed from 'Can't use string ("") as an ARRAY ref while "strict refs" in use"' to '"Can't use string ("") as an ARRAY ref while "strict refs" in use" error'

Tue Dec 04 17:49:09 2012 LAJANDY [...] cpan.org - Broken in 2.50_2850 added

Tue Dec 04 17:49:09 2012 LAJANDY [...] cpan.org - Broken in 2.50_2860 added

Tue Dec 04 17:49:09 2012 LAJANDY [...] cpan.org - Broken in 2.50_2910 added

Tue Dec 04 17:49:09 2012 LAJANDY [...] cpan.org - Broken in 2.50_3040 added

Tue Dec 04 17:49:09 2012 LAJANDY [...] cpan.org - Broken in 2.50_3060 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.50_3070 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.50_3160 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.50_3161 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.50_3220 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.50_3630 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.52_3950 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.52_4150 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.52_4310 added

Tue Dec 04 17:49:10 2012 LAJANDY [...] cpan.org - Broken in 2.60 added

Tue Dec 04 17:49:38 2012 LAJANDY [...] cpan.org - Broken in 2.50_2850 deleted

Tue Dec 04 17:49:38 2012 LAJANDY [...] cpan.org - Broken in 2.50_2860 deleted

Tue Dec 04 17:49:38 2012 LAJANDY [...] cpan.org - Broken in 2.50_2910 deleted

Tue Dec 04 17:49:38 2012 LAJANDY [...] cpan.org - Broken in 2.50_3040 deleted

Tue Dec 04 17:49:38 2012 LAJANDY [...] cpan.org - Broken in 2.50_3060 deleted

Tue Dec 04 17:49:38 2012 LAJANDY [...] cpan.org - Broken in 2.50_3070 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.50_3160 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.50_3161 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.50_3220 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.50_3630 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.52_3950 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.52_4150 deleted

Tue Dec 04 17:49:39 2012 LAJANDY [...] cpan.org - Broken in 2.52_4310 deleted

Fri Aug 09 17:13:20 2013 LAJANDY [...] cpan.org - Correspondence added

A GitHub branch has been created for this bug: https://github.com/logicalhelion/helios/tree/bug/rt79690

Sun Aug 11 17:48:19 2013 LAJANDY [...] cpan.org - Correspondence added

A potential patch for this bug has been committed to GitHub:

https://github.com/logicalhelion/helios/commit/25654bbf106be0d91b4447c6e246fc16fe0026f1

If it passes testing, it will be rolled into a forthcoming bugfix release.

Note, however, that this does not actually fix the problem. It only handles the problem in a way that keeps non-retrying jobs from disappearing from the job queue. The bug itself is being caused by TheSchwartz for some reason: TheSchwartz is passing Helios::Service a TheSchwartz::Job object whose arg() is an empty string, even though the job in question does have job arguments. This causes Helios::Job->new() to die when it starts job argument processing, because it expects arg() to return an arrayref, NOT a string. Changing Helios::Job to handle the empty string is the wrong idea: the job actually has arguments, Helios just didn't get them (so the copy of the job Helios was given is corrupted). Running a job without its arguments would be worse than not running it at all. This patch catches the error, logs a Critical error to the Helios log, and then exits the worker process. That way, TheSchwartz will not force a failure of the job (which it does if a worker returns without marking a job as successful or failed), and the job will stay in the job queue until its grabbed_until time expires and another worker process picks it up.

Further investigation will hopefully reveal the root cause of this bug, but this patch at least ensures job integrity and system reliability.
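The strategy the patch describes can be sketched in plain Perl: trap the constructor failure, log it as Critical, and shut the worker down rather than mark the job failed. This is not the code from the commit above. `inflate_job()` and `inflate_or_bail()` are hypothetical stand-ins for Helios::Job->new() and the patched Helios::Service logic, and the logger is a plain callback rather than the Helios logging API.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Helios::Job->new(): it assumes the job's
# arg() value is an arrayref and dereferences it, so an empty string
# makes it die under "strict refs".
sub inflate_job {
    my ($arg) = @_;
    my @args = @{$arg};
    return { args => \@args };
}

# Sketch of the patch's strategy: trap the failure, log it as Critical,
# and return nothing so the caller can exit the worker process without
# marking the job as completed or failed.
sub inflate_or_bail {
    my ($arg, $log) = @_;
    my $job = eval { inflate_job($arg) };
    return $job if $job;
    chomp(my $err = $@);
    $log->("CRITICAL: corrupt job received: $err");
    return;
}

my @log;
my $logger = sub { push @log, $_[0] };

my $good = inflate_or_bail([ 'job-argument' ], $logger);
my $bad  = inflate_or_bail('', $logger);   # logs one CRITICAL entry

print "$_\n" for @log;

# In a real worker the corrupt case would end the process, e.g.
#   exit(1) unless $bad;
# so the job stays grabbed until grabbed_until expires.
```

Exiting, rather than returning to TheSchwartz, is the key design choice here. It leaves the job untouched in the queue so that a later worker can retry it with a clean fetch.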

Fri Sep 13 17:17:39 2013 LAJANDY [...] cpan.org - Correspondence added

The main problem with this bug is that it can cause jobs submitted to Helios to be lost without being run. With Helios services that do not retry failed jobs (using MaxRetries() and RetryInterval()), the job effectively disappears from the job queue when this bug occurs. It is never passed to the service's run() method, and no job history is recorded. (BAD!)

For services that retry failed jobs, it just means one of the retries will be delayed for grab_for() seconds (default: 3600).

The patch included in the 2.601* series prevents the "lost job" problem by shutting down the worker process before the corrupt TheSchwartz::Job is inflated into a Helios::Job. Thus, no jobs will be lost, period. The grab_for() delay will still happen, but there will be NO lost jobs.

An actual fix requires a better explanation:

Apparently there is no problem with Helios, TheSchwartz, or even Data::ObjectDriver. The problem appears to be with the DBD:: modules in question. At certain times, some database queries appear to lose their LOB bindings, which causes LOB fields in the result set to be returned blank. Many of these LOB-handling bugs have been fixed in the past in DBD::mysql and DBD::Oracle, but the DBD::Oracle RT queue shows that several are still outstanding. The client I worked with on this bug has an older DBD::Oracle that predates some of the LOB-handling fixes, and the bug occurs only rarely (0.1-0.4% of jobs). Given both facts, we believe this bug is actually the result of LOB-handling bugs in the DBD modules in question.

We will try to implement a deeper fix in Helios 2.8 by checking each job object in the TheSchwartz layer before it is passed into the Helios layers. If a job object is received from the database with no args, it can be discarded and another one selected. But because any jobs could be lost, even such a small number, we did not want to wait until Helios 2.8 is ready to implement *some* sort of fix.

So for now, if you are experiencing this bug, update to the latest Helios (2.601_3750 for now; 2.61 will be out soon) and update your DBD module to the latest release.

Fri Sep 13 17:17:39 2013 LAJANDY [...] cpan.org - Status changed from 'open' to 'patched'

Fri Sep 13 17:17:39 2013 LAJANDY [...] cpan.org - Fixed in 2.601_3610 added

Fri Sep 13 17:17:40 2013 LAJANDY [...] cpan.org - Fixed in 2.601_3670 added

Sat Sep 14 19:00:29 2013 LAJANDY [...] cpan.org - Correspondence added

Sat Sep 14 19:00:29 2013 LAJANDY [...] cpan.org - Fixed in 2.601_3750 added

Wed Oct 16 20:08:02 2013 LAJANDY [...] cpan.org - Correspondence added

The workaround for this bug included with Helios::TS in 2.71_4051 appears to fix this problem. It will be included in the forthcoming Helios 2.80 production release.

Helios::TS->_grab_a_job() checks each job object before it is "grabbed" (locked in the queue for processing) and makes sure the arg() method returns a reference to a data structure. If arg() does not return a reference, _grab_a_job() skips that job and pulls the next one from the array of jobs retrieved from the job queue. This way, any LOB-binding problem that produces a blank arg() value will be avoided.
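The check described for Helios::TS->_grab_a_job() can be sketched as a filter over the jobs fetched from the queue. This is an illustration only. `FakeJob` and `first_valid_job()` are hypothetical names, and the real method also locks ("grabs") the selected job in the database, a step this sketch omits.

```perl
use strict;
use warnings;

# Minimal hypothetical job class standing in for TheSchwartz::Job; only
# the arg() and id() accessors matter for this sketch.
package FakeJob;
sub new { my ($class, %f) = @_; return bless { %f }, $class }
sub arg { $_[0]{arg} }
sub id  { $_[0]{id} }

package main;

# Walk the jobs fetched from the queue and skip any whose arg() is not a
# reference (e.g. a blank LOB that came back as ""). The real method
# would then try to grab the chosen job; that step is omitted here.
sub first_valid_job {
    my @jobs = @_;
    for my $job (@jobs) {
        next unless ref $job->arg;   # blank/corrupt arg: try the next job
        return $job;
    }
    return;                          # nothing usable in this batch
}

my @batch = (
    FakeJob->new(id => 1, arg => ''),
    FakeJob->new(id => 2, arg => [ 'job-argument' ]),
);

my $job = first_valid_job(@batch);
print 'selected job ', $job->id, "\n";   # prints "selected job 2"
```

Skipping the corrupt job instead of dying means the worker keeps running. The skipped job is never grabbed, so it stays in the queue for a later fetch.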

Wed Oct 16 20:08:03 2013 LAJANDY [...] cpan.org - Fixed in 2.61 added

Wed Oct 16 20:08:03 2013 LAJANDY [...] cpan.org - Fixed in 2.71_3860 added

Wed Oct 16 20:08:03 2013 LAJANDY [...] cpan.org - Fixed in 2.71_4051 added

Sat Mar 15 22:12:55 2014 LAJANDY [...] cpan.org - Correspondence added

Sat Mar 15 22:12:55 2014 LAJANDY [...] cpan.org - Fixed in 2.71_4250 added

Sat Mar 15 22:12:55 2014 LAJANDY [...] cpan.org - Fixed in 2.71_4350 added

Sat Mar 15 22:12:55 2014 LAJANDY [...] cpan.org - Fixed in 2.71_4460 added

Sat Mar 15 22:12:56 2014 LAJANDY [...] cpan.org - Fixed in 2.71_4770 added

Sat Mar 15 22:12:56 2014 LAJANDY [...] cpan.org - Fixed in 2.72_0950 added

Fri Mar 21 17:17:48 2014 LAJANDY [...] cpan.org - Correspondence added

Fri Mar 21 17:17:49 2014 LAJANDY [...] cpan.org - Status changed from 'patched' to 'resolved'

Fri Mar 21 17:17:49 2014 LAJANDY [...] cpan.org - Fixed in 2.80 added