The main problem with this bug is that it can cause jobs submitted to Helios to be lost without being run. For Helios services that do not retry failed jobs (that is, services that do not use MaxRetries() and RetryInterval()), when this bug occurs the job effectively disappears from the job queue without being passed to the service's run() and without any job history being recorded. (BAD!)
For services that retry failed jobs, it just means one of the retries will be delayed for grab_for() seconds (default: 3600).
The patch included in the 2.601* series prevents the "lost job" problem by shutting down the worker process before the corrupt TheSchwartz::Job is inflated to a Helios::Job. Thus, no jobs will be lost, period. The grab_for() delay will still happen, but there will be NO lost jobs.
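To illustrate the idea behind the patch, here is a minimal sketch in Python. It is not the actual Helios code, which is Perl. All names here (inflate_job, work_one, CorruptJobError, the dict-shaped job) are hypothetical stand-ins. The sketch assumes a raw job whose "arg" should be a list, and shows the worker shutting down instead of marking the job as succeeded or failed, so the job stays in the queue:

```python
import logging

logger = logging.getLogger("worker_sketch")


class CorruptJobError(Exception):
    """Raised when a raw job arrives without usable arguments."""


def inflate_job(raw_job):
    # Hypothetical stand-in for turning a queue-level job into a service-level job.
    args = raw_job.get("arg")
    if not isinstance(args, list):
        # An empty string here means the copy we were handed is corrupt,
        # not that the job has no arguments.
        raise CorruptJobError(f"job {raw_job.get('jobid')} has no argument list")
    return {"jobid": raw_job["jobid"], "args": args}


def work_one(raw_job, run, shutdown):
    try:
        job = inflate_job(raw_job)
    except CorruptJobError as err:
        # Log and stop the worker WITHOUT marking the job completed or failed.
        # The job then stays queued until its grab expires and another worker takes it.
        logger.critical("corrupt job received: %s", err)
        shutdown()
        return "worker_exited"
    run(job)
    return "ran"
```

The design point is that the error path never reports a job outcome, so the only cost of a corrupt read is the grab_for() delay.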
An actual fix requires a better explanation:
Apparently there is no problem with Helios, TheSchwartz, or even Data::ObjectDriver. The problem appears to be with the DBD:: modules in question. At certain times some database queries appear to lose their LOB bindings, which causes LOB fields in the result set to be returned blank. Many of these LOB-handling bugs have been fixed in the past in DBD::mysql and DBD::Oracle, but a look at the DBD::Oracle RT queue will reveal that several are still outstanding. Given that the client I worked with on this bug has an older DBD::Oracle that pre-dates some of the LOB handling fixes, and given the low rate at which the issue occurs (0.1-0.4% of jobs), we believe this bug is actually a result of LOB handling bugs in the DBD modules in question.
We will try to implement a deeper fix in Helios 2.8 by checking a job object in the TheSchwartz layer before it is passed into the Helios layers. If a job object is received from the database with no args, it can be discarded and another one selected. But given that any jobs could be lost, even such a small number, we did not want to wait until Helios 2.8 is ready to implement *some* sort of fix.
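As a rough sketch of what that check might look like (again in Python rather than Perl, and not the planned Helios 2.8 code), the selection step could skip any candidate whose arguments came back blank and move on to the next one. The names has_args and select_job and the dict-shaped jobs are hypothetical. The sketch also assumes that "discarded" means only dropping the in-memory copy, so the job itself remains in the queue for a later attempt:

```python
def has_args(raw_job):
    # A healthy job carries its arguments as a list; a blank value
    # suggests the LOB field was lost on the way back from the database.
    return isinstance(raw_job.get("arg"), list)


def select_job(candidates):
    # Return the first candidate that arrived intact, or None if none did.
    for raw_job in candidates:
        if has_args(raw_job):
            return raw_job
        # Corrupt copy: drop it and try the next candidate. Nothing is
        # deleted from the queue here, so the job can be fetched again later.
    return None
```

For example, given one candidate with arg "" followed by one with arg ["x"], select_job returns the second. If every candidate is blank it returns None and the worker simply has nothing to run on that pass.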
So for now, if you are experiencing this bug, update to the latest Helios (2.601_3750 for now; 2.61 will be out soon) and update your DBD module to the latest release.
On Sun Aug 11 17:48:19 2013, LAJANDY wrote:
> A potential patch for this bug has been committed to GitHub:
>
> https://github.com/logicalhelion/helios/commit/25654bbf106be0d91b4447c6e246fc16fe0026f1
>
> If it passes testing, it will be rolled into a forthcoming bugfix
> release.
>
> It should be noted, however, that this does not actually fix the
> problem--it just handles the problem in a way that does not cause non-
> retrying jobs to disappear from the job queue. This bug is actually
> being caused by TheSchwartz for some reason; TheSchwartz is passing
> Helios::Service a TheSchwartz::Job object with an empty string for
> arg(), even though the job in question does indeed have job arguments.
> This causes Helios::Job->new() to bomb when trying to start job
> argument processing--it expects arg() to return an arrayref, NOT a
> string. Changing Helios::Job to handle the empty string is the wrong
> idea--the job actually has arguments, Helios just didn't get them
> (thus, the copy of the job Helios was given is corrupted). Trying to
> run a job while not having its arguments would be worse than not
> running it at all. This patch catches the error, logs a Critical
> error to the Helios log, and then exits the worker process. That way,
> TheSchwartz will not force a failure of the job (which it will do if a
> worker doesn't mark a job as successful or failed) and the job will
> stay in the job queue until its grabbed_until expires and another
> worker process picks it up.
>
> Further future investigation will hopefully reveal the core reason for
> this bug, but this patch at least ensures job integrity and system
> reliability.
>
> On Fri Aug 09 17:13:20 2013, LAJANDY wrote:
> > On Mon Sep 17 09:13:21 2012, LAJANDY wrote:
> > > Sometimes when a worker process picks up a job, it fails with:
> > >
> > > "Can't use string ("") as an ARRAY ref while "strict refs" in use
> > > at
> > > /usr/lib/perl5/site_perl/5.8.8/Helios/Job.pm line 128."
> > >
> > > in the ERROR table. No success or failure messages are reported in
> > > job
> > > history.
> > >
> > > The job is picked up later by another process and completes
> > > successfully.
> >
> > A GitHub branch has been created for this bug:
> >
> > https://github.com/logicalhelion/helios/tree/bug/rt79690
> >