Retry stalled jobs
When a worker gets a SIGINT or SIGTERM signal requesting it to terminate, it waits for running jobs to finish before actually exiting. But if the worker gets a second SIGINT or SIGTERM signal, or if it’s killed with SIGKILL, it will terminate immediately, possibly leaving jobs with the doing status in the queue.
And, if no specific action is taken, these stalled jobs will remain in the queue
forever, and their execution will never resume.
To address that problem, Procrastinate offers functions that can be used in a periodic task for retrying stalled jobs. Add the following in your code to enable automatic retry of tasks after some time:
    # time in seconds for running jobs to be deemed as stalled
    RUNNING_JOBS_MAX_TIME = 30

    @app.periodic(cron="*/10 * * * *")
    @app.task(queueing_lock="retry_stalled_jobs", pass_context=True)
    async def retry_stalled_jobs(context, timestamp):
        stalled_jobs = await app.job_manager.get_stalled_jobs(
            nb_seconds=RUNNING_JOBS_MAX_TIME
        )
        for job in stalled_jobs:
            await app.job_manager.retry_job(job)
This defines a periodic task, configured to be deferred every 10 minutes. The task retrieves all the jobs that have been in the doing status for more than 30 seconds, and restarts them (marking them with the todo status in the database).
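To make the selection rule concrete, here is a toy in-memory model of it. This is only a sketch: ToyJob, find_stalled and retry are hypothetical names invented for illustration, and the code does not reflect how Procrastinate stores or queries jobs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ToyJob:
    """Hypothetical in-memory job record, not Procrastinate's Job class."""

    id: int
    status: str  # "todo", "doing", ...
    started_at: datetime  # when the job entered its current status


def find_stalled(jobs, nb_seconds, now):
    """Return the jobs that have been in "doing" for more than nb_seconds."""
    limit = timedelta(seconds=nb_seconds)
    return [
        job
        for job in jobs
        if job.status == "doing" and now - job.started_at > limit
    ]


def retry(job):
    """Put the job back in the queue by marking it "todo"."""
    job.status = "todo"


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
jobs = [
    ToyJob(1, "doing", now - timedelta(seconds=45)),  # past the limit: stalled
    ToyJob(2, "doing", now - timedelta(seconds=5)),  # still within the limit
    ToyJob(3, "todo", now - timedelta(seconds=90)),  # not running: ignored
]

for job in find_stalled(jobs, nb_seconds=30, now=now):
    retry(job)
```

Only the first job is put back to todo; the second one is left alone because it has not yet exceeded the 30-second limit.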
With this, if you have multiple workers, and, for some reason, one of them gets killed while running jobs, then one of the remaining workers will run the retry_stalled_jobs task, marking the stalled jobs for retry.
If you have specific rules for task retry (e.g. only some tasks should be retried, based on specific parameters, or the duration before a task is considered stalled should depend on the task), you’re free to make the periodic task function more complex and add your logic to it. See JobManager.get_stalled_jobs() for details.
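As one possible shape for such custom logic, the sketch below keeps only the stalled jobs whose task is on an allow-list. The RETRYABLE_TASKS set, the select_retryable helper and the FakeJob stand-in are all hypothetical; the only assumption about real job objects is that they expose a task_name attribute.

```python
from dataclasses import dataclass

# Hypothetical allow-list: task names considered safe to retry automatically.
RETRYABLE_TASKS = {"send_email", "resize_image"}


@dataclass
class FakeJob:
    """Stand-in for a job object; only the attributes used below."""

    id: int
    task_name: str


def select_retryable(stalled_jobs, retryable_tasks=RETRYABLE_TASKS):
    """Keep only the stalled jobs whose task is on the allow-list."""
    return [job for job in stalled_jobs if job.task_name in retryable_tasks]


# Inside the periodic task, the helper would sit between the two calls:
#
#     stalled_jobs = await app.job_manager.get_stalled_jobs(
#         nb_seconds=RUNNING_JOBS_MAX_TIME
#     )
#     for job in select_retryable(stalled_jobs):
#         await app.job_manager.retry_job(job)

stalled = [
    FakeJob(1, "send_email"),
    FakeJob(2, "charge_card"),  # not on the allow-list: left untouched
    FakeJob(3, "resize_image"),
]
to_retry = select_retryable(stalled)
```

Keeping the filter in a plain function makes the rule easy to test on its own, without a database or a running worker.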
Also, note that a job considered stalled will be retried even if it is actually still running, in which case your task may run twice. Only retry a job when you’re reasonably sure that it is not running anymore, so choose a stalled duration that is long enough for your jobs.