framework: [QUEUE] Job aborted for unclear reasons, retried although it shouldn't have been, but was still running?
- Laravel Version: 7.2.2
- PHP Version: 7.4.3
- Database Driver & Version: Pgsql 9.6
- Horizon: 4.0.2 (note: not sure if this is Horizon-related or a queue thing, hence opening it in laravel/framework for now!)
Description:
Today I received a strange Sentry error I cannot explain; this is the stack trace:
Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\SomeJob has been attempted too many times or run too long. The job may have previously timed out.
#22 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(647): Illuminate\Queue\Worker::maxAttemptsExceededException
#21 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(434): Illuminate\Queue\Worker::markJobAsFailedIfAlreadyExceedsMaxAttempts
#20 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(346): Illuminate\Queue\Worker::process
#19 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(306): Illuminate\Queue\Worker::runJob
#18 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(132): Illuminate\Queue\Worker::daemon
#17 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\Queue\Console\WorkCommand::runWorker
#16 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\Queue\Console\WorkCommand::handle
#15 /vendor/laravel/horizon/src/Console/WorkCommand.php(46): Laravel\Horizon\Console\WorkCommand::handle
#14 [internal](0): call_user_func_array
#13 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}
#12 /vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\Container\Util::unwrapIfClosure
#11 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\Container\BoundMethod::callBoundMethod
#10 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\Container\BoundMethod::call
#9 /vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\Container\Container::call
#8 /vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\Console\Command::execute
#7 /vendor/symfony/console/Command/Command.php(255): Symfony\Component\Console\Command\Command::run
#6 /vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Illuminate\Console\Command::run
#5 /vendor/symfony/console/Application.php(912): Symfony\Component\Console\Application::doRunCommand
#4 /vendor/symfony/console/Application.php(264): Symfony\Component\Console\Application::doRun
#3 /vendor/symfony/console/Application.php(140): Symfony\Component\Console\Application::run
#2 /vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Illuminate\Console\Application::run
#1 /vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(130): Illuminate\Foundation\Console\Kernel::handle
#0 /artisan(37): null
Ok, please bear with me, this alone doesn’t look strange.
Horizon queue configuration (that specific queue the job ran on):
'jobqueue' => [
    'balance' => 'simple',
    'connection' => 'redis',
    'processes' => 2,
    'queue' => ['jobqueue'],
    'sleep' => 0,
    'timeout' => 0,
    'tries' => 1,
],
To get as much insight as possible into our queued jobs, I'm using this in my AppServiceProvider
(this will later explain why I have the logs I have):
// Requires, at the top of the provider:
// use Illuminate\Support\Facades\Queue;
// use Illuminate\Queue\Events\{JobProcessing, JobProcessed, JobExceptionOccurred, JobFailed};

protected function initializeQueueEvents(): void
{
    // Log every job pick-up.
    Queue::before(function (JobProcessing $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
    });

    // Log every successful completion.
    Queue::after(function (JobProcessed $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
    });

    // Log (and report) any exception thrown while processing.
    Queue::exceptionOccurred(function (JobExceptionOccurred $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $details = $this->getJobDetails($queueName, $job);
        $details['exception'] = $event->exception;
        $logger->error(\get_class($event), $details);
        $this->reportOtherwiseIgnoredException($event->exception);
    });

    // Log every final failure.
    Queue::failing(function (JobFailed $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $details = $this->getJobDetails($queueName, $job);
        $details['exception'] = $event->exception;
        $logger->error(\get_class($event), $details);
    });
}
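The getLoggerForQueue() and getJobDetails() helpers are not shown in the issue; judging by the fields that appear in the log excerpt further down, they could look roughly like this (a hypothetical sketch, not the actual code):

// Hypothetical reconstruction, purely for illustration.
private function getLoggerForQueue(string $queueName): \Psr\Log\LoggerInterface
{
    // E.g. one dedicated log channel per queue.
    return \Illuminate\Support\Facades\Log::channel('queue-'.$queueName);
}

private function getJobDetails(string $queueName, \Illuminate\Contracts\Queue\Job $job): array
{
    $payload = $job->payload();

    // Field names chosen to match what shows up in the log lines below.
    return [
        'name' => $payload['displayName'] ?? null,
        'id' => $job->getJobId(),
        'attempts' => $job->attempts(),
        'maxTries' => $payload['maxTries'] ?? null,
        'timeout' => $payload['timeout'] ?? null,
        'command' => $payload['data']['command'] ?? null,
    ];
}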
So, as you can see, the logging is very detailed.
And this is what I was able to log (unfortunately the lines are very, very long):
[2020-03-25T09:36:42.579873+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}
[2020-03-25T09:38:13.177079+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67077}
[2020-03-25T09:38:13.179925+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:38:13.180380+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:40:50.637754+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}
I tried to reduce this to the bare minimum, where I discovered something interesting (I stripped out the job IDs below, but they are all the same!):
- process 67079 starts processing the job
- process 67077 starts processing the same job, which
  - fails and aborts because tries=1 (JobFailed)
  - fails and aborts because tries=1 (JobExceptionOccurred)
- process 67079 finishes processing 🤪
So, going by the timestamps, roughly a minute and a half later (09:36:42 → 09:38:13) the job was considered failed ("timed out"?) and a new attempt was started, which immediately failed (expected, with tries=1).
But the original was still running?
There were no deployments, code changes or system changes when this happened, i.e. the Horizon supervisor itself had already been running for quite some time.
Via Sentry I also have the information on how the worker that generated this error was started:
Starting Artisan command: horizon:work
'horizon:work' redis --delay=0 --memory=128 --queue=jobqueue --sleep=0 --timeout=0 --tries=1 --supervisor='server-dlsK:jobqueue'
The two issues I'm facing:
- it is unclear what the actual reason was
- it seems the original process was still running while a new attempt was being dispatched
Can someone assist in digging into this?
Note regarding reproducibility: hardly reproducible, but it happens from time to time.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 19 (18 by maintainers)
@mfn Are you sure this issue should be closed? You might have resolved it by using an extremely large timeout, but the fact that Horizon does support setting it to null while the queue does not seems like abnormal behavior to me.
Oh.my.god
From config/queue.php, the default retry_after of 90 seconds: big 🤦‍♂️
TBH I was totally oblivious of that. I manage >20 queues via Horizon and basically haven't checked config/queue.php in ages.
Having many queues with different strategies, it seems to me I should rather disable it?
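For reference (not quoted from the issue itself): the Redis connection in the stock config/queue.php of a Laravel application looks roughly like this, with the 90-second default that bites here:

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    'retry_after' => 90, // a reserved job is released back onto the queue after 90 seconds
    'block_for' => null,
],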
Found people with similar puzzling issues, wondering why a job that is still actively running would already be retried:
- retry_after is defined on the connection and the timeout is per queue
- there is no per-queue retry_after, only a global one (on the respective queue connection)
I also found https://github.com/laravel/framework/issues/31562#issuecomment-590035918 which explains something I also wasn't aware of:
- the retry_after config setting is meant for the whole queue (and is stored on \Illuminate\Queue\RedisQueue::$retryAfter)
- you can also set $retryAfter on the job (probably means \Illuminate\Events\CallQueuedListener::$retryAfter), which is about delaying the retry of a failed job
TL;DR (or better: TL I still should have read): make sure your connection's retry_after is higher than your highest queue timeout, or disable it altogether!
@wfeller disabling via null might actually not work though, because of \Illuminate\Queue\RedisQueue::retrieveNextJob: if $this->retryAfter is null it will be treated like 0 in availableAt, which means the job might be marked as immediately available again.
I will go for an insanely high value for now, seems safer!
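A minimal sketch of that workaround, assuming the Redis connection block from config/queue.php (the concrete value is purely illustrative):

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    // Must be (much) larger than the longest any job on this connection can run.
    // A huge number instead of null, since null may be treated like 0 (see above).
    'retry_after' => 60 * 60 * 24, // one day
    'block_for' => null,
],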
Don't you just have a retry_after queue config value that is too low for your job?
Your jobs seem to be retried after 90 seconds (which is the default config value in queue.php) while they seem to run for about 4 minutes, so they’re automatically being retried while the 1st instance of the job is still running.
Maybe try increasing the retry_after value in your queue.php config file?
See: https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts https://github.com/laravel/laravel/blob/master/config/queue.php
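As a quick sanity check of that explanation (my own back-of-the-envelope arithmetic on the timestamps from the log excerpt above, not something taken from the issue):

// Timestamps copied from the log excerpt above.
$firstPickup  = new DateTimeImmutable('2020-03-25T09:36:42+00:00'); // process 67079 picks the job up
$secondPickup = new DateTimeImmutable('2020-03-25T09:38:13+00:00'); // process 67077 picks it up again
$finished     = new DateTimeImmutable('2020-03-25T09:40:50+00:00'); // process 67079 finally finishes

echo $secondPickup->getTimestamp() - $firstPickup->getTimestamp(); // ~91s: just past the 90s retry_after
echo PHP_EOL;
echo $finished->getTimestamp() - $firstPickup->getTimestamp();     // ~248s: the job simply runs ~4 minutes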