framework: [QUEUE] Job aborted for unclear reasons, retried although it shouldn't have been, but was still running?
- Laravel Version: 7.2.2
- PHP Version: 7.4.3
- Database Driver & Version: Pgsql 9.6
- Horizon: 4.0.2 (note: not sure if this is Horizon-related or a queue thing, hence opening it in laravel/framework for now!)
Description:
Today I received a strange Sentry error I cannot explain; this is the stack trace:
Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\SomeJob has been attempted too many times or run too long. The job may have previously timed out.
#22 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(647): Illuminate\Queue\Worker::maxAttemptsExceededException
#21 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(434): Illuminate\Queue\Worker::markJobAsFailedIfAlreadyExceedsMaxAttempts
#20 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(346): Illuminate\Queue\Worker::process
#19 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(306): Illuminate\Queue\Worker::runJob
#18 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(132): Illuminate\Queue\Worker::daemon
#17 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\Queue\Console\WorkCommand::runWorker
#16 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\Queue\Console\WorkCommand::handle
#15 /vendor/laravel/horizon/src/Console/WorkCommand.php(46): Laravel\Horizon\Console\WorkCommand::handle
#14 [internal](0): call_user_func_array
#13 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}
#12 /vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\Container\Util::unwrapIfClosure
#11 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\Container\BoundMethod::callBoundMethod
#10 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\Container\BoundMethod::call
#9 /vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\Container\Container::call
#8 /vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\Console\Command::execute
#7 /vendor/symfony/console/Command/Command.php(255): Symfony\Component\Console\Command\Command::run
#6 /vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Illuminate\Console\Command::run
#5 /vendor/symfony/console/Application.php(912): Symfony\Component\Console\Application::doRunCommand
#4 /vendor/symfony/console/Application.php(264): Symfony\Component\Console\Application::doRun
#3 /vendor/symfony/console/Application.php(140): Symfony\Component\Console\Application::run
#2 /vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Illuminate\Console\Application::run
#1 /vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(130): Illuminate\Foundation\Console\Kernel::handle
#0 /artisan(37): null
Ok, please bear with me, this alone doesn’t look strange.
Horizon queue configuration (that specific queue the job ran on):
'jobqueue' => [
    'balance' => 'simple',
    'connection' => 'redis',
    'processes' => 2,
    'queue' => ['jobqueue'],
    'sleep' => 0,
    'timeout' => 0,
    'tries' => 1,
],
To get as much insight as possible into our queued jobs, I'm using this in my AppServiceProvider
(this will later explain why I have the logs I have):
// Requires, at the top of the provider:
// use Illuminate\Support\Facades\Queue;
// use Illuminate\Queue\Events\{JobProcessing, JobProcessed, JobExceptionOccurred, JobFailed};

protected function initializeQueueEvents(): void
{
    // Log every job pick-up.
    Queue::before(function (JobProcessing $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
    });

    // Log every successful completion.
    Queue::after(function (JobProcessed $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
    });

    // Log (and report) any exception thrown while processing.
    Queue::exceptionOccurred(function (JobExceptionOccurred $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $details = $this->getJobDetails($queueName, $job);
        $details['exception'] = $event->exception;
        $logger->error(\get_class($event), $details);
        $this->reportOtherwiseIgnoredException($event->exception);
    });

    // Log every final failure.
    Queue::failing(function (JobFailed $event) {
        $job = $event->job;
        $queueName = $job->getQueue();
        $logger = $this->getLoggerForQueue($queueName);
        $details = $this->getJobDetails($queueName, $job);
        $details['exception'] = $event->exception;
        $logger->error(\get_class($event), $details);
    });
}
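The getLoggerForQueue() and getJobDetails() helpers are not shown in the issue; judging by the fields that appear in the log excerpt further down, they could look roughly like this (a hypothetical sketch, not the actual code):

// Hypothetical reconstruction, purely for illustration.
private function getLoggerForQueue(string $queueName): \Psr\Log\LoggerInterface
{
    // E.g. one dedicated log channel per queue.
    return \Illuminate\Support\Facades\Log::channel('queue-'.$queueName);
}

private function getJobDetails(string $queueName, \Illuminate\Contracts\Queue\Job $job): array
{
    $payload = $job->payload();

    // Field names chosen to match what shows up in the log lines below.
    return [
        'name' => $payload['displayName'] ?? null,
        'id' => $job->getJobId(),
        'attempts' => $job->attempts(),
        'maxTries' => $payload['maxTries'] ?? null,
        'timeout' => $payload['timeout'] ?? null,
        'command' => $payload['data']['command'] ?? null,
    ];
}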
So, as you can see, the logging is very detailed.
And this is what I was able to log (unfortunately the lines are very, very long):
[2020-03-25T09:36:42.579873+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}
[2020-03-25T09:38:13.177079+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67077}
[2020-03-25T09:38:13.179925+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:38:13.180380+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:40:50.637754+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}
I tried to reduce this to the bare minimum, where I discovered something interesting (I stripped out the job IDs below, but they are all the same!):
- process 67079 starts processing the job
- process 67077 starts processing the same job, which
  - fails and aborts because tries=1 (JobFailed)
  - fails and aborts because tries=1 (JobExceptionOccurred)
- process 67079 finishes processing 🤪
So, going by the timestamps, roughly a minute and a half later (09:36:42 → 09:38:13) the job was considered failed ("timed out"?) and a new attempt was started, which immediately failed (expected, with tries=1).
But the original was still running?
There were no deployments, code changes or system changes when this happened, i.e. the Horizon supervisor itself had already been running for quite some time.
Via Sentry I also have the information on how the worker that generated this error was started:
Starting Artisan command: horizon:work
'horizon:work' redis --delay=0 --memory=128 --queue=jobqueue --sleep=0 --timeout=0 --tries=1 --supervisor='server-dlsK:jobqueue'
The two issues I'm facing:
- it is unclear what the actual reason was
- it seems the original process was still running while a new attempt was being dispatched
Can someone assist in digging into this?
Note regarding reproducibility: hardly reproducible, but it happens from time to time.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 19 (18 by maintainers)
@mfn Are you sure this issue should be closed? You might have resolved it by using an extremely large timeout, but the fact that Horizon does support setting it to null while the queue does not seems like abnormal behavior to me.
Oh.my.god
From config/queue.php, the default retry_after of 90 seconds: big 🤦‍♂️
TBH I was totally oblivious of that. I manage >20 queues via Horizon and basically haven't checked config/queue.php in ages.
Having many queues with different strategies, it seems to me I should rather disable it?
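For reference (not quoted from the issue itself): the Redis connection in the stock config/queue.php of a Laravel application looks roughly like this, with the 90-second default that bites here:

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    'retry_after' => 90, // a reserved job is released back onto the queue after 90 seconds
    'block_for' => null,
],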
Found people with similar puzzling issues, wondering why a job that is still actively running would already be retried:
- retry_after is defined on the connection and the timeout is per queue
- there is no per-queue retry_after, only a global one (on the respective queue connection)
I also found https://github.com/laravel/framework/issues/31562#issuecomment-590035918 which explains something I also wasn't aware of:
- the retry_after config setting is meant for the whole queue (and is stored on \Illuminate\Queue\RedisQueue::$retryAfter)
- you can also set $retryAfter on the job (probably means \Illuminate\Events\CallQueuedListener::$retryAfter), which is about delaying the retry of a failed job
TL;DR (or better: TL I still should have read): make sure your connection's retry_after is higher than your highest queue timeout, or disable it altogether!
@wfeller disabling via null might actually not work though, because of \Illuminate\Queue\RedisQueue::retrieveNextJob: if $this->retryAfter is null it will be treated like 0 in availableAt, which means the job might be marked as immediately available again.
I will go for an insanely high value for now, seems safer!
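A minimal sketch of that workaround, assuming the Redis connection block from config/queue.php (the concrete value is purely illustrative):

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    // Must be (much) larger than the longest any job on this connection can run.
    // A huge number instead of null, since null may be treated like 0 (see above).
    'retry_after' => 60 * 60 * 24, // one day
    'block_for' => null,
],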
Don't you just have a retry_after queue config value that is too low for your job?
Your jobs seem to be retried after 90 seconds (which is the default config value in queue.php) while they seem to run for about 4 minutes, so they’re automatically being retried while the 1st instance of the job is still running.
Maybe try increasing the retry_after value in your queue.php config file?
See: https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts https://github.com/laravel/laravel/blob/master/config/queue.php
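As a quick sanity check of that explanation (my own back-of-the-envelope arithmetic on the timestamps from the log excerpt above, not something taken from the issue):

// Timestamps copied from the log excerpt above.
$firstPickup  = new DateTimeImmutable('2020-03-25T09:36:42+00:00'); // process 67079 picks the job up
$secondPickup = new DateTimeImmutable('2020-03-25T09:38:13+00:00'); // process 67077 picks it up again
$finished     = new DateTimeImmutable('2020-03-25T09:40:50+00:00'); // process 67079 finally finishes

echo $secondPickup->getTimestamp() - $firstPickup->getTimestamp(); // ~91s: just past the 90s retry_after
echo PHP_EOL;
echo $finished->getTimestamp() - $firstPickup->getTimestamp();     // ~248s: the job simply runs ~4 minutes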