delayed_job_active_record: Mysql2::Error: Deadlock when attempting to lock a job

We’ve just upgraded from 0.3.3 to 0.4.4 to resolve the race condition when running multiple workers. We’re now seeing occasional MySQL deadlocks, which surface as unhandled exceptions and end up killing the worker.

For example:

svc1 Jun  5 12:35:17  Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `locked_at` = '2013-06-05 12:35:16', `locked_by` = 'delayed_job.2 host:svc1 pid:20498' WHERE `delayed_jobs`.`queue` IN ('batch') AND ((run_at <= '2013-06-05 12:35:16' AND (locked_at IS NULL OR locked_at < '2013-06-05 08:35:16') OR locked_by = 'delayed_job.2 host:svc1 pid:20498') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

It would seem that, at the very least, this should be handled.

In case it matters, we’re using delayed_job 3.0.4.

About this issue

  • State: open
  • Created 11 years ago
  • Comments: 128 (13 by maintainers)

Most upvoted comments

Edit: see the follow-up at https://github.com/collectiveidea/delayed_job_active_record/issues/63#issuecomment-25468573. The new queries may not be a net gain, even without the deadlocks.

We were bitten by this as well during an upgrade.

I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.
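
A sketch of that kind of self-replicating job (the class name, depth cap, and random ranges here are illustrative, not the exact code I ran):

class ReplicatingJob < Struct.new(:depth)
  def perform
    return if depth >= 12 # cap added here so the sketch eventually stops

    2.times do
      Delayed::Job.enqueue ReplicatingJob.new(depth + 1),
                           priority: rand(10),        # semi-random priority
                           run_at: Time.now + rand(5) # semi-random run_at
    end
  end
end

# Seed one job, then run two workers against the same MySQL database.
Delayed::Job.enqueue ReplicatingJob.new(0)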

The output of SHOW ENGINE INNODB STATUS shows that the contention is between the UPDATE query now used to reserve a job and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (id) and then on the secondary index (priority, run_at), while the UPDATE query scans the table using the (priority, run_at) index and grabs primary-key locks as it goes. Eventually the UPDATE and DELETE each hold one of the two locks for a given row, try to acquire the other, and 💥. MySQL resolves the deadlock by killing the UPDATE, which crashes the worker.

The fix I’ve worked out locally is to replace the index on (priority, run_at) with an index on (priority, run_at, locked_by). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query’s scan to skip over rows held by workers, which takes it out of contention with the DELETE query.
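
A migration along these lines would apply the swap (the index name "delayed_jobs_priority" is assumed to match the generator's default; adjust to your schema):

class ExpandDelayedJobsPriorityIndex < ActiveRecord::Migration
  def up
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at, :locked_by], name: "delayed_jobs_priority"
  end

  def down
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at], name: "delayed_jobs_priority"
  end
end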

Hope this helps.

🔒🔒

@yyyc514 this project is working quite well except for a minority of people hitting an edge-case scenario on MySQL. This issue primarily comes up when running a lot of really fast jobs against a MySQL database, at which point why are the jobs being delayed at all?

About the only solution that stands a chance of working long term is adding a configuration option that lets users switch the job locker back to the old style. If someone wants to put together a pull request for that, we can look at getting it merged, but it is not a priority item we will be working on directly right now.

Having a similar issue with the latest version (4.1.3) of the delayed_job_active_record gem. In our case, the deadlock occurs occasionally when new jobs are being inserted while an existing job is locked.

I’ve used the following workaround to fix the issue:

# config/initializers/delayed_job_active_record.rb
# workaround based on https://github.com/collectiveidea/delayed_job_active_record/pull/91
#
module Delayed
  module Backend
    module ActiveRecord
      class Job < ::ActiveRecord::Base
        # Retry saves that MySQL kills as the deadlock victim
        # (ActiveRecord::Deadlocked), up to 100 times, instead of letting
        # the exception propagate and kill the worker.
        def save(*)
          retries = 0

          begin
            super
          rescue ::ActiveRecord::Deadlocked => e
            if retries < 100
              logger.info "ActiveRecord::Deadlocked rescued: #{e.message}"
              logger.info 'Retrying...'

              retries += 1
              retry
            else
              raise
            end
          end
        end
      end
    end
  end
end

@smikkelsen the SQL must look something like this:

Delayed::Backend::ActiveRecord::Job Load (5.2ms)  SELECT `delayed_jobs`.* FROM `delayed_jobs` WHERE `delayed_jobs`.`queue` IN ('sms') AND ((run_at <= '2016-03-03 23:01:09' AND (locked_at IS NULL OR locked_at < '2016-03-03 19:01:09') OR locked_by = 'www02 pid:2225') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC, id ASC LIMIT 5

My Rails initializer:

Delayed::Backend::ActiveRecord.configure do |config|
  config.reserve_sql_strategy = :default_sql
end

I use it successfully in production (MySQL 5.6, 8 workers); with the other SQL strategy I ran into deadlocks constantly, and it also littered the MySQL log files with replication warnings.

@albus522 I can’t refute or support your assessment of how much of an edge case minority I’m in. I certainly can describe what I consider valid reasons for having a task be run in the background independent of how long the task takes to execute.

But as a member of that edge-case minority, I will offer that understanding your position on the issue earlier would have been appreciated. In the absence of that information, quite a few people spent what seems like a good chunk of time coming up with solutions and presenting them for consideration. Those solutions apparently never stood a chance of being included, so those of us working on them could instead have focused on whatever approach you did consider reasonable.

I’m the person that originally opened this issue. Just wanted to say that I’ve now stopped hoping that someone will actually do something to address this in the gem. It is rather clear from the number of comments on this issue that the current approach is quite broken for a large number of users. In spite of many suggestions to fix the issue, and many pull requests with suggested fixes, it appears as broken today as it did nearly two years ago.

I ended up forking the gem and just going with the “tried and true” original job reservation approach. I was initially concerned about the performance impact (after all, this whole ordeal started with a desire to improve reservation performance), but I have seen no noticeable degradation under load. Note that we only keep 7 days’ worth of failed jobs around, so perhaps performance would measurably suffer at some point if jobs weren’t occasionally removed.
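
For anyone curious, the original approach is roughly a SELECT of ready candidates followed by a conditional UPDATE per candidate that checks the affected-row count. A simplified sketch (not the gem’s exact code; worker_name is a placeholder):

now          = Delayed::Job.db_time_now
max_run_time = Delayed::Worker.max_run_time
worker_name  = 'delayed_job host:example pid:1234' # placeholder

candidates = Delayed::Job.where(failed_at: nil)
                         .where('run_at <= ?', now)
                         .where('locked_at IS NULL OR locked_at < ? OR locked_by = ?',
                                now - max_run_time, worker_name)
                         .order(:priority, :run_at)
                         .limit(5)

job = candidates.detect do |candidate|
  # The conditional UPDATE only matches while the row is unlocked (or its
  # lock is stale), so an affected-row count of 1 means we won the race.
  updated = Delayed::Job.where(id: candidate.id)
                        .where('locked_at IS NULL OR locked_at < ?', now - max_run_time)
                        .update_all(locked_at: now, locked_by: worker_name)
  updated == 1
end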

I’ll check back in occasionally, because I am sort of curious just how long this will be allowed to remain an issue on the master branch. But at least I don’t have to worry about cleaning up after deadlocked jobs anymore.

Awesome, now you can choose between a slower option and a fundamentally flawed option (see my comments above for why it is fundamentally flawed). Not sure why I’m still subscribed to this thread… unsubscribing. Good luck, all.