delayed_job_active_record: Mysql2::Error: Deadlock when attempting to lock a job
We’ve just upgraded from 0.3.3 to 0.4.4 to resolve the race condition problem when running multiple workers. We’re now seeing occasional MySQL deadlocks, which surface as unhandled exceptions and end up killing the worker.
For example:
svc1 Jun 5 12:35:17 Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `locked_at` = '2013-06-05 12:35:16', `locked_by` = 'delayed_job.2 host:svc1 pid:20498' WHERE `delayed_jobs`.`queue` IN ('batch') AND ((run_at <= '2013-06-05 12:35:16' AND (locked_at IS NULL OR locked_at < '2013-06-05 08:35:16') OR locked_by = 'delayed_job.2 host:svc1 pid:20498') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1
It would seem that, at the very least, this should be handled.
In case it matters, we’re using delayed_job 3.0.4.
About this issue
- Original URL
- State: open
- Created 11 years ago
- Comments: 128 (13 by maintainers)
Commits related to this issue
- Retry transactions when deadlocked Fixes #63 — committed to aaronjensen/delayed_job_active_record by aaronjensen 10 years ago
- Retry delete when deadlocked Fixes #63 — committed to aaronjensen/delayed_job_active_record by aaronjensen 10 years ago
- [#85070062] Sync the repo with the original repo https://www.pivotaltracker.com/story/show/85070062 In order to use the gem in Rails 4.2: * Use the code from the original repository * Apply the pat... — committed to notonthehighstreet/delayed_job_active_record by ventsislaf 9 years ago
- Retry delete when deadlocked Fixes #63 — committed to TeachingChannel/delayed_job_active_record by aaronjensen 10 years ago
Edit: see the follow-up at https://github.com/collectiveidea/delayed_job_active_record/issues/63#issuecomment-25468573. The new queries may not be a net gain, even without deadlocks.
We were bitten by this as well during an upgrade.
I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.
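Roughly, the repro job looks like this (a reconstruction from the description above, with a made-up class name and random ranges, not the actual test code):

```ruby
# Reconstruction of the repro described above; the class name and the random
# ranges are made up. Each run enqueues two more copies of itself with
# semi-random priority and run_at, so two workers quickly end up reserving
# and deleting rows in the delayed_jobs table concurrently.
class SelfReplicatingJob
  def perform
    2.times do
      Delayed::Job.enqueue(
        SelfReplicatingJob.new,
        priority: rand(10),
        run_at:   Time.now + rand(0..5) # run 0-5 seconds from now
      )
    end
  end
end

# Seed it once, then start two workers:
# Delayed::Job.enqueue(SelfReplicatingJob.new)
```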
The output of `show engine innodb status` says the contention is between the `UPDATE` query now used to reserve a job, and the `DELETE` query used to clean up a finished job. Surprising! Apparently the `DELETE` query first acquires a lock on the primary index (`id`) and then on the secondary index (`priority, run_at`). But the `UPDATE` query is using the (`priority, run_at`) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the `UPDATE` and `DELETE` queries each grab one of two locks for a given row, try to acquire the other, and 💥. MySQL resolves by killing the `UPDATE`, which crashes the worker.
The fix I’ve worked out locally is to replace the index on (`priority, run_at`) with an index on (`priority, run_at, locked_by`). This completely stabilizes my concurrency test! My theory is that it allows the `UPDATE` query’s scan to skip over rows held by workers, which takes it out of contention with the `DELETE` query.
Hope this helps.
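If you want to try the same index change, a migration along these lines should do it (a sketch only; the index name is an assumption based on the gem's generated migration, so check your actual schema first):

```ruby
# Sketch of the index swap described above. The index name is an assumption
# based on the gem's generated migration; verify it with
# `SHOW INDEX FROM delayed_jobs` before running this. Depending on your
# charset and row format, you may need a prefix length on locked_by.
class WidenDelayedJobsPriorityIndex < ActiveRecord::Migration
  def up
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at, :locked_by],
              name: "delayed_jobs_priority"
  end

  def down
    remove_index :delayed_jobs, name: "delayed_jobs_priority"
    add_index :delayed_jobs, [:priority, :run_at],
              name: "delayed_jobs_priority"
  end
end
```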
🔒🔒
@yyyc514 this project is working quite well except for a minority of people hitting an edge-case scenario on MySQL. This issue primarily comes up when running a lot of really fast jobs against a MySQL database, at which point one has to ask why the jobs are being delayed at all.
About the only solution that stands a chance of working long term is adding a configuration option that allows users to switch the job locker back to the old style. If someone wants to put together a pull request for that, we can look at getting it merged, but it is not a priority item we will be working on directly right now.
Having a similar issue with the latest version (4.1.3) of the `delayed_job_active_record` gem. In our case, the deadlock occurs occasionally, when new jobs are being inserted and there is a locked job.
I’ve used the following workaround to fix the issue:
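A minimal sketch of one workaround in this spirit, assuming the goal is simply to retry the enqueue when MySQL reports a deadlock (the wrapper name is made up, and this is not necessarily the snippet referred to above):

```ruby
# Illustrative sketch with a made-up wrapper name. Retries the enqueue when
# the INSERT deadlocks against a worker holding locks on delayed_jobs rows.
def enqueue_with_deadlock_retry(payload, options = {}, attempts = 3)
  Delayed::Job.enqueue(payload, options)
rescue ActiveRecord::StatementInvalid => e
  # Mysql2::Error gets wrapped in ActiveRecord::StatementInvalid.
  raise unless e.message =~ /Deadlock found when trying to get lock/
  attempts -= 1
  retry if attempts > 0
  raise
end
```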
@smikkelsen the SQL must look something like this:
My Rails initializer:
I use it successfully in production (MySQL 5.6, 8 workers); with the other SQL strategy I ran into deadlocks constantly (and it also littered the MySQL log files with replication warnings).
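For anyone looking for the same setup: newer versions of the gem expose a `reserve_sql_strategy` setting, and an initializer along these lines switches reservation back to the legacy SQL (a sketch; confirm the option name against the README of the gem version you have installed):

```ruby
# config/initializers/delayed_job.rb
# Sketch: use the legacy SQL strategy for reserving jobs instead of the
# single optimized UPDATE. Check the gem README for your installed version.
Delayed::Backend::ActiveRecord.configure do |config|
  config.reserve_sql_strategy = :default_sql
end
```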
@albus522 I can’t refute or support your assessment of how much of an edge case minority I’m in. I certainly can describe what I consider valid reasons for having a task be run in the background independent of how long the task takes to execute.
But as a member of the edge case minority, I will offer that understanding your position on the issue would have been appreciated. In the absence of that information, quite a few people have spent what seems like a good chunk of time trying to come up with a solution and then present that solution for consideration. Those solutions apparently weren’t going to stand a chance for inclusion, so those of us working on them could have just focused on what you were thinking was a reasonable approach instead.
I’m the person that originally opened this issue. Just wanted to say that I’ve now stopped hoping that someone will actually do something to address this in the gem. It is rather clear from the number of comments on this issue that the current approach is quite broken for a large number of users. In spite of many suggestions to fix the issue, and many pull requests with suggested fixes, it appears as broken today as it did nearly two years ago.
I ended up forking the gem, and just going with the “tried and true” original job reservation approach. I was initially concerned about the performance impacts (after all, this whole ordeal was initiated based on a desire to improve the reservation performance), but I have seen no noticeable performance degradation under load. Note that we only keep 7 days’ worth of failed jobs around, so perhaps at some point performance would measurably suffer if jobs weren’t occasionally removed.
I’ll check back in occasionally, because I am sort of curious just how long this will be allowed to remain an issue on the master branch. But at least I don’t have to worry about cleaning up after deadlocked jobs anymore.
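For reference, the original (“tried and true”) reservation approach mentioned above is, roughly, a two-step optimistic lock: select a few candidate rows, then try to claim one with a conditional `UPDATE` and check the affected-row count. A sketch of the idea, with illustrative names and batch size, not the fork's actual code:

```ruby
# Sketch of the legacy two-step reservation idea; the worker name, batch
# size and variable names are illustrative only.
now          = Time.now.utc
max_run_time = 4.hours                                   # delayed_job's default
worker_name  = "delayed_job host:example pid:#{Process.pid}"

candidates = Delayed::Job.
  where("run_at <= ? AND (locked_at IS NULL OR locked_at < ?) AND failed_at IS NULL",
        now, now - max_run_time).
  order("priority ASC, run_at ASC").
  limit(5)

job = candidates.detect do |candidate|
  # Claim the row with a conditional UPDATE; exactly one affected row means
  # this worker won the race, zero means another worker got there first.
  Delayed::Job.
    where("id = ? AND (locked_at IS NULL OR locked_at < ? OR locked_by = ?)",
          candidate.id, now - max_run_time, worker_name).
    update_all(locked_at: now, locked_by: worker_name) == 1
end
```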
Awesome, now you can choose between a slower option and a fundamentally flawed option (see my comments above for why it is fundamentally flawed). Not sure why I’m still subscribed to this thread… unsubscribing. Good luck, all.