sidekiq-unique-jobs: until_and_while_executing jobs can conflict with themselves
Describe the bug
Jobs for workers that use until_and_while_executing sometimes get stuck because they appear to conflict with themselves.
Expected behavior
Executing StatusRequestWorker.perform_async(123) should run the job.
Current behavior
Note: The problem only occurs sporadically.
A SidekiqUniqueJobs::Conflict error is raised. The error message says “Item with the key: uniquejobs:0a7563aa8a2c1b4d81c6c8d02a4b3e96:RUN is already scheduled or processing”.
Further inspection of the locks reveals that the lock is taken by the job itself (JID 28b40e216188051797e08954 in this example). Since the lock is never released, the job does not execute and eventually dies once the retry attempts are exhausted.
Worker class
class StatusRequestWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing, on_conflict: { client: :log, server: :raise }

  def perform(id)
    # ...
  end
end
Additional context
- Using sidekiq-unique-jobs v7.1.5
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 16 (7 by maintainers)
Commits related to this issue
- update heartbeat worker uniqueness strategy - related to https://github.com/mhenrixon/sidekiq-unique-jobs/issues/630 — committed to keygen-sh/keygen-api by ezekg 2 years ago
I think it’s a sidekiq issue, not a toxiproxy issue.
First, I think you need Sidekiq to actually perform the job inline, not just enqueue it (enqueueing is the default when including “sidekiq/testing”, i.e. fake!). I believe the root bug is around job execution, not job enqueueing, hence the importance of actually performing the job. The errors I saw in my application were always when Sidekiq tried to perform the job, never when it tried to enqueue the job.
Second, when using the Sidekiq test harness, no server middleware is run, so you're not actually executing any of the unique locking code. You need to configure test middleware separately. More info: https://github.com/mperham/sidekiq/wiki/Testing#testing-server-middleware
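Roughly, those two bits would look something like the following in a test helper. This is only a sketch, not the exact code from the PR, and I'm assuming the v7 server middleware class name SidekiqUniqueJobs::Middleware::Server, so adjust for your version:

require "sidekiq/testing"

# Actually perform jobs instead of just pushing them onto the fake queue.
Sidekiq::Testing.inline!

# The test harness skips server middleware by default, so the unique-lock
# code never runs unless it is added to the testing middleware chain.
Sidekiq::Testing.server_middleware do |chain|
  chain.add SidekiqUniqueJobs::Middleware::Server
end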
If I add those two bits to your before block in your PR, the test fails (as we're expecting). If I reduce the toxiproxy latency to 20ms and ditch the jitter (matching my repro), the test also fails. If I drop the toxiproxy latency to 15ms, the test passes, which at least proves that with little-to-no latency, things proceed without error. Progress!
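For reference, injecting that kind of latency with the toxiproxy gem can look roughly like this. It's a sketch with a made-up proxy name and port rather than the PR's actual test setup, and toxiproxy-server still needs to be running:

require "toxiproxy"
require "redis"

# One proxy in front of the local Redis; point the client at the proxy port.
Toxiproxy.populate([
  { name: "redis", listen: "127.0.0.1:22220", upstream: "127.0.0.1:6379" }
])

redis = Redis.new(url: "redis://127.0.0.1:22220")

# Add ~20ms of latency (no jitter) only while the block runs.
Toxiproxy[:redis].toxic(:latency, latency: 20).apply do
  # enqueue/perform the job here while Redis calls are slowed down
  redis.ping
end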
I dug further into this and believe I worked up a reproduction script: https://gist.github.com/cgunther/718379c4786e445e75a4e62ce612a9ee
I think it’s a latency issue. Using toxiproxy with just 20ms of latency I can reliably reproduce the issue via that script; just be sure toxiproxy is running first (toxiproxy-server). Setting a 1-second lock timeout (SidekiqUniqueJobs.config.lock_timeout = 1) makes the test pass.
Ultimately I’m not sure if this is a bug or working as designed/configured. The locksmith calls pop_queued asynchronously, and without a timeout that ends up calling rpoplpush. The promise wrapping pop_queued seems to time out quickly when no timeout is configured, before rpoplpush has returned, so the gem effectively aborts; but the rpoplpush call eventually succeeds, leaving data in Redis, which then causes issues when the job is retried. Or at least that’s what I think is happening. There are notes about this in the readme for the while_executing lock.
But there still seems to be a conflict: when you specify the conflict strategy of :raise, your job is technically retried, but it seems unlikely to ever succeed due to the lingering lock, which is not exactly the “retry” behavior I’d expect. Should the gem try to clean up the lock in the case of such a timeout? Then again, I’m not sure how to ensure we’ve waited long enough for the rpoplpush to have finished before clearing the lock. That’s as far as I’ve gotten so far.
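To make the suspected race concrete, here’s a toy illustration. It is not the gem’s actual locksmith code and the key names are made up; it only shows that an async Redis call which outlives a promise timeout still completes server-side, so the caller sees a failure while Redis ends up holding the “primed” token:

require "redis"
require "concurrent"

redis = Redis.new # assumes a local Redis on the default port
redis.del("queued", "primed")
redis.lpush("queued", "lock-token")

future = Concurrent::Promises.future do
  sleep 0.2                           # stand-in for network latency
  redis.rpoplpush("queued", "primed") # the call the caller is waiting on
end

# The caller only waits 100ms, gives up, and treats the lock attempt as failed.
puts future.value(0.1).inspect # => nil

# ...but the Redis call still finishes in the background and mutates Redis.
sleep 0.5
puts redis.lrange("primed", 0, -1).inspect # => ["lock-token"]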
I have an idea, I’ll see if I can find a way to finally fix this in the next few days.
I deployed v7.1.26, so will keep an eye on things and report back.
For now, I left SidekiqUniqueJobs.config.lock_timeout = 2 in place, which probably negates the improvements you made in #728, as it seems #728 effectively ensured there was a timeout for the Redis call even when no timeout was configured. If all goes well, I’ll try without the lock_timeout config being set, to really exercise the changes in #728.
@ezekg I’ll see about adding toxiproxy and reproducing this in a test. There was a ton of work that went into stabilizing the gem, and this is one of the few problems I never got to the absolute bottom of.
I’ll make it my main priority in the next week.
Thanks for the additional info. I actually think I’m going to try switching to an until_executing locking strategy, since that seems to make more sense in my case, and perhaps it will resolve my issue here. Can’t afford to lose these jobs.
@mhenrixon, I sponsored the repo on behalf of @keygen-sh. If you have time to look into this, LMK.
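For what it’s worth, the switch itself would just be the sidekiq_options change, roughly as sketched below. Whether on_conflict: :log is the right conflict strategy for that lock in this case is an assumption worth double-checking against the readme:

class StatusRequestWorker
  include Sidekiq::Worker

  # Lock only from enqueue until the job starts executing (until_executing),
  # instead of until_and_while_executing.
  sidekiq_options lock: :until_executing, on_conflict: :log

  def perform(id)
    # ...
  end
end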