sidekiq-unique-jobs: until_and_while_executing jobs can conflict with themselves
Describe the bug
Jobs for workers that use until_and_while_executing sometimes get stuck because they appear to conflict with themselves.
Expected behavior
Executing StatusRequestWorker.perform_async(123) should run the job.
Current behavior
Note: The problem only occurs sporadically.
A SidekiqUniqueJobs::Conflict error is raised. The error message says “Item with the key: uniquejobs:0a7563aa8a2c1b4d81c6c8d02a4b3e96:RUN is already scheduled or processing”.
Further inspection of the locks reveals that the lock is taken by the job itself (JID 28b40e216188051797e08954 in this example). Since the lock is never released, the job does not execute and eventually dies once the retry attempts are exhausted.
Worker class
class StatusRequestWorker
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing, on_conflict: { client: :log, server: :raise }

  def perform(id)
    # ...
  end
end
Additional context
- Using sidekiq-unique-jobs v7.1.5
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 16 (7 by maintainers)
Commits related to this issue
- update heartbeat worker uniqueness strategy - related to https://github.com/mhenrixon/sidekiq-unique-jobs/issues/630 — committed to keygen-sh/keygen-api by ezekg 2 years ago
I think it’s a sidekiq issue, not a toxiproxy issue.
First, I think you need Sidekiq to actually perform the job inline, not just enqueue it (enqueueing is the default when including “sidekiq/testing”, i.e. fake!). I believe the root bug is around job execution, not job enqueueing, hence the importance of actually performing the job. The errors I saw in my application were always when Sidekiq tried to perform the job, never when it tried to enqueue the job.
Second, when using the Sidekiq test harness, no server middleware is run, so you're not actually executing any of the unique locking code. You need to configure test middleware separately. More info: https://github.com/mperham/sidekiq/wiki/Testing#testing-server-middleware
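Roughly, those two bits would look something like the following in a test helper. This is only a sketch, not the exact code from the PR, and I'm assuming the v7 server middleware class name SidekiqUniqueJobs::Middleware::Server, so adjust for your version:

require "sidekiq/testing"

# Actually perform jobs instead of just pushing them onto the fake queue.
Sidekiq::Testing.inline!

# The test harness skips server middleware by default, so the unique-lock
# code never runs unless it is added to the testing middleware chain.
Sidekiq::Testing.server_middleware do |chain|
  chain.add SidekiqUniqueJobs::Middleware::Server
end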
If I add those two bits to your before block in your PR, the test fails (as we're expecting). If I reduce the toxiproxy latency to 20ms and ditch the jitter (matching my repro), the test also fails. If I drop the toxiproxy latency to 15ms, the test passes, which at least proves that with little-to-no latency, things proceed without error. Progress!
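For reference, injecting that kind of latency with the toxiproxy gem can look roughly like this. It's a sketch with a made-up proxy name and port rather than the PR's actual test setup, and toxiproxy-server still needs to be running:

require "toxiproxy"
require "redis"

# One proxy in front of the local Redis; point the client at the proxy port.
Toxiproxy.populate([
  { name: "redis", listen: "127.0.0.1:22220", upstream: "127.0.0.1:6379" }
])

redis = Redis.new(url: "redis://127.0.0.1:22220")

# Add ~20ms of latency (no jitter) only while the block runs.
Toxiproxy[:redis].toxic(:latency, latency: 20).apply do
  # enqueue/perform the job here while Redis calls are slowed down
  redis.ping
end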
I dug further into this and believe I worked up a reproduction script: https://gist.github.com/cgunther/718379c4786e445e75a4e62ce612a9ee
I think it’s a latency issue. Using toxiproxy with just 20ms of latency I can reliably reproduce the issue via that script; just be sure toxiproxy is running first (toxiproxy-server). Setting a 1-second lock timeout (SidekiqUniqueJobs.config.lock_timeout = 1) makes the test pass.
Ultimately I’m not sure if this is a bug or working as designed/configured. The locksmith calls pop_queued asynchronously, and without a timeout that ends up calling rpoplpush. The promise wrapping pop_queued seems to time out quickly when no timeout is configured, before rpoplpush has returned, so the gem effectively aborts; but the rpoplpush call eventually succeeds, leaving data in Redis, which then causes issues when the job is retried. Or at least that’s what I think is happening. There are notes about this in the readme for the while_executing lock.
But there still seems to be a conflict: when you specify the conflict strategy of :raise, your job is technically retried, but it seems unlikely to ever succeed due to the lingering lock, which is not exactly the “retry” behavior I’d expect. Should the gem try to clean up the lock in the case of such a timeout? Then again, I’m not sure how to ensure we’ve waited long enough for the rpoplpush to have finished before clearing the lock. That’s as far as I’ve gotten so far.
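To make the suspected race concrete, here’s a toy illustration. It is not the gem’s actual locksmith code and the key names are made up; it only shows that an async Redis call which outlives a promise timeout still completes server-side, so the caller sees a failure while Redis ends up holding the “primed” token:

require "redis"
require "concurrent"

redis = Redis.new # assumes a local Redis on the default port
redis.del("queued", "primed")
redis.lpush("queued", "lock-token")

future = Concurrent::Promises.future do
  sleep 0.2                           # stand-in for network latency
  redis.rpoplpush("queued", "primed") # the call the caller is waiting on
end

# The caller only waits 100ms, gives up, and treats the lock attempt as failed.
puts future.value(0.1).inspect # => nil

# ...but the Redis call still finishes in the background and mutates Redis.
sleep 0.5
puts redis.lrange("primed", 0, -1).inspect # => ["lock-token"]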
I have an idea, I’ll see if I can find a way to finally fix this in the next few days.
I deployed v7.1.26, so will keep an eye on things and report back.
For now, I left SidekiqUniqueJobs.config.lock_timeout = 2 in place, which probably negates the improvements you made in #728, as it seems #728 effectively ensured there was a timeout for the Redis call even when no timeout was configured. If all goes well, I’ll try without the lock_timeout config being set, to really exercise the changes in #728.
@ezekg I’ll see about adding toxiproxy and reproducing this in a test. There was a ton of work that went into stabilizing the gem, and this is one of the few problems I never got to the absolute bottom of.
I’ll make it my main priority in the next week.
Thanks for the additional info. I actually think I’m going to try switching to an until_executing locking strategy, since that seems to make more sense in my case, and perhaps it will resolve my issue here. Can’t afford to lose these jobs.
@mhenrixon, I sponsored the repo on behalf of @keygen-sh. If you have time to look into this, LMK.
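For what it’s worth, the switch itself would just be the sidekiq_options change, roughly as sketched below. Whether on_conflict: :log is the right conflict strategy for that lock in this case is an assumption worth double-checking against the readme:

class StatusRequestWorker
  include Sidekiq::Worker

  # Lock only from enqueue until the job starts executing (until_executing),
  # instead of until_and_while_executing.
  sidekiq_options lock: :until_executing, on_conflict: :log

  def perform(id)
    # ...
  end
end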