concourse: duplicate key value violates unique constraint "pipeline_build_events_22_build_id_event_id"

Summary

Since upgrading to 7.4.1, I've been having get steps fail with the following error: save image get event: pq: duplicate key value violates unique constraint "pipeline_build_events_22_build_id_event_id"

Steps to reproduce

  1. Configure a pipeline with several concurrent jobs that each contain a resource get and trigger automatically at various times during the day.
  2. Trigger several of these jobs at the same time.
  3. Occasionally one of them fails with the duplicate key error.

Expected results

No duplicate key error.

Actual results

Duplicate key errors occur intermittently and the affected get steps fail.

Additional context

This is most likely due to the changes in #7641, which moved event ID incrementing into memory instead of using a Postgres sequence.
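
For context on the distinction drawn here, a minimal sketch of the two approaches (the sequence name build_event_id_seq and the type inMemoryEventIDs are illustrative, not Concourse's actual schema or code): a Postgres sequence serializes nextval() on the server, so concurrent writers always receive distinct IDs, whereas an in-memory counter avoids a round trip per event but is only safe if it is seeded exactly once and shared by every writer for the build.

```go
package eventids

import (
	"database/sql"
	"sync/atomic"
)

// nextEventIDFromSequence asks Postgres for the next value of a dedicated
// sequence; concurrent writers always receive distinct IDs because nextval()
// is atomic on the server side. The sequence name is made up for illustration.
func nextEventIDFromSequence(db *sql.DB) (int64, error) {
	var id int64
	err := db.QueryRow(`SELECT nextval('build_event_id_seq')`).Scan(&id)
	return id, err
}

// inMemoryEventIDs hands out IDs from a process-local counter instead. This
// saves a database round trip per event, but it is only correct if the
// counter is seeded exactly once and shared by every writer for the build.
type inMemoryEventIDs struct {
	next int64
}

func (s *inMemoryEventIDs) nextEventID() int64 {
	return atomic.AddInt64(&s.next, 1) - 1
}
```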

Triaging info

  • Concourse version: 7.4.1
  • Did this used to work: This did not happen on 7.4.0

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 30 (9 by maintainers)


Most upvoted comments

@muntac @taylorsilva I just noticed this issue. I think the root cause is not the atomic increment itself; instead, the problem is at https://github.com/concourse/concourse/blob/b8f9da5b8c00b6112a7a334b3dccdc49962859f9/atc/db/build.go#L1905-L1909.

When multiple goroutines call saveEvent at the same time, they may each call refreshEventIdSeq concurrently and end up with the same sequence ID.

On the master branch (my PR), refreshEventIdSeq is called from build.Reload(): https://github.com/concourse/concourse/blob/45d20d03d92c4650daa884861fcf21b457b352e6/atc/db/build.go#L414-L417

build.Reload() is called before the step goroutines start, which avoids the contention. So for this issue, I think we should also call refreshEventIdSeq in build.Reload().
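
To make the suggested fix concrete, here is a minimal sketch, assuming the names Reload, refreshEventIdSeq, and saveEvent from the comment above; everything else is illustrative and not Concourse's actual code. The counter is seeded once in Reload() before any step goroutines start, so concurrent saveEvent calls only ever perform an atomic increment and cannot re-seed the counter with the same value.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type build struct {
	eventIdSeq int64 // next event ID, seeded once before steps run
}

// refreshEventIdSeq stands in for reading the current max event_id from the
// build_events table; here it simply seeds the counter with a fixed value.
func (b *build) refreshEventIdSeq() {
	atomic.StoreInt64(&b.eventIdSeq, 0)
}

// Reload runs once, before the step goroutines are started, so the refresh
// cannot race with saveEvent.
func (b *build) Reload() {
	b.refreshEventIdSeq()
}

// saveEvent only performs an atomic increment; it never re-reads the DB, so
// concurrent callers always get distinct IDs.
func (b *build) saveEvent() int64 {
	return atomic.AddInt64(&b.eventIdSeq, 1) - 1
}

func main() {
	b := &build{}
	b.Reload() // single refresh up front

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("event id:", b.saveEvent())
		}()
	}
	wg.Wait() // prints 0, 1, 2, 3 in some order, never a duplicate
}
```

With the refresh done up front, the ordering of saveEvent calls no longer matters; each caller gets a distinct ID no matter how many goroutines write events concurrently.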

The PR mentioned in the OP is a partial backport of some code from https://github.com/concourse/concourse/pull/7208. This is probably affecting the currently unreleased code on master as well. We've added this to the roadmap and will dig into it.

So far so good. I’ll continue running my pipelines for the rest of the day to confirm.

I've updated and I'll report back on whether it solves the issue for me. I'm expecting that I'll also run into the deadlock issue that @qzk reported.

It looks to be tied to the number of check builds running. It works fine with only a few check builds running, but after unpausing roughly 600 pipelines, which caused about 4,000 check builds to run, the deadlock issues start appearing (whereas 7.2.0 has no problem here).


Let me know if you need more information. I'm hitting this every day and basically have to trigger new builds multiple times per job to get them to actually run.