concourse: duplicate key value violates unique constraint "pipeline_build_events_22_build_id_event_id"
Summary
Since upgrading to 7.4.1, I’ve had get steps fail with the following error:
save image get event: pq: duplicate key value violates unique constraint "pipeline_build_events_22_build_id_event_id"
Steps to reproduce
- Configure pipeline with various concurrent jobs that contain a resource get and will automatically trigger at various times during the day.
- Trigger various jobs at the same time.
- Sometimes you will get a duplicate key error.
Expected results
No duplicate key error.
Actual results
Duplicate key errors will happen.
Additional context
This is likely due to the change in #7641 that moved event ID incrementing into application memory instead of using a Postgres sequence.
Triaging info
- Concourse version: 7.4.1
- Did this used to work: This did not happen on 7.4.0
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 30 (9 by maintainers)
Commits related to this issue
- refresh event ID during reload We reload the db build object being starting the build within the engine. By refreshing the event id sequence here, we will load up the event id into memory before mult... — committed to concourse/concourse by clarafu 3 years ago
- atc: hide event id behind an interface Before, we were accessing the event ID variable within the build object in order to check if it is equal to 0. But we were only doing that in order to check if ... — committed to concourse/concourse by clarafu 3 years ago
- atc: remove for share lock on resource config I assume that we acquired a FOR SHARE lock on SELECTing from the resource configs in order to later update that resource config's last referenced column.... — committed to concourse/concourse by clarafu 3 years ago
- atc: acquire share lock when not updating last ref In the last commit I completely removed the share lock from the SELECT query for resource configs because I assumed it isnt necessary to lock it whe... — committed to concourse/concourse by clarafu 3 years ago
- switch baseline environment to 7.4.2 I want to test out 7.4.2 on scale in order to try to reproduce a deadlock bug concourse/concourse#7683 Signed-off-by: Clara Fu <fclara@vmware.com> — committed to concourse/infrastructure by clarafu 3 years ago
- a silly mistake.. [#7683] Signed-off-by: Clara Fu <fclara@vmware.com> — committed to concourse/concourse by clarafu 3 years ago
@muntac @taylorsilva I just noticed this issue. I think the root cause is not the atomic increment; instead, the problem is at https://github.com/concourse/concourse/blob/b8f9da5b8c00b6112a7a334b3dccdc49962859f9/atc/db/build.go#L1905-L1909. When multiple threads call `saveEvent` at the same time, they may call `refreshEventIdSeq` at the same time, so they get the same sequence ID. The master branch (my PR) has `refreshEventIdSeq` in `build.Reload()`: https://github.com/concourse/concourse/blob/45d20d03d92c4650daa884861fcf21b457b352e6/atc/db/build.go#L414-L417. `build.Reload()` is called before the step goroutines start, which avoids the contention problem. So, for this issue, I think we should also add `refreshEventIdSeq` to `build.Reload()`.

The PR mentioned in the OP is a partial backport of some code from https://github.com/concourse/concourse/pull/7208. This is probably affecting the currently unreleased code on master as well. We’ve added this to the roadmap and will dig into it.

So far so good. I’ll continue running my pipelines for the rest of the day to confirm.
It looks to be tied to the number of check builds running. It works fine with only a few check builds running, but after unpausing roughly 600 pipelines, which caused about 4000 check builds to run, the deadlock issues start appearing (whereas 7.2.0 has no problem here).
Let me know if you need more information. I’m hitting this every day and basically have to trigger new builds multiple times per job to get them to actually run.