concourse: duplicate key value violates unique constraint "pipeline_build_events_22_build_id_event_id"
Summary
Since upgrading to 7.4.1, I’ve had get steps fail with the following error:
save image get event: pq: duplicate key value violates unique constraint "pipeline_build_events_22_build_id_event_id"
Steps to reproduce
- Configure pipeline with various concurrent jobs that contain a resource get and will automatically trigger at various times during the day.
- Trigger various jobs at the same time.
- Sometimes you will get a duplicate key error.
Expected results
No duplicate key error.
Actual results
Duplicate key errors will happen.
Additional context
This is likely due to the change in #7641 that moved event ID incrementing into application memory instead of using a Postgres sequence.
Triaging info
- Concourse version: 7.4.1
- Did this used to work: This did not happen on 7.4.0
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 30 (9 by maintainers)
Commits related to this issue
- refresh event ID during reload We reload the db build object being starting the build within the engine. By refreshing the event id sequence here, we will load up the event id into memory before mult... — committed to concourse/concourse by clarafu 3 years ago
- atc: hide event id behind an interface Before, we were accessing the event ID variable within the build object in order to check if it is equal to 0. But we were only doing that in order to check if ... — committed to concourse/concourse by clarafu 3 years ago
- atc: remove for share lock on resource config I assume that we acquired a FOR SHARE lock on SELECTing from the resource configs in order to later update that resource config's last referenced column.... — committed to concourse/concourse by clarafu 3 years ago
- atc: acquire share lock when not updating last ref In the last commit I completely removed the share lock from the SELECT query for resource configs because I assumed it isnt necessary to lock it whe... — committed to concourse/concourse by clarafu 3 years ago
- switch baseline environment to 7.4.2 I want to test out 7.4.2 on scale in order to try to reproduce a deadlock bug concourse/concourse#7683 Signed-off-by: Clara Fu <fclara@vmware.com> — committed to concourse/infrastructure by clarafu 3 years ago
- a silly mistake.. [#7683] Signed-off-by: Clara Fu <fclara@vmware.com> — committed to concourse/concourse by clarafu 3 years ago
@muntac @taylorsilva I just noticed this issue. I think the root cause is not the atomic increment; instead, the problem is at https://github.com/concourse/concourse/blob/b8f9da5b8c00b6112a7a334b3dccdc49962859f9/atc/db/build.go#L1905-L1909. When multiple threads call `saveEvent` at the same time, they may call `refreshEventIdSeq` at the same time, so they get the same sequence ID. The master branch (my PR) has `refreshEventIdSeq` in `build.Reload()`: https://github.com/concourse/concourse/blob/45d20d03d92c4650daa884861fcf21b457b352e6/atc/db/build.go#L414-L417. `build.Reload()` is called before the step goroutines start, which avoids the contention problem. So, for this issue, I think we should also add `refreshEventIdSeq` to `build.Reload()`.

The PR mentioned in the OP is a partial backport of some code from https://github.com/concourse/concourse/pull/7208. This is probably affecting the currently unreleased code on master as well. We’ve added this to the roadmap and will dig into it.

So far so good. I’ll continue running my pipelines for the rest of the day to confirm.
It looks to be tied to the number of check builds running. It works fine with only a few check builds running, but after unpausing roughly 600 pipelines, which caused about 4000 check builds to run, the deadlock issues start appearing (whereas 7.2.0 has no problem here).
Let me know if you need more information. I’m hitting this every day and basically have to trigger new builds multiple times per job to get them to actually run.