concourse: Resource checking should be resilient to a 'stuck' worker
What challenge are you facing?
Prod had one of its Concourse resources stop checking for over four days. Investigating the stack dump revealed that a lock was held open and that the goroutine holding it was stuck on this line:
goroutine 1021 [chan receive, 5834 minutes]:
github.com/concourse/concourse/atc/resource.(*resource).runScript(0xc039940220, 0x2607d20, 0xc01888ae40, 0x1ced6e0, 0x13, 0x0, 0x0, 0x0, 0x1b073c0, 0xc03959f700, ...)
/tmp/build/70f2e240/concourse/atc/resource/run_script.go:136 +0x724
github.com/concourse/concourse/atc/resource.(*resource).Check(0xc039940220, 0x2607d20, 0xc01888ae40, 0xc0043bc480, 0xc004e71710, 0xc03959f6f0, 0x38384c8, 0x25e54e0, 0xc02c816900, 0x1cd7271)
/tmp/build/70f2e240/concourse/atc/resource/resource_check.go:17 +0x11c
github.com/concourse/concourse/atc/radar.(*resourceScanner).check(0xc000e2cc60, 0x2627140, 0xc018487920, 0x2656480, 0xc046598f20, 0x26347a0, 0xc035b89f20, 0xc004e71710, 0xc01100c000, 0x8, ...)
/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:360 +0xbeb
github.com/concourse/concourse/atc/radar.(*resourceScanner).scan(0xc000e2cc60, 0x2627140, 0xc018487920, 0x765, 0x0, 0x0, 0x0, 0x0, 0x0)
/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:251 +0x7d6
github.com/concourse/concourse/atc/radar.(*resourceScanner).Run(0xc000e2cc60, 0x2627140, 0xc000e8dec0, 0x765, 0xdf8475801, 0x25c7ce0, 0xc00022e940)
/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:63 +0x97
github.com/concourse/concourse/atc/radar.(*intervalRunner).Run(0xc000a36b40, 0x2607ca0, 0xc000cee800, 0x0, 0x0)
/tmp/build/70f2e240/concourse/atc/radar/interval_runner.go:68 +0x2d6
github.com/concourse/concourse/atc/radar.(*Runner).scanResources.func1(0xc000c7dbd0, 0x2627140, 0xc000e8cd80, 0x2607ca0, 0xc000cee800, 0x2656480, 0xc0015998c0, 0xc0003e8200, 0x10)
/tmp/build/70f2e240/concourse/atc/radar/runner.go:118 +0x11d
created by github.com/concourse/concourse/atc/radar.(*Runner).scanResources
/tmp/build/70f2e240/concourse/atc/radar/runner.go:113 +0x336
(full stack available here)
What would make this better?
It looks like the check got stuck waiting for the process to exit after enforcing the default resource check timeout (1 hour). While stuck in this way, the lock is still held.
We probably do this to ensure the goroutine calling .Wait eventually exits. In this case, though, the worker was pretty broken and seems to have never returned the exit status. Supposedly the worker was even re-created, so it’s kind of odd that things got stuck instead of eventually erroring or something from the worker being destroyed.
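For context, the pattern described above looks roughly like the following. This is a minimal sketch, not the actual `runScript` code; the function name and channel wiring are illustrative, but the key point matches the stack trace: after the timeout fires, the code blocks again on a channel receive that only completes once the worker reports an exit status.

```go
package sketch

import (
	"context"

	"code.cloudfoundry.org/garden"
)

// Hypothetical sketch of the pattern described above (not the actual
// runScript code): the check timeout interrupts the select, but we then
// block again waiting for the worker to report the exit status so that the
// Wait goroutine can finish.
func runScriptSketch(ctx context.Context, process garden.Process) (int, error) {
	var status int
	var waitErr error
	exited := make(chan struct{})

	go func() {
		// Blocks until the worker reports an exit status.
		status, waitErr = process.Wait()
		close(exited)
	}()

	select {
	case <-exited:
		return status, waitErr

	case <-ctx.Done(): // the default 1h resource check timeout fires here
		_ = process.Signal(garden.SignalTerminate)

		// Wait for Wait() to return so the goroutine above can exit. If the
		// worker is wedged and never reports an exit status, this receive
		// blocks forever -- the "chan receive, 5834 minutes" in the stack
		// dump -- while the resource-checking lock is still held.
		<-exited
		return status, waitErr
	}
}
```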
- I would like to start with reproducing the issue first. Can we simulate a ‘stuck’ worker and replicate this failure mode? Even better, can a TopGun test be written?
- Can we enforce the timeout more cleanly? It would be nice if we could just give up client-side, release all client resources (goroutines, in-flight requests, etc.) and leave everything else up to the worker, if it’s even capable of cleaning up its side of things. Right now the code will handle a legitimately hanging check process pretty cleanly, but if the problem is more fundamental our ‘giving up’ code…doesn’t give up. (See the sketch after this list.)
- Should we shorten the default check timeout? 1 hour is super long. 🤔
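One possible shape for the "give up client-side" behaviour, sketched here purely as an illustration (the 30s grace period and the function name are made up, not Concourse's actual implementation):

```go
package sketch

import (
	"context"
	"time"

	"code.cloudfoundry.org/garden"
)

// Hypothetical sketch of giving up client-side: after the check context
// expires, allow a short grace period for the worker to report the exit
// status, but never block indefinitely on a wedged worker.
func runAndGiveUp(ctx context.Context, process garden.Process) (int, error) {
	type result struct {
		status int
		err    error
	}
	// Buffered so the Wait goroutine can always complete, even if nobody
	// is left to receive from it.
	exited := make(chan result, 1)

	go func() {
		status, err := process.Wait()
		exited <- result{status, err}
	}()

	select {
	case res := <-exited:
		return res.status, res.err

	case <-ctx.Done():
		// Best-effort: ask the worker to stop the process.
		_ = process.Signal(garden.SignalTerminate)

		select {
		case res := <-exited:
			return res.status, res.err
		case <-time.After(30 * time.Second):
			// Give up: release the lock and client-side resources, and
			// leave any remaining cleanup to the worker.
			return 0, ctx.Err()
		}
	}
}
```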
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 18 (18 by maintainers)
Commits related to this issue
- [#3928] Resource checking is resilient to a `stuck` worker - Garden Client requests have a 5 min timeout for short lived http requests - Extend Container Run & Attach with ctx that will underlying... — committed to concourse/concourse by xtreme-sameer-vohra 5 years ago
- [#3928] add context to comply with signature for task_step — committed to concourse/concourse by xtreme-sameer-vohra 5 years ago
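Roughly, the approach described in those commit messages splits Garden traffic into short-lived calls, which get a hard client timeout, and long-lived Run/Attach streams, which are bounded by a caller-supplied context instead. A sketch under those assumptions (the client names and the `attach` helper are made up for illustration):

```go
package sketch

import (
	"context"
	"io"
	"net/http"
	"time"
)

// Short-lived Garden API calls (listing containers, reading properties, ...)
// get a hard 5 minute timeout: a wedged worker fails the call instead of
// hanging the checker forever.
var shortLivedClient = &http.Client{Timeout: 5 * time.Minute}

// Run/Attach streams can legitimately outlive 5 minutes, so they get no
// client-wide timeout; they are cancelled through the caller's context.
var streamingClient = &http.Client{}

// attach is illustrative only: it shows a streaming request that returns as
// soon as ctx is cancelled or its deadline passes.
func attach(ctx context.Context, url string, body io.Reader) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, body)
	if err != nil {
		return nil, err
	}
	return streamingClient.Do(req)
}
```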
This is WIP
List of interface methods that interact with Worker during a Resource Scanner check that will need a `context` passed in

case 1: ctx timeout occurs and pipeline fails as it's been too long

Observations
- `no workers` error if execution had progressed to selecting a `chosenWorker`.

Conclusions
- `no workers` is bubbled up.

case 1a: garden blocks a request indefinitely & ctx timeout occurs

Observations
- `ctx deadline exceeded` (used `toxiproxy` to introduce infinite latency)

Conclusions

case 2: retry backoff timeout occurs and pipeline fails as it's been too long

Observations

Conclusions
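One cheap way to reproduce case 1a without a real worker is to point a context-bound request at a server that never responds; the investigation above used toxiproxy to introduce infinite latency, but the effect is the same. Everything in this sketch (the fake server, the 2-second deadline) is made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Stand-in for a wedged Garden API: accepts the request, never answers.
	stuck := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-r.Context().Done()
	}))
	defer stuck.Close()

	// The client-side deadline plays the role of the resource check timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, stuck.URL, nil)
	_, err := http.DefaultClient.Do(req)

	// With the context wired through, the call fails with a "context
	// deadline exceeded" error instead of hanging and holding the lock.
	fmt.Println(err)
}
```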
talked to garden folks, they’re open to us making a pr for contexts in the garden client.
if we’re going to make the switch to containerd, i’d wait for that to happen since its client is built properly with contexts. https://github.com/containerd/containerd/blob/master/client.go
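For comparison, every call on the containerd client linked above takes a context, so a deadline set by the caller propagates into the request itself. A rough illustration only; the socket path, namespace, and timeout are example values:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Every API call takes a ctx, so a check timeout is just a deadline on
	// the context -- no wrapper goroutines or hard HTTP timeouts needed.
	ctx := namespaces.WithNamespace(context.Background(), "concourse")
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()

	containers, err := client.Containers(ctx)
	if err != nil {
		// A stuck daemon surfaces here as "context deadline exceeded".
		log.Fatal(err)
	}
	fmt.Println("containers:", len(containers))
}
```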