concourse: Resource checking should be resilient to a 'stuck' worker

What challenge are you facing?

Prod had its concourse resource stop checking for over four days. Investigating the stack dump revealed that a lock was held open and the goroutine holding it was stuck on this line:

https://github.com/concourse/concourse/blob/8d6219d0e1d82ef897838d229dbe4394dd31ecef/atc/resource/run_script.go#L136

goroutine 1021 [chan receive, 5834 minutes]:
github.com/concourse/concourse/atc/resource.(*resource).runScript(0xc039940220, 0x2607d20, 0xc01888ae40, 0x1ced6e0, 0x13, 0x0, 0x0, 0x0, 0x1b073c0, 0xc03959f700, ...)
	/tmp/build/70f2e240/concourse/atc/resource/run_script.go:136 +0x724
github.com/concourse/concourse/atc/resource.(*resource).Check(0xc039940220, 0x2607d20, 0xc01888ae40, 0xc0043bc480, 0xc004e71710, 0xc03959f6f0, 0x38384c8, 0x25e54e0, 0xc02c816900, 0x1cd7271)
	/tmp/build/70f2e240/concourse/atc/resource/resource_check.go:17 +0x11c
github.com/concourse/concourse/atc/radar.(*resourceScanner).check(0xc000e2cc60, 0x2627140, 0xc018487920, 0x2656480, 0xc046598f20, 0x26347a0, 0xc035b89f20, 0xc004e71710, 0xc01100c000, 0x8, ...)
	/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:360 +0xbeb
github.com/concourse/concourse/atc/radar.(*resourceScanner).scan(0xc000e2cc60, 0x2627140, 0xc018487920, 0x765, 0x0, 0x0, 0x0, 0x0, 0x0)
	/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:251 +0x7d6
github.com/concourse/concourse/atc/radar.(*resourceScanner).Run(0xc000e2cc60, 0x2627140, 0xc000e8dec0, 0x765, 0xdf8475801, 0x25c7ce0, 0xc00022e940)
	/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:63 +0x97
github.com/concourse/concourse/atc/radar.(*intervalRunner).Run(0xc000a36b40, 0x2607ca0, 0xc000cee800, 0x0, 0x0)
	/tmp/build/70f2e240/concourse/atc/radar/interval_runner.go:68 +0x2d6
github.com/concourse/concourse/atc/radar.(*Runner).scanResources.func1(0xc000c7dbd0, 0x2627140, 0xc000e8cd80, 0x2607ca0, 0xc000cee800, 0x2656480, 0xc0015998c0, 0xc0003e8200, 0x10)
	/tmp/build/70f2e240/concourse/atc/radar/runner.go:118 +0x11d
created by github.com/concourse/concourse/atc/radar.(*Runner).scanResources
	/tmp/build/70f2e240/concourse/atc/radar/runner.go:113 +0x336

(full stack available here)

What would make this better?

It looks like it got stuck waiting for the process to exit after enforcing the default resource check timeout (1 hour). While it is stuck this way, the lock is still being held.

We probably do this to ensure the goroutine calling .Wait eventually exits. In this case, though, the worker was pretty broken and seems to have never returned the exit status. Supposedly the worker was even re-created, so it's kind of odd that things got stuck instead of eventually erroring out as a result of the worker being destroyed.
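
For illustration, here is a minimal sketch of the shape that produces this hang (the names and types are made up, not the actual run_script.go code): even after the timeout fires, the code still blocks on a channel receive waiting for the worker to report the process's exit status.

package check

import "context"

// exitResult is a hypothetical stand-in for what the worker sends back when
// the check process exits.
type exitResult struct {
	status int
	err    error
}

// waitForExit sketches the failure shape: when the check timeout fires we
// fall into the ctx.Done() branch, but then block unconditionally on the
// exit channel. A worker that never reports an exit status leaves this
// goroutine, and the lock its caller holds, stuck indefinitely.
func waitForExit(ctx context.Context, exited <-chan exitResult) error {
	select {
	case res := <-exited:
		return res.err
	case <-ctx.Done():
		// The real code signals the process at this point; on a healthy
		// worker the exit status then arrives promptly.
		res := <-exited // blocks forever if the worker never answers
		return res.err
	}
}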

  1. I would like to start by reproducing the issue. Can we simulate a 'stuck' worker and replicate this failure mode? Even better, can a TopGun test be written?
  2. Can we enforce the timeout more cleanly? It would be nice if we could just give up client-side, release all client resources (goroutines, in-flight requests, etc.), and leave everything else up to the worker, if it's even capable of cleaning up its own side of things. Right now the code handles a legitimately hanging check process pretty cleanly, but if the problem is more fundamental our 'giving up' code…doesn't give up. (A sketch of what giving up client-side could look like follows this list.)
  3. Should we shorten the default check timeout? 1 hour is super long. 🤔
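
Continuing the sketch above (again with made-up names, not the real code), "giving up client-side" in the sense of point 2 would mean returning as soon as the context is done and leaving any remaining cleanup to the worker:

// waitForExitGivingUp is the "give up client-side" variant of waitForExit
// above: when the context is done we stop waiting, propagate the deadline
// error so the caller can release its lock, and leave whatever remains of
// the check process for the worker to clean up, if it can.
func waitForExitGivingUp(ctx context.Context, exited <-chan exitResult) error {
	select {
	case res := <-exited:
		return res.err
	case <-ctx.Done():
		// A real fix would likely also signal the process here as a best
		// effort, but crucially it would not block waiting for an answer.
		return ctx.Err()
	}
}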

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 18 (18 by maintainers)

Most upvoted comments

This is WIP

List of interface methods that interact with the Worker during a Resource Scanner check that will need a context passed in (a rough sketch of the resulting signatures follows the list):

  • worker.FindOrCreateContainer
    • gardenClient.Lookup
      • gardenClient.Connection.List
    • gardenClient.Create
    • volumeClient.FindOrCreateVolumeForContainer
      • volumeClient.LookupVolume
      • baggageclaimClient.LookupVolume
      • baggageclaimClient.CreateVolume
  • resource.Check
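
As a rough illustration of that plumbing (the signatures below are simplified placeholders, not the actual Concourse interfaces), each of these methods would take a context derived from the check's deadline:

package worker

import "context"

// Placeholder types standing in for the real Concourse, garden and
// baggageclaim structures; only the position of the context matters here.
type (
	ContainerSpec struct{}
	Container     interface{}
	Volume        interface{}
	Source        map[string]interface{}
	Version       map[string]string
)

// Illustrative signatures only: every call that can reach the worker accepts
// a context, so a stuck garden or baggageclaim request surfaces as
// context.DeadlineExceeded instead of blocking a goroutine forever.
type Worker interface {
	FindOrCreateContainer(ctx context.Context, spec ContainerSpec) (Container, error)
}

type VolumeClient interface {
	LookupVolume(ctx context.Context, handle string) (Volume, bool, error)
	FindOrCreateVolumeForContainer(ctx context.Context, spec ContainerSpec) (Volume, error)
}

type Resource interface {
	Check(ctx context.Context, source Source, from Version) ([]Version, error)
}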

case 1: ctx timeout occurs and the pipeline fails as it's been too long

Observations

  • also had to change the heartbeat interval to be greater than the ctx timeout; otherwise a 'no workers' error was received if execution had progressed to selecting a chosenWorker.
  • had to change the retryer to account for ctx deadline exceeded so it doesn't remain in an infinite loop (see the sketch after this case)

Conclusions

  • On the stuck worker, heartbeating had not failed; otherwise a 'no workers' error would have bubbled up.
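
A minimal sketch of that retryer change (not the actual Concourse retryer): the loop gives up once the context's deadline has been exceeded instead of retrying forever.

package retryer

import (
	"context"
	"errors"
	"time"
)

// retryUntilDone keeps retrying op, but gives up once the check's context
// is done or op reports that the deadline was exceeded, rather than
// spinning forever against a stuck worker.
func retryUntilDone(ctx context.Context, interval time.Duration, op func() error) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		if errors.Is(err, context.DeadlineExceeded) || ctx.Err() != nil {
			return err
		}
		select {
		case <-time.After(interval):
			// retry after the backoff interval
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}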

case 1a: garden blocks a request indefinitely & ctx timeout occurs

Observations

  • returns error ctx deadline exceeded
  • used toxiproxy to introduce infinite latency

Conclusions

  • without the ctx timeout, the execution is blocked indefinitely (this seems to correspond with the observed behaviour in the bug report)
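
For reference, a small standalone demonstration (not Concourse code) of the same effect: a request the server never answers blocks forever without a deadline, and surfaces a context deadline exceeded error once the caller's context is honoured.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A server that never responds, standing in for a garden API stuck
	// behind toxiproxy's infinite latency.
	stuck := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-r.Context().Done() // hold the request open until the client gives up
	}))
	defer stuck.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, stuck.URL, nil)
	if err != nil {
		panic(err)
	}

	// Without the context deadline this call would block indefinitely.
	_, err = http.DefaultClient.Do(req)
	fmt.Println(err) // reports the context deadline being exceeded
}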

case 2: retry backoff timeout occurs and the pipeline fails as it's been too long

Observations

  • it does give up after 5 mins (~23 retries) and bubbles the error up.

Conclusions

  • So this must imply that the issue was a request to garden that hung indefinitely. Need to test 1a to confirm.

Talked to the garden folks; they're open to us making a PR to add contexts to the garden client.

If we're going to make the switch to containerd, I'd wait for that to happen, since its client is built properly around contexts: https://github.com/containerd/containerd/blob/master/client.go
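
For comparison, a minimal sketch of the containerd Go client, where every call takes a context (the socket path and namespace below are illustrative defaults, not Concourse configuration):

package main

import (
	"context"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Every API call takes the context, so a stuck daemon surfaces as
	// context.DeadlineExceeded instead of blocking the caller indefinitely.
	ctx, cancel := context.WithTimeout(
		namespaces.WithNamespace(context.Background(), "concourse"),
		30*time.Second,
	)
	defer cancel()

	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("found %d containers", len(containers))
}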