concourse: Resource checking should be resilient to a 'stuck' worker

What challenge are you facing?

Prod had its concourse resource stop checking for over four days. Investigating the stack dump revealed that a lock was held open and the goroutine holding it was stuck on this line:

https://github.com/concourse/concourse/blob/8d6219d0e1d82ef897838d229dbe4394dd31ecef/atc/resource/run_script.go#L136

goroutine 1021 [chan receive, 5834 minutes]:
github.com/concourse/concourse/atc/resource.(*resource).runScript(0xc039940220, 0x2607d20, 0xc01888ae40, 0x1ced6e0, 0x13, 0x0, 0x0, 0x0, 0x1b073c0, 0xc03959f700, ...)
	/tmp/build/70f2e240/concourse/atc/resource/run_script.go:136 +0x724
github.com/concourse/concourse/atc/resource.(*resource).Check(0xc039940220, 0x2607d20, 0xc01888ae40, 0xc0043bc480, 0xc004e71710, 0xc03959f6f0, 0x38384c8, 0x25e54e0, 0xc02c816900, 0x1cd7271)
	/tmp/build/70f2e240/concourse/atc/resource/resource_check.go:17 +0x11c
github.com/concourse/concourse/atc/radar.(*resourceScanner).check(0xc000e2cc60, 0x2627140, 0xc018487920, 0x2656480, 0xc046598f20, 0x26347a0, 0xc035b89f20, 0xc004e71710, 0xc01100c000, 0x8, ...)
	/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:360 +0xbeb
github.com/concourse/concourse/atc/radar.(*resourceScanner).scan(0xc000e2cc60, 0x2627140, 0xc018487920, 0x765, 0x0, 0x0, 0x0, 0x0, 0x0)
	/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:251 +0x7d6
github.com/concourse/concourse/atc/radar.(*resourceScanner).Run(0xc000e2cc60, 0x2627140, 0xc000e8dec0, 0x765, 0xdf8475801, 0x25c7ce0, 0xc00022e940)
	/tmp/build/70f2e240/concourse/atc/radar/resource_scanner.go:63 +0x97
github.com/concourse/concourse/atc/radar.(*intervalRunner).Run(0xc000a36b40, 0x2607ca0, 0xc000cee800, 0x0, 0x0)
	/tmp/build/70f2e240/concourse/atc/radar/interval_runner.go:68 +0x2d6
github.com/concourse/concourse/atc/radar.(*Runner).scanResources.func1(0xc000c7dbd0, 0x2627140, 0xc000e8cd80, 0x2607ca0, 0xc000cee800, 0x2656480, 0xc0015998c0, 0xc0003e8200, 0x10)
	/tmp/build/70f2e240/concourse/atc/radar/runner.go:118 +0x11d
created by github.com/concourse/concourse/atc/radar.(*Runner).scanResources
	/tmp/build/70f2e240/concourse/atc/radar/runner.go:113 +0x336

(full stack available here)

What would make this better?

It looks like it got stuck waiting for the process to exit after enforcing the default resource check timeout (1 hour). While it is stuck this way, the lock is still being held.

We probably do this to ensure the goroutine calling .Wait eventually exits. In this case, though, the worker was pretty broken and seems to have never returned the exit status. Supposedly the worker was even re-created, so it's kind of odd that things got stuck instead of eventually erroring out as a result of the worker being destroyed.
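
For illustration, here is a minimal sketch of the shape that produces this hang (the names and types are made up, not the actual run_script.go code): even after the timeout fires, the code still blocks on a channel receive waiting for the worker to report the process's exit status.

package check

import "context"

// exitResult is a hypothetical stand-in for what the worker sends back when
// the check process exits.
type exitResult struct {
	status int
	err    error
}

// waitForExit sketches the failure shape: when the check timeout fires we
// fall into the ctx.Done() branch, but then block unconditionally on the
// exit channel. A worker that never reports an exit status leaves this
// goroutine, and the lock its caller holds, stuck indefinitely.
func waitForExit(ctx context.Context, exited <-chan exitResult) error {
	select {
	case res := <-exited:
		return res.err
	case <-ctx.Done():
		// The real code signals the process at this point; on a healthy
		// worker the exit status then arrives promptly.
		res := <-exited // blocks forever if the worker never answers
		return res.err
	}
}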

  1. I would like to start by reproducing the issue. Can we simulate a 'stuck' worker and replicate this failure mode? Even better, can a TopGun test be written?
  2. Can we enforce the timeout more cleanly? It would be nice if we could just give up client-side, release all client resources (goroutines, in-flight requests, etc.), and leave everything else up to the worker, if it's even capable of cleaning up its own side of things. Right now the code handles a legitimately hanging check process pretty cleanly, but if the problem is more fundamental our 'giving up' code…doesn't give up. (A sketch of what giving up client-side could look like follows this list.)
  3. Should we shorten the default check timeout? 1 hour is super long. 🤔
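
Continuing the sketch above (again with made-up names, not the real code), "giving up client-side" in the sense of point 2 would mean returning as soon as the context is done and leaving any remaining cleanup to the worker:

// waitForExitGivingUp is the "give up client-side" variant of waitForExit
// above: when the context is done we stop waiting, propagate the deadline
// error so the caller can release its lock, and leave whatever remains of
// the check process for the worker to clean up, if it can.
func waitForExitGivingUp(ctx context.Context, exited <-chan exitResult) error {
	select {
	case res := <-exited:
		return res.err
	case <-ctx.Done():
		// A real fix would likely also signal the process here as a best
		// effort, but crucially it would not block waiting for an answer.
		return ctx.Err()
	}
}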

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 18 (18 by maintainers)

Most upvoted comments

This is WIP

List of interface methods that interact with the Worker during a Resource Scanner check that will need a context passed in (a rough sketch of the resulting signatures follows the list):

  • worker.FindOrCreateContainer
    • gardenClient.Lookup
      • gardenClient.Connection.List
    • gardenClient.Create
    • volumeClient.FindOrCreateVolumeForContainer
      • volumeClient.LookupVolume
      • baggageclaimClient.LookupVolume
      • baggageclaimClient.CreateVolume
  • resource.Check
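
As a rough illustration of that plumbing (the signatures below are simplified placeholders, not the actual Concourse interfaces), each of these methods would take a context derived from the check's deadline:

package worker

import "context"

// Placeholder types standing in for the real Concourse, garden and
// baggageclaim structures; only the position of the context matters here.
type (
	ContainerSpec struct{}
	Container     interface{}
	Volume        interface{}
	Source        map[string]interface{}
	Version       map[string]string
)

// Illustrative signatures only: every call that can reach the worker accepts
// a context, so a stuck garden or baggageclaim request surfaces as
// context.DeadlineExceeded instead of blocking a goroutine forever.
type Worker interface {
	FindOrCreateContainer(ctx context.Context, spec ContainerSpec) (Container, error)
}

type VolumeClient interface {
	LookupVolume(ctx context.Context, handle string) (Volume, bool, error)
	FindOrCreateVolumeForContainer(ctx context.Context, spec ContainerSpec) (Volume, error)
}

type Resource interface {
	Check(ctx context.Context, source Source, from Version) ([]Version, error)
}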

case 1: ctx timeout occurs and the pipeline fails as it's been too long

Observations

  • also had to change the heartbeat interval to be greater than the ctx timeout; otherwise a 'no workers' error was received if execution had progressed to selecting a chosenWorker.
  • had to change the retryer to account for ctx deadline exceeded so it doesn't remain in an infinite loop (see the sketch after this case)

Conclusions

  • On the stuck worker, heartbeating had not failed; otherwise a 'no workers' error would have bubbled up.
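
A minimal sketch of that retryer change (not the actual Concourse retryer): the loop gives up once the context's deadline has been exceeded instead of retrying forever.

package retryer

import (
	"context"
	"errors"
	"time"
)

// retryUntilDone keeps retrying op, but gives up once the check's context
// is done or op reports that the deadline was exceeded, rather than
// spinning forever against a stuck worker.
func retryUntilDone(ctx context.Context, interval time.Duration, op func() error) error {
	for {
		err := op()
		if err == nil {
			return nil
		}
		if errors.Is(err, context.DeadlineExceeded) || ctx.Err() != nil {
			return err
		}
		select {
		case <-time.After(interval):
			// retry after the backoff interval
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}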

case 1a: garden blocks a request indefinitely & ctx timeout occurs

Observations

  • returns error ctx deadline exceeded
  • used toxiproxy to introduce infinite latency

Conclusions

  • without the ctx timeout, the execution is blocked indefinitely (this seems to correspond with the observed behaviour in the bug report)
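
For reference, a small standalone demonstration (not Concourse code) of the same effect: a request the server never answers blocks forever without a deadline, and surfaces a context deadline exceeded error once the caller's context is honoured.

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A server that never responds, standing in for a garden API stuck
	// behind toxiproxy's infinite latency.
	stuck := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-r.Context().Done() // hold the request open until the client gives up
	}))
	defer stuck.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, stuck.URL, nil)
	if err != nil {
		panic(err)
	}

	// Without the context deadline this call would block indefinitely.
	_, err = http.DefaultClient.Do(req)
	fmt.Println(err) // reports the context deadline being exceeded
}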

case 2: retry backoff timeout occurs and the pipeline fails as it's been too long

Observations

  • it does give up after 5 mins (~23 retries) and bubbles the error up.

Conclusions

  • So this must imply that the issue was a request to garden that hung indefinitely. Need to test 1a to confirm.

Talked to the garden folks; they're open to us making a PR to add contexts to the garden client.

If we're going to make the switch to containerd, I'd wait for that to happen, since its client is built properly around contexts: https://github.com/containerd/containerd/blob/master/client.go
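
For comparison, a minimal sketch of the containerd Go client, where every call takes a context (the socket path and namespace below are illustrative defaults, not Concourse configuration):

package main

import (
	"context"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Every API call takes the context, so a stuck daemon surfaces as
	// context.DeadlineExceeded instead of blocking the caller indefinitely.
	ctx, cancel := context.WithTimeout(
		namespaces.WithNamespace(context.Background(), "concourse"),
		30*time.Second,
	)
	defer cancel()

	containers, err := client.Containers(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("found %d containers", len(containers))
}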