concourse: Aggregate issue: builds stuck in "pending" state

There have been a few reports lately of jobs being unable to schedule. It’s been difficult for us to make progress on this, as we never see it ourselves and there’s generally not enough information provided to find a smoking gun. That’s no one’s fault, really - we haven’t given great instructions as to what information would be valuable and how to collect it! So, that’s what this issue is for.

If you’re seeing this problem, please include a screenshot of the build preparation section of the build view.

Also answer the following (a rough sketch of commands for collecting this information follows the list):

  • Are all of your workers present? (fly workers)
  • Is there a check container present for each of the inputs to your job? (fly hijack -c pipeline-name/resource-name)
    • Do any of the check containers have a running /opt/resource/check process? If so, that may be hanging. What resource type is it?
  • What is the uptime of your workers and ATC?
  • Are your workers registered directly (BOSH default) or forwarded through the TSA (binary default/external workers registering with BOSH deployment)?
  • Which IaaS?
    • If you’re on GCP, have you configured the MTU of your workers to be 1460 to match the VM? If not, it defaults to 1500, which would cause things to hang.
    • Can you reach the workers from your ATC? (curl http://<worker ip:port>/containers) You can collect the IP + port from fly workers -d.
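
A minimal sketch of collecting the above, assuming a fly target named ci; the pipeline and resource names are placeholders, so substitute your own:

  # Are all workers present? The -d flag also shows each worker's garden address (IP:port).
  fly -t ci workers -d

  # Is there a check container for each input? Pipeline/resource names are placeholders.
  fly -t ci hijack -c my-pipeline/my-resource

  # Inside the hijacked container: is a check process hanging?
  # (ps may not exist in every resource image; this is just one way to look.)
  ps aux | grep /opt/resource/check

  # On GCP workers: the interface MTU should be 1460 to match the VM, not the 1500 default.
  ip link show

  # From the ATC VM: can it reach the worker's garden address shown by fly workers -d?
  curl http://<worker-ip:port>/containers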

In addition to that, the most valuable information will be stack dumps of both the ATC and the TSA.

You can collect a stack dump from the ATC by running:

curl http://127.0.0.1:8079/debug/pprof/goroutine?debug=2

…and from the TSA by sending SIGQUIT and collecting the output from stderr. Note that if you’re running the binaries, the above curl command will include the TSA’s stack, so don’t worry about getting it separately. Also note that SIGQUIT will kill the TSA process, so you’ll need to bring it back after. (While you’re at it, let us know if that fixed it. 😛)
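
Concretely, something like the following, run on the ATC/TSA VM; the process name passed to pgrep and the output path are assumptions, so adjust them for your deployment:

  # ATC stack dump (when running the binaries, this also includes the TSA's goroutines):
  curl "http://127.0.0.1:8079/debug/pprof/goroutine?debug=2" > atc_stack.txt

  # TSA stack dump: SIGQUIT makes the Go runtime dump all goroutines to stderr and exit,
  # so capture the TSA's stderr log and restart the process afterwards.
  kill -QUIT "$(pgrep -f tsa)"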

Thanks all for your patience, sorry we’re taking so long to get to the bottom of this.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 39
  • Comments: 52 (23 by maintainers)

Most upvoted comments

For us this is definitely triggered by #750. Whenever we lose a worker with a deadlocked SSH connection, we usually have 2-3 jobs hanging in pending with everything checked, like this or this.

Each time, at least one check container for the job’s inputs is missing. The only way to bring back the missing containers seems to be restarting the web component.

My assumption is that the missing check containers were located on the missing worker with the deadlocked SSH connection on the TSA side. While resolving #750 would likely fix this issue for us, I think Concourse should be able to recover from this even without it. The worker with the stuck connection is long gone, and so is the check container. The scheduler should in principle be able to reschedule the check container on a different worker.

I’m going to close this, as we fixed the underlying root cause for the many reports we received at the time. All the recent reports appear to be either a misconfiguration or some other problem where Concourse itself is working as expected (the build is pending and it tells you why, e.g. there are no versions, which is up to you to solve).
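
As an aside, if a build is pending only because a resource has no versions yet, you can trigger a check by hand rather than waiting for the next check interval; a sketch, with a placeholder target and pipeline/resource names:

  fly -t ci check-resource -r my-pipeline/my-resource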

I ran into this same issue today (stuck on “discovering new versions”) on a BOSH-deployed instance of Concourse 2.7.3 and noticed 1 of the 3 workers had a very high load. I was able to resolve the problem by manually SSHing into the offending worker and restarting it.
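
On a BOSH deployment, that restart might look roughly like the following; the deployment name, instance name, and the use of monit restart all are assumptions about this particular setup:

  # SSH to the affected worker VM (deployment and instance names are placeholders).
  bosh -d concourse ssh worker/0

  # On the VM, list and restart the Concourse jobs it runs.
  sudo /var/vcap/bosh/bin/monit summary
  sudo /var/vcap/bosh/bin/monit restart all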

Here’s a snippet of the container load before restarting:

name                                  containers  platform  tags  team  state
98b9dcfa-e345-4317-bcbf-ba5cbfbb622a  15          linux     none  none  running
a1435e37-a2e3-40e0-98fa-2c257f7666d2  74          linux     none  none  running
afa869ea-e91c-4054-918d-3e05b6034053  12          linux     none  none  running

And an image of the worker load: concourse_workers_hanging

After the worker came back up it seems like the container load was properly distributed again:

name                                  containers  platform  tags  team  state
98b9dcfa-e345-4317-bcbf-ba5cbfbb622a  72          linux     none  none  running
a1435e37-c887-447e-9862-49ca3a5bd217  48          linux     none  none  running
afa869ea-e91c-4054-918d-3e05b6034053  64          linux     none  none  running

Hello, Concourse folks! We’ve been seeing this one with increasing regularity since upgrading to the most recent release (but we came from 1.6 so 🤷‍♂️). Anyway, here are the requested diagnostics. Please let us know if we can provide any more information or help out. Thanks!

Screenshot of a hanging build


ATC stack trace

atc_stack.txt

Are all of your workers present? (fly workers)

Yes

 fly -t aae-concourse workers -d
name                              containers  platform  tags  team  state    garden address   baggageclaim url        resource types                                                                                                                                        
concourse-ui-ndc-as-b-blue        75          linux     none  none  running  127.0.0.1:53457  http://127.0.0.1:42420  archive, bosh-deployment, bosh-io-release, bosh-io-stemcell, cf, docker-image, git, github-release, hg, pool, s3, semver, time, tracker, vagrant-cloud
concourse-worker-ndc-as-b-blue-0  63          linux     none  none  running  127.0.0.1:50318  http://127.0.0.1:51040  archive, bosh-deployment, bosh-io-release, bosh-io-stemcell, cf, docker-image, git, github-release, hg, pool, s3, semver, time, tracker, vagrant-cloud
concourse-worker-ndc-as-b-blue-1  79          linux     none  none  running  127.0.0.1:51982  http://127.0.0.1:39904  archive, bosh-deployment, bosh-io-release, bosh-io-stemcell, cf, docker-image, git, github-release, hg, pool, s3, semver, time, tracker, vagrant-cloud

Is there a check container present for each of the inputs to your job?

No check containers. All containers for this job appear to be for builds other than the hanging one.

Uptime of workers and ATC

ATC: Process: 2.5h, System: 7d, OS: Ubuntu 14.04, Kernel: 3.19.0-79-generic

Workers:

  • concourse-ui-ndc-as-b-blue: Process: 7h, System: 7d, OS: Ubuntu 14.04, Kernel: 3.19.0-79-generic
  • concourse-worker-ndc-as-b-blue-0: Process: ~21d, System: 28d, OS: Ubuntu 14.04, Kernel: 3.19.0-79-generic
  • concourse-worker-ndc-as-b-blue-1: Process: ~21d, System: 28d, OS: Ubuntu 14.04, Kernel: 3.19.0-79-generic

Are your workers registered directly (BOSH default) or forwarded through the TSA (binary default/external workers registering with BOSH deployment)?

Forwarded through the ATC (binary default)

IaaS

OpenStack

Can you reach the workers from your ATC?

Yes