concourse: No Workers error happening a lot

Bug Report

On a fairly small CI installation, we are hitting the “no workers” error much more often than I think we should. It can happen when checking a resource or when a new build is triggered from the web interface.

Initially we were getting this error with just one worker and assumed adding a second worker would fix the problem. It did not.

The worker(s) are on the same machine as concourse-web, which should mostly rule out network issues. Is there a known issue with running the concourse-web and concourse-worker instances on the same machine?
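
If the root cause is CPU contention on the shared machine (CPU-heavy builds starving the Concourse daemons is our working guess), one mitigation sketch is to give concourse-web scheduling priority through a systemd drop-in. This assumes both processes run as systemd services; CPUWeight requires cgroup v2:

# /etc/systemd/system/concourse-web.service.d/priority.conf
# (create with: systemctl edit concourse-web)
[Service]
Nice=-5          # favor the web node over build workloads
CPUWeight=500    # cgroup v2 CPU weight; the default is 100

# apply the change:
systemctl daemon-reload
systemctl restart concourse-web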

Steps to Reproduce

  1. Set up a binary installation where the worker(s) and web nodes are on the same machine
  2. Set up a moderately loaded CI environment: Vault integration and 5 pipelines, each with 1 or 2 resources and a single job/task. (If your tasks do CPU-intensive work such as compiling C/C++ code, that is even closer to what we are doing.)
  3. Manually trigger all pipelines, triggering some more than once so builds run in parallel (steps 3 and 4 can be scripted with fly; see the sketch after this list)
  4. Open pipelines at random, select a resource, and click the refresh icon to manually check it for new versions
  5. Keep interacting with the Concourse UI and eventually you will hit the error
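
A minimal sketch of steps 3 and 4 from the command line, using standard fly subcommands. The pipeline, job, and resource names below are placeholders (the logs show a pipeline p-testing and resource resource-p-repo); adjust them to your setup:

# assumes a fly target named "local" and 5 pipelines, each with a job
# called "build" and a resource called "repo" (placeholder names)
for p in p-testing-1 p-testing-2 p-testing-3 p-testing-4 p-testing-5; do
  fly -t local trigger-job -j "$p/build"     # step 3: kick the pipeline off
  fly -t local trigger-job -j "$p/build"     # trigger again so builds overlap
  fly -t local check-resource -r "$p/repo"   # step 4: force a resource check
done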

Expected Results

Workers do not get marked as stalled, and everything queued up just runs.

Actual Results

Workers are getting marked as stalled, and everything that is queued up fails with the “no workers” error.

Additional Context

There is a lot of noise in the log, but here is a relevant excerpt from journalctl -u concourse-web:

Jul 15 13:41:01 hostname concourse[1293]: {"timestamp":"2019-07-15T18:41:01.307793431Z","level":"info","source":"atc","message":"atc.collector.tick.worker-collector.marked-workers-as-stalled","data":{"count":2,"session":"21.90.3","workers":["hostname-builder-2","hostname-builder"]}}
Jul 15 13:41:01 hostname concourse[1293]: {"timestamp":"2019-07-15T18:41:01.340907222Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.failed-to-choose-a-worker","data":{"error":"no workers","pipeline":"p-testing","resource":"resource-p-repo","session":"18.13.1.1.5","team":"main"}}
Jul 15 13:41:01 hostname concourse[1293]: {"timestamp":"2019-07-15T18:41:01.344311855Z","level":"error","source":"atc","message":"atc.pipelines.radar.failed-to-run-scan-resource","data":{"error":"no workers","pipeline":"p-testing","session":"18.13","team":"main"}}
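
There is enough noise that filtering helps. A small sketch, assuming jq is available and that journalctl's cat output strips the syslog prefix down to the JSON payload (the grep guards against any non-JSON lines):

journalctl -u concourse-web -o cat | grep '^{' \
  | jq -c 'select(((.message // "") | contains("stalled")) or (.data.error? == "no workers"))'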

Meanwhile, fly reports that two workers are registered and running (fly -t local workers):

name                containers  platform  tags  team  state    version
hostname-builder    6           linux     none  none  running  2.1
hostname-builder-2  8           linux     none  none  running  2.1
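
Two standard fly invocations help observe and recover from the flip-flop between running and stalled:

watch -n 5 'fly -t local workers'              # poll worker state while builds run
fly -t local prune-worker -w hostname-builder  # remove a worker stuck in the
                                               # stalled state so it can re-register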

Version Info

  • Concourse version: 5.1.0
  • Deployment type (BOSH/Docker/binary): binary
  • Infrastructure/IaaS: Bare Metal
  • Browser (if applicable): N/A
  • Did this used to work? No; we have been affected by this since at least 4.0

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 28 (8 by maintainers)

Most upvoted comments

Ok, I’m going to set it to 90, restart the web node and see how that goes. Thanks for the tip!
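
The comment above does not name the flag, but if “it” is the heartbeat interval that governs when the ATC marks workers as stalled, the change on a binary/systemd install might look like the following sketch. The exact variable name is an assumption; confirm against concourse web --help:

# /etc/systemd/system/concourse-web.service.d/heartbeat.conf
# Assumption: "it" = the TSA heartbeat interval (default 30s); a longer
# window gives busy workers more slack before being marked stalled.
[Service]
Environment=CONCOURSE_TSA_HEARTBEAT_INTERVAL=90s

# apply:
systemctl daemon-reload
systemctl restart concourse-web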