concourse: No Workers error happening a lot
Bug Report
On a smaller CI installation, we are getting the “no workers” error far more often than we should. It can happen when checking a resource or when a new job is triggered from the web interface.
Initially we were getting this error with just one worker, and we assumed adding a second worker would fix the problem. It did not.
The worker(s) are on the same machine as concourse-web, which should mostly rule out network issues. Is there a known issue with running the concourse-web and concourse-worker processes on the same machine?
Steps to Reproduce
- Set up a binary installation where the worker(s) and web nodes are on the same machine
- Set up a moderately sized CI environment (Vault integration, 5 pipelines, each with 1 or 2 resources and a single job/task). If you can get your tasks to do CPU-intensive work such as compiling C/C++ code, that is even closer to what we are doing; see the sketch after this list.
- Manually kick off all pipelines, triggering some of them more than once so builds run in parallel
- Open pipelines at random, select a resource, and click the refresh icon to manually check it for new versions
- Keep interacting with the Concourse UI, and eventually you’ll hit the error.
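For reference, here is a sketch of the kind of pipeline that reproduces this for us. The repo URL and gcc image are placeholders, and the Vault-backed credentials our real pipelines use are omitted; the pipeline name p-testing matches the logs below.

```sh
# Hypothetical repro pipeline: one git resource feeding one CPU-bound job.
cat > repro.yml <<'EOF'
resources:
- name: repo
  type: git
  source:
    uri: https://example.com/some/repo.git   # placeholder

jobs:
- name: build
  plan:
  - get: repo
    trigger: true
  - task: compile
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: gcc}            # any CPU-hungry toolchain image
      inputs:
      - name: repo
      run:
        path: sh
        args: ["-c", "cd repo && make -j\"$(nproc)\""]
EOF

fly -t local set-pipeline -n -p p-testing -c repro.yml
fly -t local unpause-pipeline -p p-testing
fly -t local trigger-job -j p-testing/build       # kick it off more than once for parallel builds
fly -t local check-resource -r p-testing/repo     # same effect as clicking the refresh icon
```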
Expected Results
Workers don’t get marked as stalled and everything just works.
Actual Results
Workers are getting marked as stalled, and everything that is queued up fails with a “no workers” error.
Additional Context
There’s a lot of noise in the log, but here is the relevant part of journalctl -u concourse-web:
Jul 15 13:41:01 hostname concourse[1293]: {"timestamp":"2019-07-15T18:41:01.307793431Z","level":"info","source":"atc","message":"atc.collector.tick.worker-collector.marked-workers-as-stalled","data":{"count":2,"session":"21.90.3","workers":["hostname-builder-2","hostname-builder"]}}
Jul 15 13:41:01 hostname concourse[1293]: {"timestamp":"2019-07-15T18:41:01.340907222Z","level":"error","source":"atc","message":"atc.pipelines.radar.scan-resource.interval-runner.tick.failed-to-choose-a-worker","data":{"error":"no workers","pipeline":"p-testing","resource":"resource-p-repo","session":"18.13.1.1.5","team":"main"}}
Jul 15 13:41:01 hostname concourse[1293]: {"timestamp":"2019-07-15T18:41:01.344311855Z","level":"error","source":"atc","message":"atc.pipelines.radar.failed-to-run-scan-resource","data":{"error":"no workers","pipeline":"p-testing","session":"18.13","team":"main"}}
fly tells us there are actually two workers available (fly -t local workers):
name                containers  platform  tags  team  state    version
hostname-builder    6           linux     none  none  running  2.1
hostname-builder-2  8           linux     none  none  running  2.1
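When the workers do get stuck in the stalled state, pruning them lets them re-register on their next heartbeat (assuming the worker processes themselves are still running); a sketch using the worker names above:

```sh
fly -t local workers                              # confirm which workers report as stalled
fly -t local prune-worker -w hostname-builder     # remove the stalled registration
fly -t local prune-worker -w hostname-builder-2   # each worker re-registers on its next heartbeat
```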
Version Info
- Concourse version: 5.1.0
- Deployment type (BOSH/Docker/binary): binary
- Infrastructure/IaaS: Bare Metal
- Browser (if applicable): N/A
- Did this used to work? No; we have been affected by this since at least 4.0
About this issue
- State: closed
- Created 5 years ago
- Comments: 28 (8 by maintainers)
OK, I’m going to set it to 90, restart the web node, and see how that goes. Thanks for the tip!
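For a binary installation managed by systemd (as the journalctl -u concourse-web output above suggests), that restart would be something like:

```sh
sudo systemctl restart concourse-web   # assumes the web node runs under a unit named concourse-web
```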