concourse: Old workers stay in `stalled` state indefinitely after restart
Bug Report
After a rolling restart of Concourse workers, old workers remain in the `stalled` state until manually pruned:
```
name                                  containers  platform  tags  team  state    version
0eecd635-cacd-413c-a9e5-2446e45ae176  40          linux     none  none  running  1.1
2ae152da-cf79-429c-bb6f-921d8712d658  63          linux     none  none  running  1.1
323e2c8f-4ef6-49bc-9771-0cd531ce70f4  41          linux     none  none  running  1.1
5ba8fa75-1de7-4625-92f8-14a69c63aa21  0           linux     iaas  main  running  1.1
63dbb4d5-0c5b-4043-8f5e-90ef886fb2d6  21          linux     none  none  running  1.1
b8198ced-0601-4b05-ba49-0160ae4f3652  23          linux     none  none  running  1.1
be65bedd-1701-411a-8859-768bb5df2e43  0           linux     iaas  main  running  1.1

the following workers have not checked in recently:

name                                  containers  platform  tags  team  state    version
0eecd635-61ee-436d-b516-927e8888ca7d  122         linux     none  none  stalled  1.1
0eecd635-f983-476a-ba9f-29e684b8b692  97          linux     none  none  stalled  1.1
323e2c8f-6ece-430d-991c-b0a1b03f4ec9  92          linux     none  none  stalled  1.1
323e2c8f-dfa4-44cf-8f62-c9ab636b1d0c  89          linux     none  none  stalled  1.1
5ba8fa75-7805-476c-ac6e-c86a2d1f3562  1           linux     iaas  main  stalled  1.1
5ba8fa75-caf3-4761-961b-7ea7533658f6  0           linux     iaas  main  stalled  1.1
63dbb4d5-6eea-4853-880a-dac8d2766012  87          linux     none  none  stalled  1.1
63dbb4d5-cfcb-4cfa-a036-a63d208b7eb7  105         linux     none  none  stalled  1.1
b8198ced-5dcc-40b7-afc0-6abc316ee81e  119         linux     none  none  stalled  1.1
b8198ced-dee5-453c-afc8-17f71b4b0a31  105         linux     none  none  stalled  1.1
be65bedd-0029-4cbb-9a85-58af3fb8997f  1           linux     iaas  main  stalled  1.1
be65bedd-c674-48c5-a0ad-44ab9006c5a9  0           linux     iaas  main  stalled  1.1
```
Some of those stalled workers are from a recent restart; the others are from a restart over 24h ago.
I would expect old workers to get cleaned up during the deploy process. Or is this expected behavior, and are operators supposed to manually prune old workers after every restart?
- Concourse version: 3.4
- Deployment type (BOSH/Docker/binary): bosh
- Infrastructure/IaaS: aws
- Browser (if applicable): n/a
- Did this used to work?
About this issue
- State: closed
- Created 7 years ago
- Reactions: 4
- Comments: 16 (7 by maintainers)
I can reproduce this on Kubernetes when a node VM is destroyed. The worker enters the `stalled` state and, once rescheduled a few seconds later, does not recover without pruning. Worker logs just print `Process exited with status 1` repeatedly.
I'm experiencing the same problem with version 3.8.0 using the official Helm chart. As @ahume reported, I'm also only seeing the worker logs print `Process exited with status 1` repeatedly.
this is still true 🤔 (@vito)
I cobbled together https://github.com/luispabon/concourse-workers-cleaner specifically for Kubernetes, run as a cronjob, for those who cannot yet upgrade Concourse. It's a bit filthy, but it does work.
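For illustration only, a minimal sketch of that same cronjob idea (not the linked tool itself) could shell out to `fly`, assuming the CLI is on the PATH, already logged in to a target, and recent enough to support `fly workers --json`; the target name `ci` is a placeholder:

```python
#!/usr/bin/env python3
"""Minimal sketch of a stalled-worker pruner, intended to run as a cronjob.

Assumptions (not taken from the linked project): the `fly` CLI is on the
PATH, already logged in to the target below, and new enough to support
`fly workers --json`.
"""
import json
import subprocess

FLY_TARGET = "ci"  # placeholder target name; adjust for your installation


def list_workers():
    """Return the worker list as parsed JSON."""
    result = subprocess.run(
        ["fly", "-t", FLY_TARGET, "workers", "--json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def prune_stalled_workers():
    """Prune every worker that currently reports the stalled state."""
    for worker in list_workers():
        if worker.get("state") == "stalled":
            print(f"pruning stalled worker {worker['name']}")
            subprocess.run(
                ["fly", "-t", FLY_TARGET, "prune-worker", "-w", worker["name"]],
                check=True,
            )


if __name__ == "__main__":
    prune_stalled_workers()
```

Run on a schedule (for example, as a Kubernetes CronJob every few minutes), something like this keeps the worker list free of stalled entries between restarts.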
I can confirm this is no longer an issue for me in Kubernetes after upgrading to Concourse 4.2.x and passing the `--ephemeral` flag to the workers pod definition.
I am experiencing this behavior with our Kubernetes v1.11 (AWS EKS) cluster; it seems to happen once a node appears to go `unknown`. I deploy Concourse v4.2.2 using Helm chart v3.7.3 and was able to get my workers back into a `running` state by issuing `fly -t cloud prune-worker -w <worker>` on all workers that were in a `stalled` state. It appeared I had to issue the command multiple times before the workers actually joined back. I am using persistent storage for the worker nodes, and I am deploying the PostgreSQL database externally.