concourse: Old workers stay in `stalled` state indefinitely after restart

Bug Report

After a rolling restart of Concourse workers, the old workers remain in the `stalled` state until they are manually pruned:

name                                  containers  platform  tags  team  state    version
0eecd635-cacd-413c-a9e5-2446e45ae176  40          linux     none  none  running  1.1
2ae152da-cf79-429c-bb6f-921d8712d658  63          linux     none  none  running  1.1
323e2c8f-4ef6-49bc-9771-0cd531ce70f4  41          linux     none  none  running  1.1
5ba8fa75-1de7-4625-92f8-14a69c63aa21  0           linux     iaas  main  running  1.1
63dbb4d5-0c5b-4043-8f5e-90ef886fb2d6  21          linux     none  none  running  1.1
b8198ced-0601-4b05-ba49-0160ae4f3652  23          linux     none  none  running  1.1
be65bedd-1701-411a-8859-768bb5df2e43  0           linux     iaas  main  running  1.1


the following workers have not checked in recently:

name                                  containers  platform  tags  team  state    version
0eecd635-61ee-436d-b516-927e8888ca7d  122         linux     none  none  stalled  1.1
0eecd635-f983-476a-ba9f-29e684b8b692  97          linux     none  none  stalled  1.1
323e2c8f-6ece-430d-991c-b0a1b03f4ec9  92          linux     none  none  stalled  1.1
323e2c8f-dfa4-44cf-8f62-c9ab636b1d0c  89          linux     none  none  stalled  1.1
5ba8fa75-7805-476c-ac6e-c86a2d1f3562  1           linux     iaas  main  stalled  1.1
5ba8fa75-caf3-4761-961b-7ea7533658f6  0           linux     iaas  main  stalled  1.1
63dbb4d5-6eea-4853-880a-dac8d2766012  87          linux     none  none  stalled  1.1
63dbb4d5-cfcb-4cfa-a036-a63d208b7eb7  105         linux     none  none  stalled  1.1
b8198ced-5dcc-40b7-afc0-6abc316ee81e  119         linux     none  none  stalled  1.1
b8198ced-dee5-453c-afc8-17f71b4b0a31  105         linux     none  none  stalled  1.1
be65bedd-0029-4cbb-9a85-58af3fb8997f  1           linux     iaas  main  stalled  1.1
be65bedd-c674-48c5-a0ad-44ab9006c5a9  0           linux     iaas  main  stalled  1.1

Some of those stalled workers are from a recent restart; the others are from a restart over 24h ago.

I would expect old workers to get cleaned up during the deploy process. Or is this expected behavior? Are operators supposed to manually prune old workers after every restart?
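For reference, "manually pruned" here means running fly's prune-worker against each stalled worker by name, roughly like the following (the target name is just a placeholder):

      # prune a single stalled worker by name; "my-target" is a placeholder fly target
      fly -t my-target prune-worker -w 0eecd635-61ee-436d-b516-927e8888ca7d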

  • Concourse version: 3.4
  • Deployment type (BOSH/Docker/binary): bosh
  • Infrastructure/IaaS: aws
  • Browser (if applicable): n/a
  • Did this use to work?

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 4
  • Comments: 16 (7 by maintainers)

Most upvoted comments

I can reproduce this on Kubernetes when a node VM is destroyed. The worker enters the stalled state, and once it is rescheduled a few seconds later it does not recover without pruning.

The worker logs just print `Process exited with status 1` repeatedly.

I’m experiencing the same problem on version 3.8.0 using the official Helm chart.

As @ahume reported, my worker logs also just print `Process exited with status 1` repeatedly.

We should probably just make them more resilient and wait for the TSA to become reachable again.

this is still true 🤔 (@vito)

I cobbled together https://github.com/luispabon/concourse-workers-cleaner specifically for Kubernetes, as a CronJob, for those who cannot yet upgrade Concourse. It’s a bit filthy, but it does work.

I can confirm this is no longer an issue for me on Kubernetes after upgrading to Concourse 4.2.x and passing the `--ephemeral` flag in the worker pod definition, like so:

      containers:
        - image: concourse/concourse
          name: worker
          args: [worker, --ephemeral]

I am experiencing this behavior on our Kubernetes v1.11 (AWS EKS) cluster; it seems to happen once a node goes into an unknown state. I deploy Concourse v4.2.2 using Helm chart v3.7.3 and was able to get my workers back into a running state by issuing `fly -t cloud prune-worker -w <worker>` for every worker that was in the stalled state. I had to issue the command multiple times before the workers actually joined back. I am using persistent storage for the worker nodes and I am deploying the PostgreSQL database externally.
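If you need to prune a whole batch of stalled workers at once, a minimal sketch is below. It assumes a fly target named `cloud` and the `fly workers` column layout shown above, where `state` is the 6th column; adjust both for your setup.

      # prune every worker currently reported as stalled
      # assumes target "cloud" and that "state" is the 6th column of `fly workers`
      fly -t cloud workers \
        | awk '$6 == "stalled" { print $1 }' \
        | while read -r worker; do
            fly -t cloud prune-worker -w "$worker"
          done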