actions-runner-controller: Pods stuck in terminating state

We are seeing some runner pods get stuck in the terminating state. The pod is still registered as a runner in GitHub, and forcefully deleting the runner via the GitHub Actions API makes no difference. The actions-runner-controller continuously logs: actions-runner-controller.runnerpod Runner pod is annotated to wait for completion.

This seems to happen when a node is deleted while a runner is still running on it.

We are running spot nodes with cluster-autoscaler, which seems to make this issue more apparent.

The last log event for the runner pod is: https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115

As the pod has an error exit code (127), we think a forceful delete of the runner is needed. https://github.com/actions-runner-controller/actions-runner-controller/blob/af8d8f7e1da4b32d837428f013b7b68510347343/controllers/runner_graceful_stop.go#L115 only requeues the failed reconcile and, in this case, never deletes the pod.

unregisterRunner should probably be run if the exit code is non-zero.
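To make the proposal concrete, here is a minimal sketch (not the actual controller code) of the behaviour we have in mind: when the runner container has terminated with a non-zero exit code, call unregisterRunner and let the pod be deleted instead of requeueing indefinitely. Only the unregisterRunner name comes from the existing controller code; its signature here, runnerExitCode, and reconcileStuckPod are illustrative placeholders.

// Sketch only: illustrates the proposed "unregister on non-zero exit" path,
// not the controller's real reconcile loop.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// runnerExitCode returns the exit code of the "runner" container, if it has terminated.
func runnerExitCode(pod *corev1.Pod) (int32, bool) {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "runner" {
			continue
		}
		// Prefer the current terminated state; fall back to the last known one
		// (in the describe output above only the last state is populated).
		if t := cs.State.Terminated; t != nil {
			return t.ExitCode, true
		}
		if t := cs.LastTerminationState.Terminated; t != nil {
			return t.ExitCode, true
		}
	}
	return 0, false
}

// unregisterRunner stands in for the controller's existing helper that removes
// the runner from GitHub via the API; the real signature may differ.
func unregisterRunner(ctx context.Context, runnerID int64) error {
	fmt.Printf("force-unregistering runner %d\n", runnerID)
	return nil
}

// reconcileStuckPod shows the proposed decision: a non-zero exit code means the
// job can never complete, so unregister the runner instead of requeueing.
func reconcileStuckPod(ctx context.Context, pod *corev1.Pod, runnerID int64) error {
	if code, terminated := runnerExitCode(pod); terminated && code != 0 {
		return unregisterRunner(ctx, runnerID)
	}
	// Still running or exited cleanly: keep the current wait-for-completion path.
	return nil
}

func main() {
	// A pod shaped like the one in the describe output below (runner ID 115340, exit code 137).
	pod := &corev1.Pod{Status: corev1.PodStatus{ContainerStatuses: []corev1.ContainerStatus{{
		Name: "runner",
		LastTerminationState: corev1.ContainerState{
			Terminated: &corev1.ContainerStateTerminated{ExitCode: 137},
		},
	}}}}
	if err := reconcileStuckPod(context.Background(), pod, 115340); err != nil {
		fmt.Println("unregister failed:", err)
	}
}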

Name:                      runner
Namespace:                 monorepo
Start Time:                Wed, 20 Apr 2022 14:17:24 +0200
Labels:                    actions-runner-controller/inject-registration-token=true
                           pod-template-hash=7cb656579c
                           runner-deployment-name=runner
                           runner-template-hash=54d74c6f69
                           runnerset-name=runner-f87wk-wr4qk
Annotations:               actions-runner-controller/token-expires-at: 2022-04-20T12:43:57Z
                           actions-runner/id: 115340
                           actions-runner/runner-completion-wait-start-timestamp: 2022-04-20T12:17:56Z
                           actions-runner/unregistration-request-timestamp: 2022-04-20T12:14:13Z
                           actions-runner/unregistration-start-timestamp: 2022-04-20T12:17:55Z
                           sync-time: 2022-04-20T11:57:34Z
Status:                    Terminating (lasts 138m)
Termination Grace Period:  0s
Controlled By:  Runner/runner-f87wk-wr4qk
Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted.  The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 19 (2 by maintainers)

Most upvoted comments

@mumoshu In our case the underlying node is already gone. When we null out the finalizers, cleanup happens fine.

@mumoshu Hello, I have already seen this problem, but in a different scenario.

After some analysis, I think the problem occurs because the finalizer handling cannot complete successfully for some internal reason (the runner is marked offline, or any other reason…).

So a GitHub API call fails, and the finalizer handling fails too.

I also saw that you declare a finalizer on the Runner custom resource; it is probably better to use a watcher to delete associated resources when the custom resource is deleted.

So when the controller dies for an external reason, the finalizer handler is gone from Kubernetes, and the resources then stay stuck in the terminating state when you delete them. It is probably a bug in Kubernetes. The solution is to clear the finalizer in the metadata field.
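For others hitting this before a fix lands, here is a minimal sketch of the workaround described above: clearing the finalizers on the stuck pod so the API server can finish the deletion (a stuck Runner resource may need the same treatment). It uses client-go; the equivalent kubectl patch command is shown in the comment. The namespace and pod name are taken from the describe output above and will differ in your cluster.

// Sketch of the workaround only: clear the finalizers on a runner pod that is
// stuck terminating because its node is gone. Equivalent to:
//   kubectl patch pod runner -n monorepo --type=merge -p '{"metadata":{"finalizers":null}}'
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (assumes the default ~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Null out metadata.finalizers so the pending deletion can complete.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err = clientset.CoreV1().Pods("monorepo").Patch(
		context.Background(), "runner", types.MergePatchType, patch, metav1.PatchOptions{},
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("finalizers cleared; the pod should now be removed")
}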