actions-runner-controller: Pods stuck in terminating state
We are seeing some runner pods get stuck in the Terminating state.
The pod is still registered as a runner in GitHub, and forcefully deleting the runner through the GitHub Actions API makes no difference.
The actions-runner-controller is continuously logging: actions-runner-controller.runnerpod Runner pod is annotated to wait for completion.
This seems to happen when a node that still hosts a runner is deleted.
We are running spot nodes with cluster-autoscaler, which seems to make this issue a bit more apparent.
The last log event for the runner pod is: https://github.com/actions-runner-controller/actions-runner-controller/blob/e7200f274d592729b46848218c1d9c54214065c9/controllers/runner_graceful_stop.go#L115
As the pod has an error exit code (127), we think a forceful delete of the runner is needed. https://github.com/actions-runner-controller/actions-runner-controller/blob/af8d8f7e1da4b32d837428f013b7b68510347343/controllers/runner_graceful_stop.go#L115 only requeues the failed reconcile, so in this case the pod is never deleted.
unregisterRunner should probably be run when the exit code is non-zero.
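To illustrate the idea, here is a minimal sketch, assuming a hypothetical unregister helper and a hypothetical lookup of the runner container status (the names are illustrative, not the actual actions-runner-controller code): when the runner container terminated with a non-zero exit code, the reconcile could force the unregistration instead of only requeueing.

```go
package sketch

// Sketch only: not the actual actions-runner-controller code. It illustrates
// forcing runner unregistration when the runner container exited abnormally
// (e.g. after the node disappeared), instead of requeueing the reconcile forever.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// unregisterFn stands in for the controller's GitHub-side unregistration call.
type unregisterFn func(ctx context.Context, pod *corev1.Pod) error

// runnerContainerStatus returns the status of the "runner" container, if present.
func runnerContainerStatus(pod *corev1.Pod) *corev1.ContainerStatus {
	for i := range pod.Status.ContainerStatuses {
		if pod.Status.ContainerStatuses[i].Name == "runner" {
			return &pod.Status.ContainerStatuses[i]
		}
	}
	return nil
}

// handleTerminatedRunner force-unregisters the runner when its container
// terminated with a non-zero exit code, so the pod can actually be deleted
// instead of staying stuck in Terminating.
func handleTerminatedRunner(ctx context.Context, pod *corev1.Pod, unregister unregisterFn) error {
	st := runnerContainerStatus(pod)
	if st == nil || st.State.Terminated == nil {
		return nil // container still running or status unknown; nothing to do yet
	}
	if st.State.Terminated.ExitCode != 0 {
		// The runner died without deregistering itself; remove it on the GitHub side.
		return unregister(ctx, pod)
	}
	return nil
}
```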
Name:                      runner
Namespace:                 monorepo
Start Time:                Wed, 20 Apr 2022 14:17:24 +0200
Labels:                    actions-runner-controller/inject-registration-token=true
                           pod-template-hash=7cb656579c
                           runner-deployment-name=runner
                           runner-template-hash=54d74c6f69
                           runnerset-name=runner-f87wk-wr4qk
Annotations:               actions-runner-controller/token-expires-at: 2022-04-20T12:43:57Z
                           actions-runner/id: 115340
                           actions-runner/runner-completion-wait-start-timestamp: 2022-04-20T12:17:56Z
                           actions-runner/unregistration-request-timestamp: 2022-04-20T12:14:13Z
                           actions-runner/unregistration-start-timestamp: 2022-04-20T12:17:55Z
                           sync-time: 2022-04-20T11:57:34Z
Status:                    Terminating (lasts 138m)
Termination Grace Period:  0s
Controlled By:             Runner/runner-f87wk-wr4qk
Containers:
  runner:
    State:          Waiting
      Reason:       ContainerCreating
    Last State:     Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was deleted. The container used to be Running
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 4
- Comments: 19 (2 by maintainers)
Commits related to this issue
- Fix runner pods created by RunnerDeployment stuck in Terminating when the runner container disappeared Ref #1369 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Fix runner pods managed by RunnerSet to not stuck in Terminating This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDepl... — committed to actions/actions-runner-controller by mumoshu 2 years ago
- fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420) This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for R... — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared Ref #1369 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- fix: force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared (#1395) Ref #1369 — committed to actions/actions-runner-controller by mumoshu 2 years ago
@mumoshu In our case here, the underlying node is already gone. When we null out the finalizers, cleanup happens fine.
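For reference, a minimal sketch of that workaround, assuming client-go and a placeholder namespace/pod name: it sends a merge patch that nulls out the pod's finalizers so Kubernetes can finish the deletion (a kubectl patch with {"metadata":{"finalizers":null}} achieves the same thing).

```go
package sketch

// Sketch only: clears the finalizers of a stuck runner pod via a JSON merge
// patch, which lets Kubernetes finish deleting it. Namespace and name are
// placeholders for the stuck pod.

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func clearPodFinalizers(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	// Merge patch that removes every finalizer from the pod's metadata.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := client.CoreV1().Pods(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```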
@mumoshu Hello, I’ve already seen this problem, but in a different scenario.
After some analysis, I think the problem occurs because the finalizer can’t complete successfully for some internal reason (the runner is marked offline, or some other reason…).
So a GitHub API call fails and the finalizer fails too.
I also saw that you declare a finalizer for the Runner custom resource; it is probably better to use a watcher to delete the associated resources when the custom resource is deleted.
So when the controller dies for an external reason, nothing is left in Kubernetes to handle the finalizer, and the resources then stay stuck in the Terminating state when you delete them. Probably a bug in k8s. The solution is to clear the finalizer in the metadata field.
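To illustrate that failure mode with a minimal sketch (made-up names, not the actual controller code): the finalization step could treat "the runner is already gone on the GitHub side" as success rather than failing the reconcile, so the finalizer can still be removed.

```go
package sketch

// Sketch only: a finalization step that tolerates a runner which is already
// gone on the GitHub side, instead of failing forever and leaving the pod
// stuck in Terminating. Names and error values are illustrative.

import (
	"context"
	"errors"
)

// errRunnerNotFound stands in for whatever error the GitHub client returns
// when the runner has already been removed (e.g. a 404).
var errRunnerNotFound = errors.New("runner not found")

type githubUnregister func(ctx context.Context, runnerName string) error

// finalizeRunner tries to unregister the runner; "already gone" counts as a
// successful finalization, while other errors are retried on the next reconcile.
func finalizeRunner(ctx context.Context, unregister githubUnregister, runnerName string) error {
	if err := unregister(ctx, runnerName); err != nil {
		if errors.Is(err, errRunnerNotFound) {
			return nil // already unregistered: nothing left to clean up
		}
		return err
	}
	return nil
}
```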