actions-runner-controller: Pod stuck on CrashLoopBackOff state if runner registration happens after token is expired
Describe the bug
If runner registration happens after the runner registration token has expired, registration fails repeatedly, the pod enters the CrashLoopBackOff state indefinitely, and it never gets removed or updated by the controller.
To Reproduce
Currently we see this happening when the time between pod creation and runner registration exceeds the 3-minute window defined in the controller: a pod is created with a token that expires in just slightly over 3 minutes, and the token is used for registration only after it has expired.
One way to simulate the delay between RegistrationTokenUpdated and runner registration is to set STARTUP_DELAY_IN_SECONDS to a value above 3 minutes.
Steps to reproduce the behavior:
- Create a new RunnerDeployment resource with STARTUP_DELAY_IN_SECONDS set to 360 (6 minutes) and let the controller create a runner resource and a pod for it
- Observe the registration token expiration of the newly created runner
- 4 minutes before its expiration, trigger creation of a new runner and pod (by deleting the pod)
- A new runner (and pod) should spawn with a token that expires within about 3 minutes, and it attempts to register itself after the token has expired
- Registration fails repeatedly, and eventually the pod enters the CrashLoopBackOff state and remains there indefinitely
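The setup above can be sketched as a RunnerDeployment manifest. This is a hypothetical reproduction manifest, not taken from the issue: the resource name and repository are placeholders, and the env var is the STARTUP_DELAY_IN_SECONDS knob mentioned above.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: delayed-runner          # placeholder name
spec:
  replicas: 1
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      env:
        - name: STARTUP_DELAY_IN_SECONDS
          value: "360"   # 6 minutes, exceeding the controller's 3-minute window
```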
Expected behavior
Runners/pods with an expired registration token should be assigned a new token or be removed.
Environment
- Controller Version: 0.22.0
- Deployment Method: Helm
- Helm Chart Version: 0.17.0
Additional info
This also seems to affect HorizontalRunnerAutoscaler with the PercentageRunnersBusy strategy; the crashing pods seem to be counted as running, non-busy pods.
When enough pods enter the CrashLoopBackOff state and accumulate (enough to push the busy percentage below scaleDownThreshold), it triggers scale-down repeatedly, removing the (finished) healthy pods and keeping the crashing pods until the minimum number of runners is reached, making scale-up impossible until the failing pods are manually removed.
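For reference, a minimal HorizontalRunnerAutoscaler sketch of the kind affected; this is an illustrative assumption, not the reporter's actual config, and the names, replica counts, and thresholds are placeholders.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler      # placeholder name
spec:
  scaleTargetRef:
    name: delayed-runner        # placeholder RunnerDeployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"   # CrashLoopBackOff pods counted as non-busy push the ratio below this
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"
```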
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 36
Commits related to this issue
- Make the hard-coded runner startup timeout to avoid race on token expiration longer Ref #1295 — committed to actions/actions-runner-controller by mumoshu 2 years ago
- Make the hard-coded runner startup timeout to avoid race on token expiration longer (#1296) Ref #1295 — committed to actions/actions-runner-controller by mumoshu 2 years ago
@MichaelSp we’re removing support for the --once flag extremely soon (https://github.com/actions-runner-controller/actions-runner-controller/issues/1196); you will need to implement one of the suggested solutions soon to avoid an outage. Upgrading GHES to >= 3.3 ASAP is the simplest solution for avoiding an outage.
@mumoshu It does not seem to happen in ARC 0.21.1: after a pod fails to start due to an expired registration token (identical logs to my previous message), the pod is removed, I see the RegistrationTokenUpdated event firing on the runner resource, the runner gets updated with a valid token (expireat field included), and a new pod gets created with a valid registration token, and all is fine.
@toast-gear Looks like several issues have already been filed (actions/runner#1739, actions/runner#1748) and a fix has been merged (actions/runner#1741) 4 days ago, but it is not yet included in the latest release (v2.289.2, 2 days ago).
That would be nice on paper, but it’s not really doable with the new architecture, to be honest. We now rely solely on the mutating webhook to inject registration tokens. A mutating webhook isn’t a regular Kubernetes controller that can work like “check if the pod spec contains an expired token in an env var and update it”. It works more like “the pod is being created/updated for whatever reason; I’m going to inject the token, but I don’t care about the other fields or the lifecycle of the pod”.
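The “inject and forget” behaviour described above can be sketched as a tiny JSONPatch builder. Everything here is an illustrative assumption, not ARC’s actual implementation: the function name `make_patch` and the env var name `RUNNER_TOKEN` are placeholders, and the point is only that the webhook adds a token at creation time and never revisits it.

```python
def make_patch(pod: dict, token: str) -> list:
    """Build a JSONPatch (RFC 6902) that injects a registration-token
    env var into every container of the pod being admitted.

    Note what it does NOT do: it never checks whether an existing pod
    already carries an expired token -- a mutating webhook only sees
    pods as they are created/updated, unlike a reconciling controller.
    """
    patch = []
    for i, container in enumerate(pod["spec"]["containers"]):
        if not container.get("env"):
            # Container has no env list yet; create an empty one first.
            patch.append({
                "op": "add",
                "path": f"/spec/containers/{i}/env",
                "value": [],
            })
        # Append the token env var to the (possibly just-created) list.
        patch.append({
            "op": "add",
            "path": f"/spec/containers/{i}/env/-",
            "value": {"name": "RUNNER_TOKEN", "value": token},
        })
    return patch
```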
I’m going to close this off seeing as we’ve resolved the core problem.