actions-runner-controller: Pod stuck in CrashLoopBackOff state if runner registration happens after the token has expired

Describe the bug If runner registration happens after the runner registration token has expired, registration fails repeatedly, the pod enters the CrashLoopBackOff state for an indefinite period, and it is never removed or updated by the controller.

To Reproduce Currently we see this happening when the time between pod creation and runner registration exceeds the 3-minute window defined in the controller: a pod is created with a token that expires in just over 3 minutes, and the token is only used for registration after it has expired.

One way to simulate the delay between RegistrationTokenUpdated and runner registration is to set STARTUP_DELAY_IN_SECONDS to a value above 3 minutes.

Steps to reproduce the behavior:

  1. Create a new RunnerDeployment resource with STARTUP_DELAY_IN_SECONDS set to 360 (6 minutes) and let the controller create a runner resource and its pod (see the manifest sketch after this list)
  2. Note the registration token expiration time of the newly created runner
  3. About 4 minutes before that expiration, trigger creation of a new runner and pod (by deleting the existing pod)
  4. A new runner (and pod) spawns with a token that expires within roughly 3 minutes and, because of the startup delay, only attempts to register after the token has expired
  5. Registration fails repeatedly, the pod eventually enters the CrashLoopBackOff state, and it remains in that state indefinitely
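
For reference, a minimal RunnerDeployment sketch for step 1, assuming a repository-scoped runner; the metadata and repository names are placeholders, and the env entry is what introduces the artificial startup delay:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: delayed-runnerdeploy        # placeholder name
spec:
  replicas: 1
  template:
    spec:
      repository: my-org/my-repo    # placeholder repository
      env:
        # Delay the runner's startup past the ~3 minute window so that
        # registration is only attempted after the token has expired.
        - name: STARTUP_DELAY_IN_SECONDS
          value: "360"
```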

Expected behavior Runners/pods with an expired registration token should be assigned a new token or be removed.

Environment

  • Controller Version: 0.22.0
  • Deployment Method: Helm
  • Helm Chart Version: 0.17.0

Additional info This also seems to affect HorizontalRunnerAutoscaler with the PercentageRunnersBusy strategy; the crashing pods seem to be counted as running, non-busy pods.

When enough crashing pods accumulate in the CrashLoopBackOff state (enough to push the busy percentage below scaleDownThreshold), scale-down is triggered repeatedly, removing the healthy (finished) pods and keeping the crashing ones until the minimum number of runners is reached, making scale-up impossible until the failing pods are removed manually.
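
For context, a rough sketch of the kind of HorizontalRunnerAutoscaler configuration described above, assuming the PercentageRunnersBusy metric; the names and threshold values are illustrative, not taken from the report:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler          # illustrative name
spec:
  scaleTargetRef:
    name: delayed-runnerdeploy      # the RunnerDeployment being autoscaled
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'      # scale up when more than 75% of runners are busy
      scaleDownThreshold: '0.3'     # crashing pods counted as idle can push the ratio below this
      scaleUpFactor: '1.4'
      scaleDownFactor: '0.7'
```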

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 36

Most upvoted comments

@MichaelSp we’re removing support for the --once flag extremely soon (https://github.com/actions-runner-controller/actions-runner-controller/issues/1196), so you will need to implement one of the suggested solutions soon to avoid an outage. Upgrading GHES to >= 3.3 ASAP is the simplest solution to avoiding an outage.

@mumoshu It does not seem to happen in ARC 0.21.1: after a pod fails to start due to an expired registration token (identical logs to those in my previous message), the pod is removed, a RegistrationTokenUpdated event fires on the runner resource, the runner is updated with a valid token (including the expireat field), a new pod is created with a valid registration token, and all is fine.

@toast-gear Looks like several issues have already been filed (actions/runner#1739, actions/runner#1748) and a fix was merged 4 days ago (actions/runner#1741), but it is not yet included in the latest release (v2.289.2, released 2 days ago).

I think the current fix is sufficient (if your pod takes longer than 30 minutes to start, you have other problems), but it would perhaps be better if the controller could assign a new, valid registration token to runners/pods whose tokens have expired while they are still in the NotReady state.

That would be nice on paper, but it isn’t really doable with the new architecture, to be honest. We now rely solely on the mutating webhook to inject registration tokens. A mutating webhook isn’t a regular Kubernetes controller that can work like “check whether the pod spec contains an expired token in an env var and update it”. It works more like “the pod is being created/updated for whatever reason; I’m going to inject the token, but I don’t care about the other fields or the lifecycle of the pod”.
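
To make that distinction concrete, here is a rough sketch of what the admission-time injection amounts to, assuming the registration token is delivered to the runner pod as an environment variable (the env var name below is illustrative, not necessarily the exact field ARC injects):

```yaml
# The pod spec that reaches the API server already carries a concrete token
# value injected at admission time; the webhook is never consulted again for
# this pod, so an expired value cannot be refreshed in place.
spec:
  containers:
    - name: runner
      env:
        - name: RUNNER_TOKEN        # illustrative env var name
          value: "<registration token, valid for a fixed window at injection time>"
```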

@toast-gear we have upgraded to 0.22.3 and, while we have stopped getting random pods stuck in CrashLoopBackOff every hour, we still encounter some stuck pods (for other reasons that cause some pods to take longer than 30 minutes to start), so at the moment we still need to clean things up manually.

I’m going to close this off seeing as we’ve resolved the core problem.