actions-runner-controller: Controller is "leaking" offline runners (i.e. is not always calling Github's API to remove them)
Controller Version
0.25.2
Helm Chart Version
0.20.2
CertManager Version
1.9.0
Deployment Method
Helm
cert-manager installation
- yes
- yes
Checks
- This isn’t a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical enough to need priority support.)
- I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
Resource Definitions
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  creationTimestamp: "2022-08-05T08:48:21Z"
  generation: 5145
  name: autochecks
  namespace: github-runner
  resourceVersion: "22349026"
  uid: d5660e91-55bf-44a1-9a9d-de7ea59dcde3
spec:
  effectiveTime: "2022-08-16T17:05:44Z"
  replicas: 5
  selector: null
  template:
    metadata: {}
    spec:
      dockerEnabled: true
      dockerVolumeMounts:
      - mountPath: /home/runner/_work/
        name: shared
        readOnly: false
      dockerdContainerResources:
        limits:
          cpu: 1
          ephemeral-storage: 40Gi
          memory: 4Gi
        requests:
          cpu: 500m
          ephemeral-storage: 4Gi
          memory: 1Gi
      env:
      - name: RUNNER_ALLOW_RUNASROOT
        value: "1"
      image: 123123123123.dkr.ecr.us-east-1.amazonaws.com/github-runner:node-12
      imagePullSecrets:
      - name: kubernetes
      initContainers:
      - command:
        - sh
        - -c
        - cp /tmp/dockercfg/* /home/runner/.docker/
        image: public.ecr.aws/docker/library/busybox:stable-musl
        name: copy-dockerconfig
        volumeMounts:
        - mountPath: /home/runner/.docker/
          name: dockercfg
        - mountPath: /tmp/dockercfg
          name: dockercfg-secret
      labels:
      - automated-checks
      - automated-checks-ephemeral
      nodeSelector:
        eks.amazonaws.com/nodegroup: github-runners-tools-main
      organization: MyOrg
      resources:
        limits:
          cpu: 1
          ephemeral-storage: 40Gi
          memory: 1Gi
        requests:
          cpu: 1
          ephemeral-storage: 4Gi
          memory: 1Gi
      securityContext:
        fsGroup: 1000
      serviceAccountName: github-runner
      tolerations:
      - effect: NoExecute
        key: dedicated
        operator: Equal
        value: github-runners
      volumeMounts:
      - mountPath: /home/runner/.docker/
        name: dockercfg
      - mountPath: /home/runner/_work/
        name: shared
        readOnly: false
      volumes:
      - emptyDir: {}
        name: dockercfg
      - emptyDir: {}
        name: shared
      - name: dockercfg-secret
        secret:
          items:
          - key: .dockerconfigjson
            path: config.json
          secretName: dockercfg
status:
  availableReplicas: 5
  desiredReplicas: 5
  readyReplicas: 5
  replicas: 5
  updatedReplicas: 5
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  creationTimestamp: "2022-08-05T08:48:21Z"
  generation: 7479
  name: autochecks
  namespace: github-runner
  resourceVersion: "22350550"
  uid: 02cf7e48-e161-4e4b-8630-40a28e559ccb
spec:
  capacityReservations:
  - effectiveTime: "2022-08-16T17:06:41Z"
    expirationTime: "2022-08-16T17:11:41Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:46Z"
    expirationTime: "2022-08-16T17:11:46Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:47Z"
    expirationTime: "2022-08-16T17:11:47Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:47Z"
    expirationTime: "2022-08-16T17:11:47Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:49Z"
    expirationTime: "2022-08-16T17:11:49Z"
    replicas: 1
  maxReplicas: 20
  minReplicas: 1
  scaleDownDelaySecondsAfterScaleOut: 30
  scaleTargetRef:
    name: autochecks
  scaleUpTriggers:
  - duration: 5m0s
    githubEvent:
      workflowJob: {}
  scheduledOverrides:
  - endTime: "2022-05-16T05:00:00Z"
    minReplicas: 0
    recurrenceRule:
      frequency: Weekly
    startTime: "2022-05-13T22:00:00Z"
  - endTime: "2022-05-17T05:00:00Z"
    minReplicas: 0
    recurrenceRule:
      frequency: Daily
    startTime: "2022-05-16T22:00:00Z"
status:
  desiredReplicas: 6
  lastSuccessfulScaleOutTime: "2022-08-16T17:06:50Z"
  scheduledOverridesSummary: min=0 time=2022-08-16 22:00:00 +0000 UTC
```
To Reproduce
1. Shut down nodes constantly, or kill runner pods directly with `kubectl delete`. Or just wait a few weeks or months with cluster autoscaling scaling in and out constantly.
2. Watch the list returned by the `ListRunners` endpoint grow to hundreds of thousands of entries (see the sketch after this list for one way to observe this).
3. The controller misbehaves and stops working properly. Examples: lots of runner pods stuck in Pending state, runner pods not getting deleted, other RunnerDeployments starving without any replicas, and so on.
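For step 2, the growth is easy to observe outside the controller. Below is a minimal sketch (not part of ARC) that pages through the organization's registered runners with go-github and reports how many are offline; `GITHUB_TOKEN` (a PAT with `admin:org` scope) and `ORG` are environment variables assumed only for this example:

```go
// offline_count.go: count an organization's registered runners and how many
// of them GitHub currently reports as "offline".
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v45/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))
	org := os.Getenv("ORG") // hypothetical: your organization name

	opts := &github.ListOptions{PerPage: 100}
	total, offline := 0, 0
	for {
		// GET /orgs/{org}/actions/runners, one page at a time
		runners, resp, err := client.Actions.ListOrganizationRunners(ctx, org, opts)
		if err != nil {
			panic(err)
		}
		for _, r := range runners.Runners {
			total++
			if r.GetStatus() == "offline" {
				offline++
			}
		}
		if resp.NextPage == 0 {
			break
		}
		opts.Page = resp.NextPage
	}
	fmt.Printf("registered runners: %d, offline: %d\n", total, offline)
}
```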
Describe the bug
ARC is “leaking” offline runners on GitHub in specific situations: for example, when the cluster autoscaler scales nodes in and out and there are idle runners on the nodes being removed, those runners don’t “auto-remove” themselves from GitHub, and IMO cleaning them up should be the controller’s responsibility. This issue was probably introduced in 0.22, since the release notes say that unnecessary “Remove Runner” API calls were removed.
On https://github.com/organizations/MyOrg/settings/actions/runners it is visible that the list grows a little every day until it starts affecting the controller, which has to execute the `ListRunners` function as part of the autoscaling algorithm (see here). This becomes a problem once thousands of offline runners have accumulated and the controller needs to iterate over all of them (see here).
In my case, I noticed that the controller was not working correctly and there were more than 300 pages of offline runners listed for my organization. After I deleted all the offline runners, the controller started working again. Now I run a script at the end of the day to remove all the offline runners “leaked” by the controller during the day, to keep this from happening again.
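The cleanup is essentially the following; this is a minimal sketch rather than my exact script, again assuming `GITHUB_TOKEN` (a PAT with `admin:org` scope) and `ORG` are provided. It collects all offline, non-busy runners first and then removes them through the organization runner removal endpoint:

```go
// cleanup_offline_runners.go: remove every runner the GitHub API reports as
// offline (and not busy) for the organization.
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/google/go-github/v45/github"
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))
	org := os.Getenv("ORG") // hypothetical: your organization name

	// Collect offline runner IDs first, then delete, so pagination is not
	// disturbed by removals happening mid-listing.
	var offline []*github.Runner
	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListOrganizationRunners(ctx, org, opts)
		if err != nil {
			log.Fatal(err)
		}
		for _, r := range runners.Runners {
			if r.GetStatus() == "offline" && !r.GetBusy() {
				offline = append(offline, r)
			}
		}
		if resp.NextPage == 0 {
			break
		}
		opts.Page = resp.NextPage
	}

	for _, r := range offline {
		// DELETE /orgs/{org}/actions/runners/{runner_id}
		if _, err := client.Actions.RemoveOrganizationRunner(ctx, org, r.GetID()); err != nil {
			log.Printf("failed to remove %s (id %d): %v", r.GetName(), r.GetID(), err)
			continue
		}
		fmt.Printf("removed offline runner %s (id %d)\n", r.GetName(), r.GetID())
	}
}
```

Running something like this as a nightly CronJob keeps the offline list small, but it is only a workaround; the controller should not leak the runners in the first place.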
Describe the expected behavior
I would expect the controller to clean up all offline runners via the GitHub API in all situations. If that doesn’t happen, they accumulate until the controller is affected when executing the `ListRunners` function.
Controller Logs
I don't have controller logs from when this was happening, and since I deleted the offline runners it is now working properly. I also noticed that the `ListRunners` function (https://github.com/actions-runner-controller/actions-runner-controller/blob/538e2783d7fde279b84f1ff9351bb1486823660b/github/github.go#L219-L244) does not log anything, so logs wouldn't have been helpful for debugging this issue anyway.
Runner Pod Logs
https://gist.github.com/bmbferreira/0ddb1efe1fe8e0dcfe4615aec2c6150a
Additional Context
No response
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 18
@bmbferreira take a look at https://github.blog/changelog/2022-08-03-github-actions-remove-offline-self-hosted-runners/
@jbuettnerbild Hey! Thanks for the update. If the 24h automated ephemeral runner cleanup doesn’t work, I think something is going wrong on the GitHub side. Could you open a support ticket with GitHub so that they can check their backend and see what’s happening?