actions-runner-controller: Controller is "leaking" offline runners (i.e. it does not always call GitHub's API to remove them)

Controller Version

0.25.2

Helm Chart Version

0.20.2

CertManager Version

1.9.0

Deployment Method

Helm

cert-manager installation

  • yes
  • yes

Checks

  • This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with one of the contributors or maintainers if your business is critical and you need priority support)
  • I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  creationTimestamp: "2022-08-05T08:48:21Z"
  generation: 5145
  name: autochecks
  namespace: github-runner
  resourceVersion: "22349026"
  uid: d5660e91-55bf-44a1-9a9d-de7ea59dcde3
spec:
  effectiveTime: "2022-08-16T17:05:44Z"
  replicas: 5
  selector: null
  template:
    metadata: {}
    spec:
      dockerEnabled: true
      dockerVolumeMounts:
      - mountPath: /home/runner/_work/
        name: shared
        readOnly: false
      dockerdContainerResources:
        limits:
          cpu: 1
          ephemeral-storage: 40Gi
          memory: 4Gi
        requests:
          cpu: 500m
          ephemeral-storage: 4Gi
          memory: 1Gi
      env:
      - name: RUNNER_ALLOW_RUNASROOT
        value: "1"
      image: 123123123123.dkr.ecr.us-east-1.amazonaws.com/github-runner:node-12
      imagePullSecrets:
      - name: kubernetes
      initContainers:
      - command:
        - sh
        - -c
        - cp /tmp/dockercfg/* /home/runner/.docker/
        image: public.ecr.aws/docker/library/busybox:stable-musl
        name: copy-dockerconfig
        volumeMounts:
        - mountPath: /home/runner/.docker/
          name: dockercfg
        - mountPath: /tmp/dockercfg
          name: dockercfg-secret
      labels:
      - automated-checks
      - automated-checks-ephemeral
      nodeSelector:
        eks.amazonaws.com/nodegroup: github-runners-tools-main
      organization: MyOrg
      resources:
        limits:
          cpu: 1
          ephemeral-storage: 40Gi
          memory: 1Gi
        requests:
          cpu: 1
          ephemeral-storage: 4Gi
          memory: 1Gi
      securityContext:
        fsGroup: 1000
      serviceAccountName: github-runner
      tolerations:
      - effect: NoExecute
        key: dedicated
        operator: Equal
        value: github-runners
      volumeMounts:
      - mountPath: /home/runner/.docker/
        name: dockercfg
      - mountPath: /home/runner/_work/
        name: shared
        readOnly: false
      volumes:
      - emptyDir: {}
        name: dockercfg
      - emptyDir: {}
        name: shared
      - name: dockercfg-secret
        secret:
          items:
          - key: .dockerconfigjson
            path: config.json
          secretName: dockercfg
status:
  availableReplicas: 5
  desiredReplicas: 5
  readyReplicas: 5
  replicas: 5
  updatedReplicas: 5
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  creationTimestamp: "2022-08-05T08:48:21Z"
  generation: 7479
  name: autochecks
  namespace: github-runner
  resourceVersion: "22350550"
  uid: 02cf7e48-e161-4e4b-8630-40a28e559ccb
spec:
  capacityReservations:
  - effectiveTime: "2022-08-16T17:06:41Z"
    expirationTime: "2022-08-16T17:11:41Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:46Z"
    expirationTime: "2022-08-16T17:11:46Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:47Z"
    expirationTime: "2022-08-16T17:11:47Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:47Z"
    expirationTime: "2022-08-16T17:11:47Z"
    replicas: 1
  - effectiveTime: "2022-08-16T17:06:49Z"
    expirationTime: "2022-08-16T17:11:49Z"
    replicas: 1
  maxReplicas: 20
  minReplicas: 1
  scaleDownDelaySecondsAfterScaleOut: 30
  scaleTargetRef:
    name: autochecks
  scaleUpTriggers:
  - duration: 5m0s
    githubEvent:
      workflowJob: {}
  scheduledOverrides:
  - endTime: "2022-05-16T05:00:00Z"
    minReplicas: 0
    recurrenceRule:
      frequency: Weekly
    startTime: "2022-05-13T22:00:00Z"
  - endTime: "2022-05-17T05:00:00Z"
    minReplicas: 0
    recurrenceRule:
      frequency: Daily
    startTime: "2022-05-16T22:00:00Z"
status:
  desiredReplicas: 6
  lastSuccessfulScaleOutTime: "2022-08-16T17:06:50Z"
  scheduledOverridesSummary: min=0 time=2022-08-16 22:00:00 +0000 UTC

To Reproduce


1. Constantly shut down nodes, or delete runner pods directly with `kubectl delete`. Alternatively, just wait a few weeks or months while the cluster autoscaler scales in and out constantly.
2. Watch the list returned by the `ListRunners` endpoint grow to hundreds of thousands of entries (a sketch for checking the registration count is shown after this list).
3. The controller misbehaves: lots of runner pods stuck in Pending, runner pods not getting deleted, other RunnerDeployments starving without any replicas, and so on.
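For step 2, a minimal sketch of watching the registration count grow, assuming a PAT with `admin:org` scope in `GITHUB_TOKEN` and the go-github client (the org name, token env var, and library version are placeholders for illustration, not part of this report):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v45/github" // version is illustrative
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	// Assumption: GITHUB_TOKEN holds a PAT with admin:org scope.
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	// "MyOrg" is a placeholder. total_count includes offline runners, so a number
	// that keeps growing while the pool size stays constant indicates leaked registrations.
	runners, _, err := client.Actions.ListOrganizationRunners(ctx, "MyOrg", &github.ListOptions{PerPage: 1})
	if err != nil {
		panic(err)
	}
	fmt.Printf("registered runners: %d\n", runners.TotalCount)
}
```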

Describe the bug

ARC is “leaking” offline runners on GitHub in specific situations: for example, when the cluster autoscaler removes nodes that still have idle runners on them, those runners don’t “auto-remove” themselves from GitHub, and IMO deregistering them should be the controller’s responsibility. This was probably introduced in 0.22, since the release notes say that unnecessary “Remove Runner” API calls were removed.

The leak is visible on https://github.com/organizations/MyOrg/settings/actions/runners: the list grows a little every day until it starts affecting the controller, which has to execute the ListRunners function as part of the autoscaling algorithm (see here). It becomes a problem once thousands of offline runners have accumulated and the controller has to iterate over all of them (see here). In my case, I noticed the controller was not working correctly and there were more than 300 pages of offline runners listed for my organization. After I deleted all the offline runners, the controller started working again. Now I’m running a script at the end of the day to remove all the offline runners “leaked” by the controller during the day, to avoid this happening again; a sketch of such a cleanup script is shown below.
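For reference, a minimal sketch of that kind of nightly cleanup, again assuming a PAT with `admin:org` scope in `GITHUB_TOKEN` and go-github (the same client library ARC’s `ListRunners` uses); `MyOrg`, the env var name, and the library version are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v45/github" // version is illustrative
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	// Assumption: GITHUB_TOKEN holds a PAT with admin:org scope.
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))
	org := "MyOrg" // placeholder organization name

	// Collect the IDs of all offline runners first, so pagination isn't
	// disturbed by deletions happening mid-listing.
	var offline []int64
	opts := &github.ListOptions{PerPage: 100}
	for {
		runners, resp, err := client.Actions.ListOrganizationRunners(ctx, org, opts)
		if err != nil {
			panic(err)
		}
		for _, r := range runners.Runners {
			if r.GetStatus() == "offline" {
				offline = append(offline, r.GetID())
			}
		}
		if resp.NextPage == 0 {
			break
		}
		opts.Page = resp.NextPage
	}

	// Remove each leaked runner registration.
	for _, id := range offline {
		if _, err := client.Actions.RemoveOrganizationRunner(ctx, org, id); err != nil {
			fmt.Fprintf(os.Stderr, "failed to remove runner %d: %v\n", id, err)
		}
	}
	fmt.Printf("removed %d offline runners\n", len(offline))
}
```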

Describe the expected behavior

I would expect the controller to clean up all offline runners via the GitHub API in all situations. If that doesn’t happen, they accumulate until the controller itself is affected when executing the ListRunners function.

Controller Logs

I don't have controller logs from when this was happening, and now that I've deleted the offline runners it is working properly. I also noticed that the `ListRunners` function (https://github.com/actions-runner-controller/actions-runner-controller/blob/538e2783d7fde279b84f1ff9351bb1486823660b/github/github.go#L219-L244) doesn't log anything, so logs wouldn't be helpful for debugging this issue anyway.

Runner Pod Logs

https://gist.github.com/bmbferreira/0ddb1efe1fe8e0dcfe4615aec2c6150a

Additional Context

No response

About this issue

Most upvoted comments

@jbuettnerbild Hey! Thanks for the update. If the 24h automated ephemeral runner cleanup doesn’t work, I think something is going wrong on the GitHub side. Could you open a support ticket with GitHub so they can check their backend and see what’s happening?