actions-runner-controller: Scaling runners based on webhooks sometimes gets stuck

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

v1.10.1

Deployment Method

Helm

cert-manager installation

source: https://charts.jetstack.io

Values:

installCRDs: true
podDnsPolicy: 'None'
podDnsConfig:
  nameservers:
    - '1.1.1.1'
    - '8.8.8.8'

Standard helm upgrade --install

Checks

  • This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: small-gha-runner
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: "fourvcpueph"
      image: {{ .Values.image.repository }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
      imagePullPolicy: {{ .Values.image.imagePullPolicy }}
      group: {{ .Values.github.runnersGroup }}
      organization: {{ .Values.github.organization }}
      labels:
        - small-gha-runner
        - ubuntu-latest-small
      resources:
        limits:
          memory: 5Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-autoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 30
  minReplicas: 1
  maxReplicas: 20
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: '10m'

To Reproduce

1. Define several GitHub workflows triggered on pull requests.
2. Ask developers to start working.
3. Observe the situation when many workflows are triggered and new commits are pushed. We also have:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

set in the workflows, but the problem does not seem to occur only when such a cancellation takes place for a particular branch (a minimal example workflow is sketched below).
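To make the reproduction concrete, here is a minimal sketch of the kind of workflow involved, assuming the runner label defined in the RunnerDeployment above; the job body is a placeholder, not the actual workflow:

# Illustrative workflow only: pull_request trigger, the concurrency settings
# quoted above, and runs-on targeting the self-hosted runner label from the
# RunnerDeployment (small-gha-runner). The build step is a stub.
name: pr-checks
on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: [self-hosted, small-gha-runner]
    steps:
      - uses: actions/checkout@v3
      - run: echo "placeholder build step"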

Describe the bug

Scaling sometimes gets stuck for no visible reason. Editing the horizontalrunnerautoscalers.actions.summerwind.dev resource and forcing minReplicas higher than the current value also has no effect during that period.
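For reference, the manual workaround attempted during such a period amounts to editing the HRA defined above into something like the following fragment; the raised value is hypothetical:

# Fragment of the edited small-gha-runner-autoscaler spec: minReplicas raised
# above the current replica count as a manual workaround. The value 5 is
# illustrative; the edit has no visible effect while scaling is stuck.
spec:
  minReplicas: 5
  maxReplicas: 20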

There are no errors in any ARC component.

Jobs stay pending in the Queued state for up to an hour during the day.

Describe the expected behavior

Runners are scaled up and down based on the number of queued jobs in GitHub workflows.

Whole Controller Logs

https://gist.github.com/damyan90/7ffacb6f48ae10f13fd5cf168da142ac

Whole Runner Pod Logs

Not really relevant, but here's an example:
https://gist.github.com/damyan90/567979a84cbbd1210ad1ac423e7bac38

Additional Context

Screenshots attached: 2022-12-06_10-20, 2022-12-06_10-40, 2022-12-06_10-40_1, 2022-12-06_10-40_2.

Webhook delivery: https://user-images.githubusercontent.com/24733538/205882331-086724a1-48b7-4e3f-ae77-23d35f959d02.png - 100% successful.

Runner image definition (Dockerfile):

FROM summerwind/actions-runner:latest

RUN lsb_release -ra

ENV DIR=opt
COPY apt.sh /$DIR/
RUN sudo chmod +x /$DIR/apt.sh && sudo sh /$DIR/apt.sh 

COPY azure.sh /$DIR/
RUN sudo chmod +x /$DIR/azure.sh && sudo sh /$DIR/azure.sh

COPY software.sh software.json /$DIR/
RUN cd $DIR && sudo chmod +x /$DIR/software.sh && sudo sh /$DIR/software.sh

COPY cleanup.sh /$DIR/
RUN sudo chmod +x /$DIR/cleanup.sh && sudo sh /$DIR/cleanup.sh

RUN sudo apt-get update && sudo apt-get dist-upgrade -y

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 7
  • Comments: 16 (2 by maintainers)

Most upvoted comments

I'm experiencing a very similar issue. In my case I have minReplicas: 0 and I'm trying to scale up from 0, since the particular runners we need are quite large and we don't want to keep idle ones running.
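A minimal sketch of that scale-from-zero setup, reusing the resource names from the issue above purely for illustration; only minReplicas differs from the reporter's HRA:

# Illustrative scale-from-zero HRA. With minReplicas: 0 there are no idle
# runners, so every job depends on a workflow_job webhook producing a capacity
# reservation; a missed or mis-counted reservation leaves the job queued.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-autoscaler
spec:
  minReplicas: 0
  maxReplicas: 20
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: '10m'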

We are experiencing a very similar issue.

A workflow is queued, the webhook is sent from GitHub, and github-webhook-server receives the request and tries to patch the HRA. See the log below:

2023-03-03T14:01:16Z	DEBUG	controllers.webhookbasedautoscaler	Patching hra my-failing-hra for capacityReservations update	{"before": 0, "expired": -1, "added": 1, "completed": 0, "after": 1}

It says "before": 0, but in fact there is already a runner (pod) up and running a job, and a new runner pod is not being created.

So, in the end, all subsequent attempts to scale the RunnerDeployment fail until the job finishes on the existing "unrecognized" runner.

@mumoshu, any ideas what the root cause could be?

Thanks in advance!
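For anyone comparing against their own cluster: the before/after numbers in that log line appear to count entries in the HRA's spec.capacityReservations rather than running pods. Under that reading, the patched HRA spec right after the log line would contain something like the fragment below; the timestamp is illustrative (log time plus an assumed 10m trigger duration), and the field names come from the actions.summerwind.dev/v1alpha1 CRD, so double-check them against your chart version:

# Sketch of the HRA spec right after the patch logged above, assuming the
# before/after numbers count capacityReservations entries.
spec:
  capacityReservations:
    - expirationTime: "2023-03-03T14:11:16Z"
      replicas: 1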

@mumoshu I actually installed it about a week ago in our production environment (we are in the middle of moving to GitHub Actions, so it's fine). I managed to spin up 52 runners for 52 jobs coming from 10 pull requests, with minReplicas at 1 and maxReplicas at 80, and it looks good so far!

I will try it with scaling from zero and report back.

Not sure. I have 2 replicas for the webhook server as well as for the controller, and it works fine for now. I don't see too many issues, but I'm also not scaling down to 0, so it might be that I'm just hiding the issue a bit. I'm waiting for the new versions, though. There has been some development together with GitHub on the autoscaling front, for now available only to some beta testers; I suppose these issues were addressed there too.

I have the same issue, and ARC in webhook mode is impossible to use with matrix builds at all. I am considering backing the webhook with a second pull-based controller so I can get the best of both worlds; I wonder if anyone has tried that.
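For the pull-based half of that idea, here is a sketch of what a purely pull-driven autoscaler could look like using ARC's PercentageRunnersBusy metric; it is shown standalone rather than literally as a second HRA next to the webhook-driven one, and the autoscaler name, thresholds, and factors are illustrative:

# Illustrative pull-based HRA: the controller polls runner busyness and scales
# the RunnerDeployment by the given factors when the thresholds are crossed.
# Values below are examples, not recommendations.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-pull-autoscaler
spec:
  minReplicas: 1
  maxReplicas: 20
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'
      scaleDownThreshold: '0.25'
      scaleUpFactor: '2'
      scaleDownFactor: '0.5'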

I have the same problem. What I'm seeing, though, is that for matrix jobs GitHub is not dispatching an event for each matrix item, which would be more of a GitHub platform issue.

Just did! That’s awesome! 😉