actions-runner-controller: Scaling runners based on webhooks sometimes gets stuck
Checks
- I’ve already read https://github.com/actions-runner-controller/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I’m sure my issue is not covered in the troubleshooting guide.
- I’m not using a custom entrypoint in my runner image
Controller Version
0.26.0
Helm Chart Version
0.21.1
CertManager Version
v1.10.1
Deployment Method
Helm
cert-manager installation
source: https://charts.jetstack.io

Values:
installCRDs: true
podDnsPolicy: 'None'
podDnsConfig:
  nameservers:
    - '1.1.1.1'
    - '8.8.8.8'
Standard helm upgrade --install
Checks
- This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support)
- I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
- I’ve migrated to the workflow job webhook event (if you’re using webhook-driven scaling)
Resource Definitions
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: small-gha-runner
spec:
  template:
    spec:
      nodeSelector:
        kubernetes.azure.com/agentpool: "fourvcpueph"
      image: {{ .Values.image.repository }}/{{ .Values.image.name }}:{{ .Values.image.tag }}
      imagePullPolicy: {{ .Values.image.imagePullPolicy }}
      group: {{ .Values.github.runnersGroup }}
      organization: {{ .Values.github.organization }}
      labels:
        - small-gha-runner
        - ubuntu-latest-small
      resources:
        limits:
          memory: 5Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-autoscaler
spec:
  scaleDownDelaySecondsAfterScaleOut: 30
  minReplicas: 1
  maxReplicas: 20
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: '10m'
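As context (not stated in the original issue): with a workflowJob scale-up trigger, ARC's github-webhook-server reacts to workflow_job events by patching capacity reservations into the HRA spec, and each reservation expires after the configured duration. A rough sketch of what such a patched spec can look like, with purely illustrative values:

# Illustrative only: roughly how the webhook server records a scale-up
# trigger on the HRA after a queued workflow_job event.
spec:
  minReplicas: 1
  maxReplicas: 20
  capacityReservations:
    - expirationTime: "2022-12-06T10:10:00Z"   # event time + duration (10m)
      replicas: 1                              # one runner per queued job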
To Reproduce
1. Define several GitHub workflows with triggers set on Pull Requests.
2. Ask developers to start working.
3. Observe the situation when many workflows are triggered and new commits are pushed. We also have:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
set in the workflows, but the problem does not seem to occur only when such a cancellation takes place for a particular branch (a minimal workflow sketch follows below).
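For illustration (not from the original report), a minimal workflow of the kind described in the steps above might look like this. Only the pull_request trigger and the concurrency block come from the issue; the workflow name, job, and runs-on label are assumptions based on the RunnerDeployment labels:

# Hypothetical minimal workflow matching the reproduction steps.
name: pr-checks            # assumed name
on:
  pull_request:            # step 1: trigger on Pull Requests
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build:
    runs-on: [small-gha-runner]   # label from the RunnerDeployment above
    steps:
      - uses: actions/checkout@v3
      - run: echo "building"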
Describe the bug
Scaling sometimes gets stuck for no visible reason.
Editing the horizontalrunnerautoscalers.actions.summerwind.dev resource and forcing minReplicas to be higher than the current value also has no effect during that period.
There are no errors in any ARC component.
Jobs are left pending in the Queued state for up to an hour during the day.
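Not part of the original report, but one way to narrow this down is to compare the HRA's spec against its status while scaling is stuck. The field names below are from the actions.summerwind.dev/v1alpha1 API; the values are illustrative of the stuck state described above:

# Illustrative HRA state while scaling appears stuck: the manually
# edited spec.minReplicas is not reflected in status.desiredReplicas.
spec:
  minReplicas: 5                 # forced higher by hand...
status:
  desiredReplicas: 1             # ...but the controller keeps the old value
  lastSuccessfulScaleOutTime: "2022-12-06T09:00:00Z"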
Describe the expected behavior
Runners are scaled up/down based on the number of queued jobs in GitHub workflows.
Whole Controller Logs
https://gist.github.com/damyan90/7ffacb6f48ae10f13fd5cf168da142ac
Whole Runner Pod Logs
Not really relevant, but here’s an example:
https://gist.github.com/damyan90/567979a84cbbd1210ad1ac423e7bac38
Additional Context
Webhook delivery: https://user-images.githubusercontent.com/24733538/205882331-086724a1-48b7-4e3f-ae77-23d35f959d02.png - 100% successful.
Runner image definition:
FROM summerwind/actions-runner:latest
RUN lsb_release -ra
ENV DIR=opt
COPY apt.sh /$DIR/
RUN sudo chmod +x /$DIR/apt.sh && sudo sh /$DIR/apt.sh
COPY azure.sh /$DIR/
RUN sudo chmod +x /$DIR/azure.sh && sudo sh /$DIR/azure.sh
COPY software.sh software.json /$DIR/
RUN cd $DIR && sudo chmod +x /$DIR/software.sh && sudo sh /$DIR/software.sh
COPY cleanup.sh /$DIR/
RUN sudo chmod +x /$DIR/cleanup.sh && sudo sh /$DIR/cleanup.sh
RUN sudo apt-get update && sudo apt-get dist-upgrade -y
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 16 (2 by maintainers)
I’m experiencing a very similar issue. In my case I have
minReplicas: 0
and I’m trying to scale up from 0, since these particular runners are quite large and we don’t want to keep idle ones running.
A workflow is queued, the webhook is sent from GitHub, and the github-webhook-server receives the request and tries to patch the HRA; see the logs below:
It says
"before": 0
, but in fact there is already a Runner (pod) up and running a job, and no new Runner/pod is created. So, in the end, all subsequent attempts to scale the RunnerDeployment fail until the job finishes on the existing “unrecognized” Runner.
@mumoshu, any ideas what can be the root cause?
Thanks in advance!
@mumoshu I installed it about a week ago on our production env, actually (we are in the middle of moving to GitHub Actions, so it’s fine). I managed to spin up 52 runners for 52 jobs that came from 10 pull requests; minReplicas was 1 and maxReplicas was 80, and it looks good so far!
I will try it with scaling from zero and report back.
Not sure. I have 2 replicas for the webhook server as well as for the controller and it works fine for now. I don’t see too many issues, but I’m also not downscaling to 0, so it might be that I’m just hiding the issue a bit. I’m waiting for new versions though. There’s been some development together with GitHub on the autoscaling front, for now available only to some beta testers. I suppose these issues were addressed there too.
I have the same issue. And ARC in webhook mode is impossible to use with matrix builds at all. I am considering backing the webhook-driven scaling with a second, pull-based controller, so I can get the best of both worlds (a rough sketch of the pull-based side is below). I wonder if anyone has tried that.
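For reference, a pull-based HRA of the kind this comment alludes to would poll GitHub instead of relying on webhooks. A rough sketch using ARC's PercentageRunnersBusy metric follows; the resource name and thresholds are made up, and whether such an autoscaler can coexist with a webhook-driven one on the same RunnerDeployment is exactly the open question here:

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: small-gha-runner-pull-autoscaler   # hypothetical name
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: small-gha-runner
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'     # scale up when >75% of runners are busy
      scaleDownThreshold: '0.25'
      scaleUpFactor: '2'
      scaleDownFactor: '0.5'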
I have the same problem. What I’m seeing, though, is that for matrix jobs GitHub is not dispatching an event for each matrix item, which would be more of a GitHub platform issue (see the hypothetical matrix sketch below).
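To make the matrix point concrete: a strategy like the hypothetical one below expands into three separate jobs, so three workflow_job events would normally be expected, one per matrix item:

# Hypothetical matrix: expands to three jobs, so three
# workflow_job webhook events would be expected.
jobs:
  test:
    runs-on: [small-gha-runner]
    strategy:
      matrix:
        node: [14, 16, 18]
    steps:
      - run: echo "testing on node ${{ matrix.node }}"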
Just did! That’s awesome! 😉