actions-runner-controller: arc-runner-set not scaling down intermittently | gha-runner-scale-set:0.6.1
Checks
- I’ve already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I’m sure my issue is not covered in the troubleshooting guide.
- I’m not using a custom entrypoint in my runner image
Controller Version
0.6.1
Helm Chart Version
0.6.1
CertManager Version
No response
Deployment Method
Helm
cert-manager installation
NA
Checks
- This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support)
- I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
- I’ve migrated to the workflow job webhook event (if you are using webhook-driven scaling)
Resource Definitions
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        # Intentionally kept a 10s sleep to make sure the istio-proxy container is up and running
        command: ["/bin/bash", "-c", "sleep 10 && /home/runner/run.sh"]
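For reference, the override above lives under the runner scale set's Helm values. A minimal values.yaml along these lines would reproduce the setup; the repository URL and secret name below are placeholders, not values taken from the actual deployment:

```yaml
# Sketch of values.yaml for the gha-runner-scale-set chart.
# Installed roughly as:
#   helm install arc-runner-set \
#     oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
#     -f values.yaml
githubConfigUrl: https://github.com/my-org/my-repo   # placeholder
githubConfigSecret: arc-github-secret                # placeholder; pre-created secret with GitHub credentials
# No minRunners/maxRunners are set, matching the report below.
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        # 10s sleep so the istio-proxy sidecar is ready before the runner starts
        command: ["/bin/bash", "-c", "sleep 10 && /home/runner/run.sh"]
```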
To Reproduce
I am trying out GitHub ARC.
1. I created a sample workflow with a single job that sleeps for 30 seconds and prints Hello World (a minimal sketch is shown after this list).
2. I set up the arc-gha-rs-controller and arc-runner-set properly and they are up and running.
3. The job runs as expected, but sometimes after the job completes the runner set does not scale down to 0 (intermittent).
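A minimal workflow matching step 1 would look roughly like this; the file path and the arc-runner-set label are assumptions (the label must match the Helm release name of the runner scale set):

```yaml
# .github/workflows/arc-test.yml (hypothetical path)
name: ARC scale-down test
on:
  workflow_dispatch:
jobs:
  hello:
    runs-on: arc-runner-set   # assumed runner scale set name
    steps:
      - name: Sleep and greet
        run: |
          sleep 30
          echo "Hello World"
```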
Describe the bug
When the job finishes, the runner scale set does not scale down to 0. This is intermittent, so I end up with a stale runner sitting in the cluster (I have not set any min or max runners). If I then trigger a new job, the stale runner does not pick it up and the job stays queued forever.
Describe the expected behavior
As soon as the job completes the runner should terminate.
Whole Controller Logs
https://gist.github.com/ChaitanyaAtchuta5/7692ccd1e35e4b6706f8a0f20a570aaf
Whole Runner Pod Logs
Logs when not working as expected (Runner pod not getting terminated when job completes)
https://gist.github.com/ChaitanyaAtchuta5/881715bfec200f42de97adebec44e926
Logs when working as expected (Runner pod gets terminated when job completes)
https://gist.github.com/ChaitanyaAtchuta5/c847653cb266dcc1b010a427767a2d51
Additional Context
No response
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 18 (5 by maintainers)
Thank you @ChaitanyaAtchuta5 this is very helpful, we’ll take a look!
@Link- I deleted the Helm deployments of gha-runner-scale-set-controller and gha-runner-scale-set and installed them again from scratch. Note:
After they installed without any errors and I made sure both the controller and listener pods were up and running, I triggered a workflow. The new runner pod came up and completed the job, then went into the Terminating state and was removed. Immediately afterwards a new runner pod went into the Init stage, even though no workflows were running in my repo at that point. After a few seconds that runner pod was also terminated. Everything looked okay so far. Later, I went ahead and re-ran the same workflow; the runner pod came up and completed the job as usual, but this time the runner pod was not terminated even after the job completed, which is not the expected behavior.
- Workflow status screenshot
- Runner scale set status from GitHub
- kubectl output
- Controller logs (from setup to hitting the issue): https://gist.github.com/ChaitanyaAtchuta5/b579e55b710b6e98f0760b70442cbd7a
- Listener logs (from setup to hitting the issue): https://gist.github.com/ChaitanyaAtchuta5/488ee8b8c1dbe09d9d40bedea243b859
- Logs from the runner pod that did not get terminated: https://gist.github.com/ChaitanyaAtchuta5/9bff813e8be3b6a4dab655115c9582dc