actions-runner-controller: Using containerMode kubernetes causes random step failures
Checks
- I’ve already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I’m sure my issue is not covered in the troubleshooting guide.
- I’m not using a custom entrypoint in my runner image
Controller Version
0.4.0
Helm Chart Version
0.4.0
CertManager Version
N/A
Deployment Method
Helm
cert-manager installation
cert-manager not required
Checks
- This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support)
- I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
- I’ve migrated to the workflow job webhook event (if you are using webhook-driven scaling)
Resource Definitions
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "default"
    resources:
      requests:
        storage: 16Gi
template:
  spec:
    restartPolicy: Never
    nodeSelector:
      kubernetes.io/os: linux
    initContainers:
      - name: init-k8s-volume-permissions
        image: ghcr.io/actions/actions-runner:latest
        command: ["sudo", "chown", "-R", "runner", "/home/runner/_work"]
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            cpu: "1.3"
To Reproduce
So far this only occurs when running a large workflow. The failing workflow runs up to 20 parallel jobs, each comprising about 9 steps. A few dozen jobs run successfully, but some fail at a random step with the following error:
Run '/home/runner/k8s/index.js'
node:internal/process/promises:279
triggerUncaughtException(err, true /* fromPromise */);
^
[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
code: 'ERR_UNHANDLED_REJECTION'
}
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
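For context, the ERR_UNHANDLED_REJECTION above is what Node.js prints when a promise rejects and nothing handles the rejection; below is a minimal sketch of how a transiently failing Kubernetes API call inside the hook could surface this way (hypothetical code, not the actual /home/runner/k8s/index.js implementation):

// Hypothetical sketch: an async call that rejects (e.g. a timed-out or
// rate-limited Kubernetes API request) with no catch handler makes Node.js
// abort with ERR_UNHANDLED_REJECTION, which the runner then reports as
// "Process completed with exit code 1".
async function createJobPod(): Promise<void> {
  // Stand-in for a Kubernetes API request that fails transiently.
  throw new Error("connect ETIMEDOUT 10.0.0.1:443");
}

// The returned promise is neither awaited nor given a .catch(), so the
// rejection goes unhandled and the process crashes.
void createJobPod();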
Describe the bug
Some jobs abort at seemingly random steps. All of them fail with the error listed under To Reproduce.
Describe the expected behavior
The workflow has executed successfully dozens of times when using containerMode dind; I expect it to also execute reliably when using containerMode kubernetes.
Whole Controller Logs
Will update this after the next failing run, together with the runner pod log of a failing job.
Whole Runner Pod Logs
Will try to extract the logs of the runner pod running a job that fails, but since the pods disappear immediately after failure, I will need to stream all runner pod logs to a file and then filter them for a failing pod.
Additional Context
Note that I had to add the init container that fixes the permissions on the kubernetesModeWorkVolumeClaim PV, because it is provisioned by Azure as an empty filesystem owned by root:root while the runner runs as the runner user. Without it, the runner pod itself immediately fails with an error that it cannot write to the _work folder.
This issue might actually be in https://github.com/actions/runner-container-hooks; if desired, I’m happy to create a linked issue there.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (7 by maintainers)
I will close this issue here since it is not related to ARC. I have created a feature request in the container hooks repository to implement retries.
Downgraded to the AKS Free tier again and I’m immediately seeing more of those random UnhandledPromiseRejection failures.
So my hunch is that this comes down to rate limiting on the K8s API combined with the k8s hooks having no back-off/retry mechanism. Even on the paid AKS tier there is the occasional timeout connecting to the K8s API (Error: Error: connect ETIMEDOUT 10.0.0.1:443), but that happens rarely (fewer than 1 in 100 jobs); still, some form of retry would make this all a whole lot more reliable.
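To illustrate the kind of retry the feature request asks for, here is a minimal back-off sketch, assuming the hook’s Kubernetes calls can be wrapped in a helper; withRetry and the wrapped createNamespacedPod call are illustrative names, not the hook’s actual API:

// Minimal retry-with-exponential-back-off sketch (hypothetical, not the
// actual runner-container-hooks implementation).
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Wait 0.5s, 1s, 2s, 4s, ... before the next attempt, giving a
      // rate-limited or briefly unreachable API server time to recover.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastErr;
}

// Hypothetical usage: wrap the hook's pod-creation call so transient
// ETIMEDOUT or 429 (rate limit) responses are retried instead of
// surfacing as an unhandled rejection.
// await withRetry(() => k8sApi.createNamespacedPod(namespace, podSpec));

Even a handful of attempts like this would absorb the sporadic API-server timeouts I’m seeing on the Free tier.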