actions-runner-controller: Using containerMode kubernetes causes random step failures

Controller Version

0.4.0

Helm Chart Version

0.4.0

CertManager Version

N/A

Deployment Method

Helm

cert-manager installation

cert-manager not required

Checks

  • This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical enough that you need priority support)
  • I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
  • I’ve migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "default"
    resources:
      requests:
        storage: 16Gi

template:
  spec:
    restartPolicy: Never
    nodeSelector:
      kubernetes.io/os: linux
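    # chown the Azure-provisioned work volume so the non-root runner user can write to it (see Additional Context)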
    initContainers:
    - name: init-k8s-volume-permissions
      image: ghcr.io/actions/actions-runner:latest
      command: ["sudo", "chown", "-R", "runner", "/home/runner/_work"]
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      resources:
        requests:
          cpu: "1.3"

To Reproduce

So far this only occurs when running a large workflow. The failing workflow runs up to 20 parallel jobs, each comprising about 9 steps. A few dozen jobs run successfully, but some fail at a random step with the following error:

Run '/home/runner/k8s/index.js'
node:internal/process/promises:279
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Describe the bug

Some jobs abort at seemingly random steps. All of them fail with the error listed under “To Reproduce”.

Describe the expected behavior

The workflow has executed successfully dozens of times when using containerMode dind.

I expect it to also execute reliably when using containerMode kubernetes.

Whole Controller Logs

Will update this after the next failing run, together with the runner pod log of a failing job.

Whole Runner Pod Logs

Will try to extract the logs of the runner pod running a job that fails, but since the pods disappear immediately after failure, I will need to stream all runner pod logs to a file and then filter them for a failing pod.

Additional Context

Note that I had to add the initContainer that fixes the permissions on the kubernetesModeWorkVolumeClaim PV, because it is provisioned by Azure as an empty filesystem owned by root:root while the runner runs as the runner user. Without it, the runner pod itself immediately fails with an error that it cannot write to the _work folder.

This issue might actually belong in https://github.com/actions/runner-container-hooks; if desired, I’m happy to create a linked issue there.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

I will close this issue here since it is not related to ARC. I have created a feature request in the container hook repository to implement retries.

Downgraded to the AKS Free tier again and I’m immediately seeing more random failures (those UnhandledPromiseRejection ones).

So my hunch is that this has to do with rate limiting on the Kubernetes API and the k8s hooks having no back-off/retry mechanism. Even on the paid AKS tier there is the occasional timeout connecting to the Kubernetes API (Error: Error: connect ETIMEDOUT 10.0.0.1:443), but that happens rarely (fewer than 1 in 100 jobs); still, some form of retry might make this all a whole lot more reliable.
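For illustration, here is a minimal TypeScript sketch of the kind of back-off/retry wrapper the hook could apply around its Kubernetes API calls. The withRetry helper, the delay values, and the wrapped call in the comment are assumptions made for this sketch, not the hook’s actual implementation.

async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential back-off before the next attempt: 500 ms, 1 s, 2 s, 4 s, ...
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // All attempts failed: rethrow so the caller handles the error instead of
  // letting it surface as an unhandled promise rejection.
  throw lastError;
}

// Hypothetical usage: wrap a Kubernetes API call that may be rate limited or time out.
// await withRetry(() => k8sApi.createNamespacedPod(namespace, podSpec));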