actions-runner-controller: Pods created by `RunnerSet` are terminated while still running jobs on a rolling update

Controller Version

0.24.0

Helm Chart Version

0.19.0

CertManager Version

1.4.1

Deployment Method

Helm

cert-manager installation

Deploying cert-manager Helm chart from https://charts.jetstack.io/

Checks

  • This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you need priority support)
  • I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue

Resource Definitions

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: clr-runner
  namespace: actions-runner-groups
spec:
  dockerdWithinRunnerContainer: false
  ephemeral: true
  labels:
  - clr-runner
  organization: color
  replicas: 1
  selector:
    matchLabels:
      app: clr-runner
  serviceName: clr-runner
  template:
    metadata:
      labels:
        app: clr-runner
    spec:
      containers:
      - image: 301643779712.dkr.ecr.us-east-1.amazonaws.com/color-actions-runner:master_c704c14c
        name: runner
        resources:
          limits:
            cpu: 8
            memory: 32Gi
          requests:
            cpu: 8
            memory: 32Gi
      - image: public.ecr.aws/docker/library/docker:dind
        name: docker
      securityContext:
        fsGroup: 1000
      serviceAccountName: actions-runner
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: clr-runner
  namespace: actions-runner-groups
spec:
  maxReplicas: 30
  metrics:
  - scaleDownFactor: "0.7"
    scaleDownThreshold: "0.2"
    scaleUpFactor: "2.5"
    scaleUpThreshold: "0.5"
    type: PercentageRunnersBusy
  minReplicas: 1
  scaleDownDelaySecondsAfterScaleOut: 3600
  scaleTargetRef:
    kind: RunnerSet
    name: clr-runner

To Reproduce

1. Change the `image` for the `runner` container in the manifest above (a sketch follows these steps)
2. Launch a workflow that starts running some jobs on the runners
3. `kubectl apply` the manifest to update the runners
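
For concreteness, step 1 is just a tag bump on the `runner` container (the new tag below is hypothetical):

```yaml
      containers:
      - image: 301643779712.dkr.ecr.us-east-1.amazonaws.com/color-actions-runner:master_00000000  # new (hypothetical) tag
        name: runner
```

Any change to the pod template triggers a rolling update of the StatefulSet-backed pods behind the RunnerSet.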

Describe the bug

All runner pods are restarted within a minute of the `kubectl apply`, including those still running jobs. The in-flight jobs are dropped and appear hung in the GitHub UI (they eventually time out).

Describe the expected behavior

Pods aren’t terminated until they’ve finished running their in-flight jobs.

Controller Logs

Will post in a separate comment to avoid #1533

Runner Pod Logs

N/A; the pods are deleted

Additional Context

No response

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 18 (10 by maintainers)

Most upvoted comments

Hey everyone! I have an update: #1759 should fix this.

In contrast to RunnerDeployment, RunnerSet-managed runner pods don’t have the same controller-side graceful termination logic. That doesn’t change in #1759.

However, you can now let the vanilla Kubernetes pod termination process gracefully stop runners: configure `RUNNER_GRACEFUL_STOP_TIMEOUT` and `terminationGracePeriodSeconds` appropriately. More information is in the updated README.
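
For a RunnerSet, that means setting both on the pod template. A minimal sketch of what this could look like on the manifest above (the timeout values are illustrative assumptions, not values taken from the README):

```yaml
spec:
  template:
    spec:
      # Upper bound Kubernetes waits between SIGTERM and SIGKILL; must exceed
      # RUNNER_GRACEFUL_STOP_TIMEOUT so the runner isn't killed mid-job.
      terminationGracePeriodSeconds: 3600
      containers:
      - name: runner
        env:
        # How long (in seconds) the runner waits for an in-flight job to
        # finish after receiving SIGTERM, before deregistering and exiting.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "3590"
```

The key constraint is that `terminationGracePeriodSeconds` comfortably exceeds `RUNNER_GRACEFUL_STOP_TIMEOUT`; otherwise the kubelet SIGKILLs the pod before the runner has finished stopping gracefully.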

If you’re interested in how it’s supposed to work, please read the new section in the updated README, and also https://github.com/actions-runner-controller/actions-runner-controller/issues/1581#issuecomment-1229616193.