actions-runner-controller: Cannot pass nodeSelector, tolerations and resources in containerMode: kubernetes

Controller Version

0.25.2

Helm Chart Version

0.20.2

CertManager Version

1.9.1

Deployment Method

Helm

cert-manager installation

Cert-manager is installed using helmfile

helm/
  cert-manager/
    values.yaml
helmfile.yaml

Contents of values.yaml

installCRDs: true

Contents of helmfile.yaml

helmDefaults:
  createNamespace: true
  atomic: true
  verify: false
  wait: true
  timeout: 1200
  recreatePods: true
  disableValidation: true

repositories:
  - name: github
    url: https://actions-runner-controller.github.io/actions-runner-controller
  - name: "incubator"
    url: "https://charts.helm.sh/incubator"
  - name: jetstack
    url: https://charts.jetstack.io

templates:
  default: &default
    namespace: kube-system
    missingFileHandler: Warn
    values:
    - helm/{{`{{ .Release.Name }}`}}/values.yaml
    secrets:
    - helm/{{`{{ .Release.Name }}`}}/secrets.yaml

releases:
  - name: cert-manager
    <<: *default
    namespace: cert-manager
    chart: jetstack/cert-manager
    version: v1.9.1

Then install it by executing

helmfile apply

Checks

  • This isn’t a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support)
  • I’ve read the release notes before submitting this issue and I’m sure it’s not due to any recently-introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-gpu-k8s
spec:
  replicas: 1
  template:
    spec:
      image: summerwind/actions-runner:latest
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
      tolerations:
        - effect: "NoSchedule"
          key: "konstellation.io/gpu"
          operator: "Equal"
          value: "true"
        - effect: "NoSchedule"
          key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
      serviceAccountName: "gh-runner-service-account"
      labels:
        - self-hosted
        - gpu
        - k8s
      repository: <organization/repository>
      containerMode: kubernetes
      dockerdWithinRunnerContainer: false
      dockerEnabled: false
      workVolumeClaimTemplate:
        storageClassName: "standard"
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-autoscaler-gpu-k8s
spec:
  minReplicas: 0
  maxReplicas: 1
  scaleTargetRef:
    name: runner-gpu-k8s
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "5m"

To Reproduce

1. Deploy Role, RoleBinding, ServiceAccount
2. Deploy Controller
3. Deploy RunnerDeployment
4. Deploy HorizontalRunnerAutoscaler
5. Launch a Github Actions Workflow using a GPU base Docker image
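The Role, RoleBinding and ServiceAccount from step 1 are not included in this report. A minimal sketch of what they could look like for containerMode: kubernetes is below; the object names are taken from the RunnerDeployment above, but the rule set (pods, pods/exec, pods/log, jobs, secrets) is an assumption based on what the container hooks generally need, so verify it against the ARC documentation for your version:

```yaml
# Hypothetical RBAC sketch for the service account referenced in the
# RunnerDeployment (gh-runner-service-account). Verbs/resources are an
# assumption; adjust to what runner-container-hooks requires in your setup.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gh-runner-service-account
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gh-runner-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec", "pods/log"]
    verbs: ["get", "list", "create"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gh-runner-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gh-runner-role
subjects:
  - kind: ServiceAccount
    name: gh-runner-service-account
```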


Example workflow


name: GitHub Actions Example GPU - K8S

on: [push, workflow_dispatch]

jobs:
  Explore-GitHub-Actions:
    runs-on: ["self-hosted", "gpu", "k8s"]
    container: 
      image: nvidia/cuda:11.0.3-base-ubi7
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
      - run: sleep 300 # For debugging purposes, the actual behavior expected is defined below
      - run: ls -ltR /usr/local/nvidia
      - run: /usr/local/nvidia/bin/nvidia-smi -L
      - run: /usr/local/nvidia/bin/nvidia-smi

Describe the bug

Once everything has been deployed, a runner pod is created on the GPU node. This pod has the correct:

  • nodeSelector
  • resources
    • Specifically, the needed limit nvidia.com/gpu: 1
  • tolerations

After a while, the workflow is launched in a new pod, but this pod doesn’t contain any of the above fields in its manifest, so I don’t have GPU resources, binaries, etc. mounted in the pod.

Describe the expected behavior

It is expected that the pod running the actual workflow inherits, or can be configured with, the following fields:

  • nodeSelector
  • resources
    • Specifically, the needed limit nvidia.com/gpu: 1
  • tolerations

These fields should be set on the workflow pod, allowing me to schedule it on a GPU-enabled node and to configure the resources so the GKE NVIDIA device plugin can read the limits and pass the GPU resources through to the workflow pod.

Controller Logs

2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 1        {"horizontalrunnerautoscaler": "github-gpu-k8s/runner-autoscaler-gpu-k8s", "suggested": 0, "reserved": 1, "min": 0, "max": 1}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7871fd26-9df4-4605-87e2-f02f7734d954", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7871fd26-9df4-4605-87e2-f02f7734d954", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b7df6cca-38ff-4d83-93df-3fb24c5d888f", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    INFO    runnerdeployment-resource       validate resource to be updated {"name": "runner-gpu-k8s"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "b7df6cca-38ff-4d83-93df-3fb24c5d888f", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.horizontalrunnerautoscaler    Calculated desired replicas of 1        {"horizontalrunnerautoscaler": "github-gpu-k8s/runner-autoscaler-gpu-k8s", "suggested": 0, "reserved": 1, "min": 0, "max": 1, "last_scale_up_time": "2022-08-18 12:18:19 +0000 UTC", "scale_down_delay_until": "2022-08-18T12:28:19Z"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "935a04c9-b06f-44fd-9534-14480d8cb775", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "935a04c9-b06f-44fd-9534-14480d8cb775", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "1043ad94-4bf1-4edf-b7ad-e28d0a698765", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "1043ad94-4bf1-4edf-b7ad-e28d0a698765", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7ed13cef-ac61-4ad9-b5d5-d30f15bfb7bc", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z    INFO    runnerdeployment-resource       validate resource to be updated {"name": "runner-gpu-k8s"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7ed13cef-ac61-4ad9-b5d5-d30f15bfb7bc", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "7926c086-c8a9-47df-993f-7ded60a637a7", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-08-18T12:18:19Z    INFO    runnerreplicaset-resource       validate resource to be updated {"name": "runner-gpu-k8s-vzpr9"}
2022-08-18T12:18:19Z    DEBUG   controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "7926c086-c8a9-47df-993f-7ded60a637a7", "allowed": true}
2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.runnerdeployment      Updated runnerreplicaset due to spec change     {"runnerdeployment": "github-gpu-k8s/runner-gpu-k8s", "currentDesiredReplicas": 0, "newDesiredReplicas": 1, "currentEffectiveTime": "2022-08-18 12:18:19 +0000 UTC", "newEffectiveTime": "2022-08-18 12:18:19 +0000 UTC"}
2022-08-18T12:18:19Z    DEBUG   actions-runner-controller.runnerreplicaset      Skipped reconcilation because owner is not synced yet   {"runnerreplicaset": "github-gpu-k8s/runner-gpu-k8s-vzpr9", "owner": "github-gpu-k8s/runner-gpu-k8s-vzpr9-ctmc6", "pods": null}


Runner Pod Logs
2022-08-18 12:34:01.768  DEBUG --- Github endpoint URL https://github.com/
2022-08-18 12:34:02.462  DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-08-18 12:34:02.466  DEBUG --- Configuring the runner.

--------------------------------------------------------------------------------
|        ____ _ _   _   _       _          _        _   _                      |
|       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
|      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
|      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
|       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
|                                                                              |
|                       Self-hosted runner registration                        |
|                                                                              |
--------------------------------------------------------------------------------

# Authentication


√ Connected to GitHub

# Runner Registration




√ Runner successfully added
√ Runner connection is good

# Runner settings


√ Settings Saved.

2022-08-18 12:34:07.485  DEBUG --- Runner successfully configured.
{
  "agentId": 159,
  "agentName": "runner-gpu-k8s-gtrgf-q87wv",
  "poolId": 1,
  "poolName": "Default",
  "ephemeral": true,
  "serverUrl": "https://pipelines.actions.githubusercontent.com/WbUFHxHNMTckMYpByHhGXjzC31kmXmT97NzsUmBl9gVn3Gj7rj",
  "gitHubUrl": "https://github.com/konstellation-io/arc-poc",
  "workFolder": "/runner/_work"
}
2022-08-18 12:34:07.490  NOTICE --- Docker wait check skipped. Either Docker is disabled or the wait is disabled, continuing with entrypoint
√ Connected to GitHub

Current runner version: '2.295.0'
2022-08-18 12:34:09Z: Listening for Jobs
2022-08-18 12:34:14Z: Running job: Explore-GitHub-Actions


Additional Context

https://github.com/actions-runner-controller/actions-runner-controller/pull/1546
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 6
  • Comments: 19 (1 by maintainers)

Most upvoted comments

👍 here - we need to use tolerations and nodeSelectors to target ARM nodes for faster multi-platform Docker images. Currently dind works, but ideally it would be great to use this approach without privileged containers.

https://github.com/actions/runner-container-hooks/pull/75 solved the issue on the container-hooks side; now I guess this repository should implement its part.
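For reference, the extension mechanism added on the container-hooks side accepts a pod template that is merged into the job pod spec. A hedged sketch of what such a template could look like for the GPU use case in this issue follows; the PodTemplate shape, the env-var wiring, and the `$job` container selector are assumptions based on the runner-container-hooks extension design, not something verified against this setup:

```yaml
# Hypothetical hook extension template, assumed to be mounted into the
# runner pod and referenced via an env var such as
# ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE. Verify the exact mechanism
# against the runner-container-hooks docs for your version.
apiVersion: v1
kind: PodTemplate
metadata:
  labels: {}
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-partition-size: 1g.5gb
  tolerations:
    - effect: "NoSchedule"
      key: "nvidia.com/gpu"
      operator: "Equal"
      value: "present"
  containers:
    - name: "$job"   # "$job" is assumed to target the job container
      resources:
        limits:
          nvidia.com/gpu: 1
```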

I’ve taken a stab at implementing this, guys -> https://github.com/actions/actions-runner-controller/pull/3174

Would be very grateful for input of a maintainer! 🙌

@aacecandev Hey! Are you saying that the job pod created when you’re using the kubernetes container mode, in addition to the runner pod, is missing those fields? Can I take it that the runner pod has all the expected fields but the job pod does not?

Hi @mumoshu, the workflow pod doesn’t have these values. The pod provided by the RunnerDeployment is executing with the fields configured as expected, but the workflow pod that the hook launches doesn’t have the same fields. We need the workflow pod to run with these particular fields so we can schedule it on a GPU node.