actions-runner-controller: Cannot pass nodeSelector, tolerations and resources in containerMode: kubernetes
Controller Version
0.25.2
Helm Chart Version
0.20.2
CertManager Version
1.9.1
Deployment Method
Helm
cert-manager installation
Cert-manager is installed using helmfile
helm/
  cert-manager/
    values.yaml
helmfile.yaml
Contents of values.yaml
installCRDs: true
Contents of helmfile.yaml
helmDefaults:
  createNamespace: true
  atomic: true
  verify: false
  wait: true
  timeout: 1200
  recreatePods: true
  disableValidation: true
repositories:
  - name: github
    url: https://actions-runner-controller.github.io/actions-runner-controller
  - name: "incubator"
    url: "https://charts.helm.sh/incubator"
  - name: jetstack
    url: https://charts.jetstack.io
templates:
  default: &default
    namespace: kube-system
    missingFileHandler: Warn
    values:
      - helm/{{`{{ .Release.Name }}`}}/values.yaml
    secrets:
      - helm/{{`{{ .Release.Name }}`}}/secrets.yaml
releases:
  - name: cert-manager
    <<: *default
    namespace: cert-manager
    chart: jetstack/cert-manager
    version: v1.9.1
Then install it by executing
helmfile apply
Checks
- This isn’t a question or user support case (for Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is so critical that you need priority support)
- I’ve read releasenotes before submitting this issue and I’m sure it’s not due to any recently-introduced backward-incompatible changes
- My actions-runner-controller version (v0.x.y) does support the feature
- I’ve already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn’t fix the issue
Resource Definitions
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-gpu-k8s
spec:
  replicas: 1
  template:
    spec:
      image: summerwind/actions-runner:latest
      nodeSelector:
        cloud.google.com/gke-gpu-partition-size: 1g.5gb
      tolerations:
        - effect: "NoSchedule"
          key: "konstellation.io/gpu"
          operator: "Equal"
          value: "true"
        - effect: "NoSchedule"
          key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
      serviceAccountName: "gh-runner-service-account"
      labels:
        - self-hosted
        - gpu
        - k8s
      repository: <organization/repository>
      containerMode: kubernetes
      dockerdWithinRunnerContainer: false
      dockerEnabled: false
      workVolumeClaimTemplate:
        storageClassName: "standard"
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
      resources:
        limits:
          nvidia.com/gpu: 1
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-autoscaler-gpu-k8s
spec:
  minReplicas: 0
  maxReplicas: 1
  scaleTargetRef:
    name: runner-gpu-k8s
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "5m"
To Reproduce
1. Deploy Role, RoleBinding, ServiceAccount
2. Deploy Controller
3. Deploy RunnerDeployment
4. Deploy HorizontalRunnerAutoscaler
5. Launch a GitHub Actions workflow using a GPU-based Docker image
Example workflow
name: GitHub Actions Example GPU - K8S
on: [push, workflow_dispatch]
jobs:
  Explore-GitHub-Actions:
    runs-on: ["self-hosted", "gpu", "k8s"]
    container:
      image: nvidia/cuda:11.0.3-base-ubi7
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
      - run: sleep 300 # For debugging purposes, the actual behavior expected is defined below
      - run: ls -ltR /usr/local/nvidia
      - run: /usr/local/nvidia/bin/nvidia-smi -L
      - run: /usr/local/nvidia/bin/nvidia-smi
Describe the bug
Once everything has been deployed, a runner pod is created on the GPU node. This pod has the correct:
- nodeSelector
- resources
  - Specifically, the needed limit nvidia.com/gpu: 1
- tolerations
After a while, the workflow is launched in a new pod, but this pod doesn’t contain any of the above fields in its manifest, so no GPU resources, binaries, etc. are mounted in the pod.
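The mismatch described above can be checked mechanically. The following is a minimal sketch, assuming both pod manifests have been exported as JSON (for example with `kubectl get pod <name> -o json`) and loaded into dictionaries. The function names and the sample manifests are hypothetical and only illustrate the comparison; they are not part of actions-runner-controller.

```python
from typing import Any

# Pod-level fields this issue is about.
SCHEDULING_FIELDS = ("nodeSelector", "tolerations")


def scheduling_summary(pod: dict[str, Any]) -> dict[str, Any]:
    """Collect nodeSelector, tolerations and per-container limits from a pod manifest."""
    spec = pod.get("spec", {})
    summary: dict[str, Any] = {name: spec.get(name) for name in SCHEDULING_FIELDS}
    # Resource limits are set per container, not on the pod spec itself.
    summary["limits"] = {
        container["name"]: container.get("resources", {}).get("limits")
        for container in spec.get("containers", [])
    }
    return summary


def missing_in_workflow_pod(
    runner_pod: dict[str, Any], workflow_pod: dict[str, Any]
) -> list[str]:
    """List the fields that are set on the runner pod but absent on the workflow pod."""
    runner = scheduling_summary(runner_pod)
    workflow = scheduling_summary(workflow_pod)
    missing = [f for f in SCHEDULING_FIELDS if runner[f] and not workflow[f]]
    if any(runner["limits"].values()) and not any(workflow["limits"].values()):
        missing.append("resources.limits")
    return missing


if __name__ == "__main__":
    # Hypothetical, trimmed-down manifests mirroring the situation in this issue.
    runner_pod = {
        "spec": {
            "nodeSelector": {"cloud.google.com/gke-gpu-partition-size": "1g.5gb"},
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Equal",
                 "value": "present", "effect": "NoSchedule"},
            ],
            "containers": [
                {"name": "runner", "resources": {"limits": {"nvidia.com/gpu": "1"}}},
            ],
        }
    }
    workflow_pod = {"spec": {"containers": [{"name": "job"}]}}
    print(missing_in_workflow_pod(runner_pod, workflow_pod))
```

With manifests like the ones in this report, the check would list all three fields as missing from the workflow pod.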
Describe the expected behavior
It is expected that the pod running the actual workflow inherits, or can be configured with, the following fields:
- nodeSelector
- resources
  - Specifically, the needed limit nvidia.com/gpu: 1
- tolerations
With these fields specified in the pod, it could be scheduled on a GPU-enabled node, and the GKE NVIDIA device plugin could read the limits and pass the GPU resources through to the workflow pod.
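To make the requested behavior concrete, here is a rough sketch of the "inherit" variant, written against plain pod-spec dictionaries. It is only an illustration of the expected result, not how actions-runner-controller or the container hooks are implemented. The function name `inherit_scheduling` and the container names `runner` and `job` are assumptions made for this example.

```python
import copy
from typing import Any


def inherit_scheduling(
    runner_spec: dict[str, Any],
    workflow_spec: dict[str, Any],
    runner_container: str = "runner",  # assumed name of the runner container
    job_container: str = "job",        # assumed name of the workflow job container
) -> dict[str, Any]:
    """Return a copy of workflow_spec with scheduling fields filled in from runner_spec.

    Fields already set on the workflow pod spec are left untouched.
    """
    merged = copy.deepcopy(workflow_spec)

    # Pod-level fields: copy only when the workflow pod does not set them.
    for field in ("nodeSelector", "tolerations"):
        if runner_spec.get(field) and not merged.get(field):
            merged[field] = copy.deepcopy(runner_spec[field])

    # Container-level resources: take them from the runner container, if any.
    resources = next(
        (
            c.get("resources")
            for c in runner_spec.get("containers", [])
            if c.get("name") == runner_container
        ),
        None,
    )
    if resources:
        for container in merged.get("containers", []):
            if container.get("name") == job_container and not container.get("resources"):
                container["resources"] = copy.deepcopy(resources)

    return merged
```

Whether the values should be inherited automatically like this or configured separately for the workflow pod is a design decision for the maintainers; either would cover the GPU use case described here.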
Controller Logs
2022-08-18T12:18:19Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 1 {"horizontalrunnerautoscaler": "github-gpu-k8s/runner-autoscaler-gpu-k8s", "suggested": 0, "reserved": 1, "min": 0, "max": 1}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7871fd26-9df4-4605-87e2-f02f7734d954", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7871fd26-9df4-4605-87e2-f02f7734d954", "allowed": true}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "b7df6cca-38ff-4d83-93df-3fb24c5d888f", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z INFO runnerdeployment-resource validate resource to be updated {"name": "runner-gpu-k8s"}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "b7df6cca-38ff-4d83-93df-3fb24c5d888f", "allowed": true}
2022-08-18T12:18:19Z DEBUG actions-runner-controller.horizontalrunnerautoscaler Calculated desired replicas of 1 {"horizontalrunnerautoscaler": "github-gpu-k8s/runner-autoscaler-gpu-k8s", "suggested": 0, "reserved": 1, "min": 0, "max": 1, "last_scale_up_time": "2022-08-18 12:18:19 +0000 UTC", "scale_down_delay_until": "2022-08-18T12:28:19Z"}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "935a04c9-b06f-44fd-9534-14480d8cb775", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "935a04c9-b06f-44fd-9534-14480d8cb775", "allowed": true}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "1043ad94-4bf1-4edf-b7ad-e28d0a698765", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "1043ad94-4bf1-4edf-b7ad-e28d0a698765", "allowed": true}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "UID": "7ed13cef-ac61-4ad9-b5d5-d30f15bfb7bc", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerDeployment", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerdeployments"}}
2022-08-18T12:18:19Z INFO runnerdeployment-resource validate resource to be updated {"name": "runner-gpu-k8s"}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerdeployment", "code": 200, "reason": "", "UID": "7ed13cef-ac61-4ad9-b5d5-d30f15bfb7bc", "allowed": true}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks received request {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "UID": "7926c086-c8a9-47df-993f-7ded60a637a7", "kind": "actions.summerwind.dev/v1alpha1, Kind=RunnerReplicaSet", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runnerreplicasets"}}
2022-08-18T12:18:19Z INFO runnerreplicaset-resource validate resource to be updated {"name": "runner-gpu-k8s-vzpr9"}
2022-08-18T12:18:19Z DEBUG controller-runtime.webhook.webhooks wrote response {"webhook": "/validate-actions-summerwind-dev-v1alpha1-runnerreplicaset", "code": 200, "reason": "", "UID": "7926c086-c8a9-47df-993f-7ded60a637a7", "allowed": true}
2022-08-18T12:18:19Z DEBUG actions-runner-controller.runnerdeployment Updated runnerreplicaset due to spec change {"runnerdeployment": "github-gpu-k8s/runner-gpu-k8s", "currentDesiredReplicas": 0, "newDesiredReplicas": 1, "currentEffectiveTime": "2022-08-18 12:18:19 +0000 UTC", "newEffectiveTime": "2022-08-18 12:18:19 +0000 UTC"}
2022-08-18T12:18:19Z DEBUG actions-runner-controller.runnerreplicaset Skipped reconcilation because owner is not synced yet {"runnerreplicaset": "github-gpu-k8s/runner-gpu-k8s-vzpr9", "owner": "github-gpu-k8s/runner-gpu-k8s-vzpr9-ctmc6", "pods": null}
### Runner Pod Logs
```shell
2022-08-18 12:34:01.768 DEBUG --- Github endpoint URL https://github.com/
2022-08-18 12:34:02.462 DEBUG --- Passing --ephemeral to config.sh to enable the ephemeral runner.
2022-08-18 12:34:02.466 DEBUG --- Configuring the runner.
--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------
# Authentication
√ Connected to GitHub
# Runner Registration
√ Runner successfully added
√ Runner connection is good
# Runner settings
√ Settings Saved.
2022-08-18 12:34:07.485 DEBUG --- Runner successfully configured.
{
"agentId": 159,
"agentName": "runner-gpu-k8s-gtrgf-q87wv",
"poolId": 1,
"poolName": "Default",
"ephemeral": true,
"serverUrl": "https://pipelines.actions.githubusercontent.com/WbUFHxHNMTckMYpByHhGXjzC31kmXmT97NzsUmBl9gVn3Gj7rj",
"gitHubUrl": "https://github.com/konstellation-io/arc-poc",
"workFolder": "/runner/_work"
2022-08-18 12:34:07.490 NOTICE --- Docker wait check skipped. Either Docker is disabled or the wait is disabled, continuing with entrypoint
}
√ Connected to GitHub
Current runner version: '2.295.0'
2022-08-18 12:34:09Z: Listening for Jobs
2022-08-18 12:34:14Z: Running job: Explore-GitHub-Actions
```
### Additional Context
https://github.com/actions-runner-controller/actions-runner-controller/pull/1546
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus-multi
About this issue
- State: open
- Created 2 years ago
- Reactions: 6
- Comments: 19 (1 by maintainers)
👍 here - we need to use tolerations and nodeSelectors to target ARM nodes for faster multi-platform Docker images. Currently dind works, but ideally it would be great to use this approach without privileged containers.
https://github.com/actions/runner-container-hooks/pull/75 solved the issue on the container-hooks side; now I guess this repository should implement its part.
I’ve taken a stab at implementing this guys -> https://github.com/actions/actions-runner-controller/pull/3174
Would be very grateful for input of a maintainer! 🙌
Hi @mumoshu, the workflow pod doesn’t have these values. The pod provided by the RunnerDeployment runs with the fields configured as expected, but the workflow pod that the hook launches doesn’t have the same fields. We need the workflow pod to run with these particular fields so that it is scheduled on a GPU node.