gpu-operator: Unable to set the "4g.20gb" MIG profile via mig-manager.
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node? --> No, Ubuntu 20.04
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
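For completeness, each checklist item can be verified directly; a minimal sketch, assuming shell access to the GPU node and a working kubeconfig:

# kernel modules required by the driver container
lsmod | grep -E 'i2c_core|ipmi_msghandler'
# Kubernetes and Docker versions
kubectl version --short
docker version --format '{{.Server.Version}}'
# confirm the ClusterPolicy CRD is applied
kubectl describe clusterpolicies --all-namespaces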
1. Issue or feature description
Could not set the "4g.20gb" profile for an A100 40GB PCIe card. The mig-manager showed the following error when this profile was applied:
Unable to validate the selected MIG configuration
Restarting all GPU clients previously shutdown by reenabling their component-specific nodeSelector labels
node/a100-server-b not labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
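For more context on why validation failed, the mig-manager log and the node's MIG labels can be pulled; a minimal sketch using the pod and node names that appear in the listings below (names will differ in other clusters):

# full mig-manager log for the failed reconfiguration
kubectl logs -n gpu-operator-resources nvidia-mig-manager-vtbtr
# current MIG-related labels on the node
kubectl get node a100-server-b --show-labels | tr ',' '\n' | grep 'nvidia.com/mig'
# MIG state as seen by the driver (run inside the driver container)
kubectl exec -n gpu-operator-resources nvidia-driver-daemonset-8swtd -- nvidia-smi -L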
2. Steps to reproduce the issue
- Deployed the pre-release v1.8.2 operator.
- Added the custom profile (see the sketch after these steps for the full config layout and label change):
  a100-server-b-balanced:
    - devices: all
      mig-devices:
        4g.20gb: 1
      mig-enabled: true
- Changed the node's mig.config label to select this profile.
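As referenced above, a sketch of where such a custom profile sits in the mig-parted config that the mig-manager consumes. The node and profile names are taken from this report; the layout follows the standard mig-parted format (version plus a mig-configs map):

version: v1
mig-configs:
  a100-server-b-balanced:
    - devices: all
      mig-enabled: true
      mig-devices:
        4g.20gb: 1

And the label change that selects it:

kubectl label node a100-server-b nvidia.com/mig.config=a100-server-b-balanced --overwrite
# the mig-manager reports the outcome through this label ('success' or 'failed', as in the error above)
kubectl get node a100-server-b -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'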
3. Information to attach (optional if deemed irrelevant)
- kubernetes pods status:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-system cattle-cluster-agent-6d969d75b8-x6s8w 1/1 Running 7 78d
cattle-system cattle-node-agent-zbxjz 1/1 Running 7 78d
default gpu-operator-68c95b5679-6298n 1/1 Running 1 14d
default gpu-operator-node-feature-discovery-master-58d884d5cc-2r8dm 1/1 Running 0 36d
default gpu-operator-node-feature-discovery-worker-6zmd8 1/1 Running 0 36d
gpu-operator-resources gpu-feature-discovery-vd6th 1/1 Running 0 12m
gpu-operator-resources nvidia-container-toolkit-daemonset-xdfd9 1/1 Running 0 14d
gpu-operator-resources nvidia-cuda-validator-svvxf 0/1 Completed 0 12m
gpu-operator-resources nvidia-dcgm-exporter-t9hqd 1/1 Running 0 12m
gpu-operator-resources nvidia-dcgm-tvq4k 1/1 Running 0 12m
gpu-operator-resources nvidia-device-plugin-daemonset-nw8mk 1/1 Running 0 12m
gpu-operator-resources nvidia-device-plugin-validator-pv676 0/1 Completed 0 11m
gpu-operator-resources nvidia-driver-daemonset-8swtd 1/1 Running 0 14d
gpu-operator-resources nvidia-mig-manager-vtbtr 1/1 Running 0 14d
gpu-operator-resources nvidia-operator-validator-rqxq4 1/1 Running 0 12m
kube-system calico-kube-controllers-6949477b58-jsprr 1/1 Running 7 78d
kube-system calico-node-p8b2p 1/1 Running 7 78d
kube-system coredns-74ff55c5b-288sr 1/1 Running 7 78d
kube-system coredns-74ff55c5b-nm7vm 1/1 Running 7 78d
kube-system etcd-a100-server-b 1/1 Running 7 78d
kube-system kube-apiserver-a100-server-b 1/1 Running 8 78d
kube-system kube-controller-manager-a100-server-b 1/1 Running 7 78d
kube-system kube-proxy-jkdm4 1/1 Running 7 78d
kube-system kube-scheduler-a100-server-b 1/1 Running 7 78d
monitoring kube-prometheus-stack-grafana-5c5c84b568-4r96r 2/2 Running 12 77d
monitoring kube-prometheus-stack-kube-state-metrics-6f85498dd8-xsdzs 1/1 Running 6 77d
monitoring kube-prometheus-stack-operator-64c65d9fdd-9z7mn 1/1 Running 6 77d
monitoring kube-prometheus-stack-prometheus-node-exporter-lvlkw 1/1 Running 6 77d
monitoring prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 13 77d
- kubernetes daemonset status:
kubectl get ds --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
cattle-system cattle-node-agent 1 1 1 1 1 <none> 78d
default gpu-operator-node-feature-discovery-worker 1 1 1 1 1 <none> 36d
gpu-operator-resources gpu-feature-discovery 1 1 1 1 1 nvidia.com/gpu.deploy.gpu-feature-discovery=true 36d
gpu-operator-resources nvidia-container-toolkit-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.container-toolkit=true 36d
gpu-operator-resources nvidia-dcgm 1 1 1 1 1 nvidia.com/gpu.deploy.dcgm=true 36d
gpu-operator-resources nvidia-dcgm-exporter 1 1 1 1 1 nvidia.com/gpu.deploy.dcgm-exporter=true 36d
gpu-operator-resources nvidia-device-plugin-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.device-plugin=true 36d
gpu-operator-resources nvidia-driver-daemonset 1 1 1 1 1 nvidia.com/gpu.deploy.driver=true 36d
gpu-operator-resources nvidia-mig-manager 1 1 1 1 1 nvidia.com/gpu.deploy.mig-manager=true 36d
gpu-operator-resources nvidia-operator-validator 1 1 1 1 1 nvidia.com/gpu.deploy.operator-validator=true 36d
kube-system calico-node 1 1 1 1 1 kubernetes.io/os=linux 78d
kube-system kube-proxy 1 1 1 1 1 kubernetes.io/os=linux 78d
monitoring kube-prometheus-stack-prometheus-node-exporter 1 1 1 1 1 <none> 77d
- Docker configuration file:
cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "exec-opts": [
        "native.cgroupdriver=systemd"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime"
        },
        "nvidia-experimental": {
            "args": [],
            "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
        }
    },
    "storage-driver": "overlay2",
    "storage-opts": [
        "overlay2.override_kernel_check=true"
    ]
}
- Docker runtime configuration:
docker info | grep Runtime
Runtimes: nvidia nvidia-experimental runc
WARNING: No swap limit support
Default Runtime: nvidia
- NVIDIA shared directory:
ls -la /run/nvidia
total 12
drwxr-xr-x 4 root root 120 Sep 7 05:00 .
drwxr-xr-x 40 root root 1340 Sep 21 11:59 ..
drwxr-xr-x 1 root root 4096 Sep 7 05:00 driver
-rw-r--r-- 1 root root 8 Sep 7 05:00 nvidia-driver.pid
-rw-r--r-- 1 root root 8 Sep 7 04:59 toolkit.pid
drwxr-xr-x 2 root root 120 Sep 21 11:48 validations
- NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
total 8552
drwxr-xr-x 3 root root 4096 Sep 7 04:59 .
drwxr-xr-x 3 root root 4096 Sep 7 04:59 ..
drwxr-xr-x 3 root root 4096 Sep 7 04:59 .config
lrwxrwxrwx 1 root root 28 Sep 7 04:59 libnvidia-container.so.1 -> libnvidia-container.so.1.4.0
-rwxr-xr-x 1 root root 175120 Sep 7 04:59 libnvidia-container.so.1.4.0
-rwxr-xr-x 1 root root 154 Sep 7 04:59 nvidia-container-cli
-rwxr-xr-x 1 root root 43024 Sep 7 04:59 nvidia-container-cli.real
-rwxr-xr-x 1 root root 342 Sep 7 04:59 nvidia-container-runtime
-rwxr-xr-x 1 root root 429 Sep 7 04:59 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3991000 Sep 7 04:59 nvidia-container-runtime.experimental
lrwxrwxrwx 1 root root 24 Sep 7 04:59 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x 1 root root 2359384 Sep 7 04:59 nvidia-container-runtime.real
-rwxr-xr-x 1 root root 198 Sep 7 04:59 nvidia-container-toolkit
-rwxr-xr-x 1 root root 2147896 Sep 7 04:59 nvidia-container-toolkit.real
- NVIDIA driver directory:
ls -la /run/nvidia/driver
total 88
drwxr-xr-x 1 root root 4096 Sep 7 05:00 .
drwxr-xr-x 4 root root 120 Sep 7 05:00 ..
lrwxrwxrwx 1 root root 7 Jul 23 17:35 bin -> usr/bin
drwxr-xr-x 2 root root 4096 Apr 15 2020 boot
drwxr-xr-x 17 root root 4300 Sep 7 05:01 dev
-rwxr-xr-x 1 root root 0 Sep 7 05:00 .dockerenv
drwxr-xr-x 1 root root 4096 Sep 7 05:00 drivers
drwxr-xr-x 1 root root 4096 Sep 7 05:01 etc
drwxr-xr-x 2 root root 4096 Apr 15 2020 home
drwxr-xr-x 2 root root 4096 Sep 7 05:00 host-etc
lrwxrwxrwx 1 root root 7 Jul 23 17:35 lib -> usr/lib
lrwxrwxrwx 1 root root 9 Jul 23 17:35 lib32 -> usr/lib32
lrwxrwxrwx 1 root root 9 Jul 23 17:35 lib64 -> usr/lib64
lrwxrwxrwx 1 root root 10 Jul 23 17:35 libx32 -> usr/libx32
drwxr-xr-x 2 root root 4096 Jul 23 17:35 media
drwxr-xr-x 2 root root 4096 Jul 23 17:35 mnt
-rw-r--r-- 1 root root 16047 Aug 3 20:33 NGC-DL-CONTAINER-LICENSE
drwxr-xr-x 2 root root 4096 Jul 23 17:35 opt
dr-xr-xr-x 1279 root root 0 Aug 16 01:57 proc
drwx------ 2 root root 4096 Jul 23 17:38 root
drwxr-xr-x 1 root root 4096 Sep 7 05:01 run
lrwxrwxrwx 1 root root 8 Jul 23 17:35 sbin -> usr/sbin
drwxr-xr-x 2 root root 4096 Jul 23 17:35 srv
dr-xr-xr-x 13 root root 0 Sep 7 05:00 sys
drwxrwxrwt 1 root root 4096 Sep 7 05:01 tmp
drwxr-xr-x 1 root root 4096 Jul 23 17:35 usr
drwxr-xr-x 1 root root 4096 Jul 23 17:38 var
- kubelet logs
journalctl -u kubelet > kubelet.logs
If you need them, please let me know.
In general, the k8s-mig-manager does not automate the process of ensuring that it is "OK" to change the MIG configuration on the GPUs that it manages. It assumes that some higher-level entity is coordinating the setting of the mig.config label on the node when it is OK to do so.

At present, best practice before applying a MIG config change to any GPU is to:

…
- set the mig.config label to change the MIG config

We didn't want to be prescriptive and automate all of these steps in the k8s-mig-manager itself, because SRE teams typically already have their own process for doing 1, 2 and 5, and now they can just integrate the setting of the mig.config label into that process when desired.
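Purely as an illustration of the kind of external coordination described above (not the step list from the original comment), such a process might look roughly like the following, reusing the node and profile names from this report:

kubectl cordon a100-server-b
kubectl drain a100-server-b --ignore-daemonsets   # ensure no GPU workloads are running
kubectl label node a100-server-b nvidia.com/mig.config=a100-server-b-balanced --overwrite
# re-run until the state label reports 'success' (or 'failed')
kubectl get node a100-server-b -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
kubectl uncordon a100-server-b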