k8s-device-plugin: Resource type labelling is incomplete/incorrect
Hi,
I am using nvidia-device-plugin v0.7.0, gpu-feature-discovery v0.4.1, and Kubernetes v1.20.2.
On my A100 GPU machine:
nvidia-docker version
NVIDIA Docker: 2.6.0
nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances:                                                  |
| GPU     GPU       Name             Profile   Instance   Placement  |
|       Instance                       ID        ID       Start:Size |
|         ID                                                          |
|====================================================================|
|   0      13       MIG 1g.5gb          0         0          0:1     |
+--------------------------------------------------------------------+
|   0       5       MIG 2g.10gb         1         0          0:2     |
+--------------------------------------------------------------------+
|   0       1       MIG 3g.20gb         2         0          0:3     |
+--------------------------------------------------------------------+
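For completeness, the MIG devices should also be visible on the host via:

nvidia-smi -L
# lists GPU 0 (A100-PCIE-40GB) plus one "MIG <profile> Device N" entry
# per compute instance created above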
cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "10"
  },
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
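A quick check like the one below should show nvidia as the default runtime (output trimmed to the relevant lines, roughly):

docker info | grep -i runtime
#  Runtimes: ... nvidia runc
#  Default Runtime: nvidia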
I was able to successfully deploy the gpu-feature-discovery pods as well as the nvidia-device-plugin pods.
kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-6rdc8 1/1 Running 0 24m
nfd-master-6dd87d999-4xlrw 1/1 Running 0 24m
nfd-worker-rrwms 1/1 Running 0 24m
nvidia-device-plugin-fsq2j 1/1 Running 0 24m
nvidiagpubeat-59jd5 1/1 Running 0 88m
Labels have been applied for MIG strategy mixed on my A100 GPU node, as shown below:
kubectl get node GPU_NODE -o yaml
labels:
....
...
nvidia.com/cuda.driver.major: "450"
nvidia.com/cuda.driver.minor: "80"
nvidia.com/cuda.driver.rev: "02"
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "0"
nvidia.com/gfd.timestamp: "1626160093"
nvidia.com/gpu.compute.major: "8"
nvidia.com/gpu.compute.minor: "0"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.family: ampere
nvidia.com/gpu.machine: ProLiant-DL380-Gen10
nvidia.com/gpu.memory: "40537"
nvidia.com/gpu.product: A100-PCIE-40GB
nvidia.com/mig-1g.5gb.count: "1"
nvidia.com/mig-1g.5gb.engines.copy: "1"
nvidia.com/mig-1g.5gb.engines.decoder: "0"
nvidia.com/mig-1g.5gb.engines.encoder: "0"
nvidia.com/mig-1g.5gb.engines.jpeg: "0"
nvidia.com/mig-1g.5gb.engines.ofa: "0"
nvidia.com/mig-1g.5gb.memory: "4864"
nvidia.com/mig-1g.5gb.multiprocessors: "14"
nvidia.com/mig-1g.5gb.slices.ci: "1"
nvidia.com/mig-1g.5gb.slices.gi: "1"
nvidia.com/mig-2g.10gb.count: "1"
nvidia.com/mig-2g.10gb.engines.copy: "2"
nvidia.com/mig-2g.10gb.engines.decoder: "1"
nvidia.com/mig-2g.10gb.engines.encoder: "0"
nvidia.com/mig-2g.10gb.engines.jpeg: "0"
nvidia.com/mig-2g.10gb.engines.ofa: "0"
nvidia.com/mig-2g.10gb.memory: "9984"
nvidia.com/mig-2g.10gb.multiprocessors: "28"
nvidia.com/mig-2g.10gb.slices.ci: "2"
nvidia.com/mig-2g.10gb.slices.gi: "2"
nvidia.com/mig-3g.20gb.count: "1"
nvidia.com/mig-3g.20gb.engines.copy: "3"
nvidia.com/mig-3g.20gb.engines.decoder: "2"
nvidia.com/mig-3g.20gb.engines.encoder: "0"
nvidia.com/mig-3g.20gb.engines.jpeg: "0"
nvidia.com/mig-3g.20gb.engines.ofa: "0"
nvidia.com/mig-3g.20gb.memory: "20096"
nvidia.com/mig-3g.20gb.multiprocessors: "42"
nvidia.com/mig-3g.20gb.slices.ci: "3"
nvidia.com/mig-3g.20gb.slices.gi: "3"
nvidia.com/mig.strategy: mixed
...
...
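For reference, the mixed strategy is selected via the --mig-strategy flag; a trimmed sketch of the relevant part of the device plugin container spec (container/image names are the upstream defaults, not a copy of my exact manifest) looks like:

      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:v0.7.0
        args: ["--mig-strategy=mixed"]

gpu-feature-discovery is started with the same --mig-strategy=mixed flag.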
The gpu-feature-discovery pod is working correctly. However, the problem is with the resource types advertised on my A100 GPU node.
I would expect to see the following:
kubectl describe node
...
Capacity:
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
...
Allocatable:
nvidia.com/mig-1g.5gb: 1
nvidia.com/mig-2g.10gb: 1
nvidia.com/mig-3g.20gb: 1
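The end goal is to be able to request these resources from a pod spec, roughly like the sketch below (pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: mig-1g-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1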
Instead, I am getting:
kubectl describe node
...
Capacity:
nvidia.com/gpu: 0
...
Allocatable:
nvidia.com/gpu: 0
...
Also, I am getting the error below when checking the nvidia-device-plugin logs.
kubectl -n kube-system logs nvidia-device-plugin-fsq2j
2021/07/14 22:45:43 Loading NVML
2021/07/14 22:45:43 Starting FS watcher.
2021/07/14 22:45:43 Starting OS watcher.
2021/07/14 22:45:43 Retreiving plugins.
2021/07/14 22:45:43 No devices found. Waiting indefinitely.
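For debugging on the host side, checks along these lines should confirm that MIG mode is enabled and the GPU instances exist (which the compute-instance listing above already suggests):

nvidia-smi --query-gpu=mig.mode.current --format=csv,noheader
# Enabled
nvidia-smi mig -lgi
# should list the 3g.20gb / 2g.10gb / 1g.5gb GPU instances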
I have been going through the docs but am not able to figure out what the issue is. Any help would be much appreciated.
Thank you
Hi @elezar @klueska, I finally got the solution: https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_MIG_User_Guide.pdf
“Toggling MIG mode requires the CAP_SYS_ADMIN capability. Other MIG management, such as creating and destroying instances, requires superuser by default, but can be delegated to non privileged users by adjusting permissions to MIG capabilities in /proc/” (page 12).
Following https://github.com/NVIDIA/nvidia-container-runtime, I set the environment variable NVIDIA_MIG_CONFIG_DEVICES=all.
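In DaemonSet terms this translates to something like the following trimmed sketch (not my exact manifest; I am also assuming the plugin container runs privileged so that the runtime can expose the MIG capability files, per the user-guide quote above):

      containers:
      - name: nvidia-device-plugin-ctr
        image: nvidia/k8s-device-plugin:v0.7.0
        args: ["--mig-strategy=single"]   # single is what I have tested so far
        env:
        - name: NVIDIA_MIG_CONFIG_DEVICES
          value: "all"
        securityContext:
          privileged: true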
Thank you so much for your support.
PS - Currently I have only tested with mig-strategy = single; I still need to test with mixed and will update the ticket soon!