k8s-device-plugin: Getting GPU device minor number: Not Supported
1. Issue or feature description
Installed the nvidia-device-plugin with Helm:
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.12.2
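For reference, the nvdp chart repo comes from the plugin's standard install instructions and was added beforehand with:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update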
nvidia-device-plugin-ctr container logs:
2022/09/06 15:24:00 Starting FS watcher.
2022/09/06 15:24:00 Starting OS watcher.
2022/09/06 15:24:00 Starting Plugins.
2022/09/06 15:24:00 Loading configuration.
2022/09/06 15:24:00 Initializing NVML.
2022/09/06 15:24:00 Updating config with default resource matching patterns.
2022/09/06 15:24:00
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"plugin": {
"passDeviceSpecs": true,
"deviceListStrategy": "envvar",
"deviceIDStrategy": "index"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
2022/09/06 15:24:00 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error building GPU Device: error getting device paths: error getting GPU device minor number: Not Supported
goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc000010a30)
/build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e5c58?, {0xc0001cc460, 0x9, 0xe}, 0x9?)
/build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001cc460, 0x9, 0xe})
/build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001cc460?)
/build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001e8820, {0xca9328?, 0xc00003a050}, {0xc000032230, 0x1, 0x1})
/build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
/build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
/build/cmd/nvidia-device-plugin/main.go:91 +0x665
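The error chain above shows the plugin failing while resolving device paths from the GPU's minor number via NVML. On a WSL2/WDDM node the GPU is exposed through /dev/dxg rather than /dev/nvidiaN nodes, so the minor-number query returns "Not Supported". A minimal Go sketch, assuming the github.com/NVIDIA/go-nvml bindings, that should reproduce the same NVML return outside the plugin:

package main

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		panic(nvml.ErrorString(ret))
	}

	// The device plugin derives the /dev/nvidia<minor> path from this value.
	// On a WSL2 / WDDM node this call is expected to fail with
	// ERROR_NOT_SUPPORTED, which surfaces as "Not Supported" in the panic above.
	minor, ret := device.GetMinorNumber()
	if ret != nvml.SUCCESS {
		fmt.Printf("GetMinorNumber: %s\n", nvml.ErrorString(ret))
		return
	}
	fmt.Printf("minor number %d -> /dev/nvidia%d\n", minor, minor)
}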
When I run a test workload directly with ctr, the GPU works fine:
ctr run --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:nbody test-gpu /tmp/nbody -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1
> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 3GB]
9216 bodies, total time for 10 iterations: 7.467 ms
= 113.747 billion interactions per second
= 2274.931 single-precision GFLOP/s at 20 flops per interaction
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- The output of nvidia-smi -a on your host
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Tue Sep 6 15:30:06 2022
Driver Version : 516.94
CUDA Version : 11.7
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce GTX 1060 3GB
Product Brand : GeForce
Product Architecture : Pascal
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : WDDM
Pending : WDDM
Serial Number : N/A
GPU UUID : GPU-9445de88-eb50-477d-ff7c-5e0d77cdb203
Minor Number : N/A
VBIOS Version : 86.06.3c.00.2e
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1C0210DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x11C210DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 8000 KB/s
Fan Speed : 42 %
Performance State : P5
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 3072 MiB
Reserved : 84 MiB
Used : 2407 MiB
Free : 580 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 3 %
Memory : 5 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 45 C
GPU Shutdown Temp : 102 C
GPU Slowdown Temp : 99 C
GPU Max Operating Temp : N/A
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 12.16 W
Power Limit : 120.00 W
Default Power Limit : 120.00 W
Enforced Power Limit : 120.00 W
Min Power Limit : 60.00 W
Max Power Limit : 140.00 W
Clocks
Graphics : 683 MHz
SM : 683 MHz
Memory : 810 MHz
Video : 607 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 4004 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None
- Your docker configuration file (e.g. /etc/docker/daemon.json)
- The k8s-device-plugin container logs
- The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- Any relevant kernel output lines from dmesg
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.10.0-1 all NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook
- NVIDIA container library version from nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
nvidia-container-cli list
/dev/dxg
/usr/lib/wsl/drivers/nv_dispi.inf_amd64_47917a79b8c7fd22/nvidia-smi
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libdxcore.so
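These paths are the WSL2 driver stack: the GPU is reached through /dev/dxg and libraries from the Windows driver store, and there are no /dev/nvidiaN character devices (hence no minor numbers). A quick check on the node, assuming a typical WSL2 setup:
ls -l /dev/dxg /dev/nvidia* 2>/dev/null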
containerd config (containerd.toml):
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvdia"
disable_snapshot_annotations = true
discard_unpacked_layers = false
no_pivot = false
snapshotter = "overlayfs"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
Runtime = "nvidia-container-runtime"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = "io.containerd.runtime.v1.linux"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia.options]
BinaryName = "/usr/bin/nvidia-container-runtime"
Runtime = "nvidia-container-runtime"
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
IoUid = 0
NoNewKeyring = false
NoPivotRoot = false
Root = ""
ShimCgroup = ""
SystemdCgroup = false
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
base_runtime_spec = ""
container_annotations = []
pod_annotations = []
privileged_without_host_devices = false
runtime_engine = ""
runtime_root = ""
runtime_type = ""
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]
[plugins."io.containerd.grpc.v1.cri".image_decryption]
key_model = "node"
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = ""
[plugins."io.containerd.grpc.v1.cri".registry.auths]
[plugins."io.containerd.grpc.v1.cri".registry.configs]
[plugins."io.containerd.grpc.v1.cri".registry.headers]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "nvidia-container-runtime"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false
✔️ Confirmed working with registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
(collapsed sections in the comment: WSL environment, K8S Setup, nvidia-smi output in WSL, nvidia-device-plugin daemonset pod log, Test GPU pod output)
Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support
Thank you @elezar. I hope this commit can be merged into this repo and published ASAP 🚀!
@davidshen84 I can also confirm it works. However, we had to add a few extra steps:
- Annotate the WSL node:
- Change the device plugin in the ClusterPolicy:
It should work for now:
I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 would be the right image, right?

Hi @elezar,
I'm also interested in running the device plugin with WSL2. I have created an MR: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291
It would be great to get those changes in.