k8s-device-plugin: Getting GPU device minor number: Not Supported


1. Issue or feature description

helm install nvidia-device-plugin

 helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.12.2 

nvidia-device-plugin-ctr logs

2022/09/06 15:24:00 Starting FS watcher.
2022/09/06 15:24:00 Starting OS watcher.
2022/09/06 15:24:00 Starting Plugins.
2022/09/06 15:24:00 Loading configuration.
2022/09/06 15:24:00 Initializing NVML.
2022/09/06 15:24:00 Updating config with default resource matching patterns.
2022/09/06 15:24:00 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "index"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2022/09/06 15:24:00 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error building GPU Device: error getting device paths: error getting GPU device minor number: Not Supported

goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc000010a30)
	/build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e5c58?, {0xc0001cc460, 0x9, 0xe}, 0x9?)
	/build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001cc460, 0x9, 0xe})
	/build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001cc460?)
	/build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001e8820, {0xca9328?, 0xc00003a050}, {0xc000032230, 0x1, 0x1})
	/build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
	/build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
	/build/cmd/nvidia-device-plugin/main.go:91 +0x665
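
The error chain ends in "error getting device paths: error getting GPU device minor number: Not Supported", which suggests the plugin derives device paths from the NVML minor number. As a rough way to see which GPU device nodes a node actually exposes, here is a minimal sketch. The function name gpu_device_nodes and its optional directory argument are hypothetical and not part of any NVIDIA tool; the assumption is that a native Linux driver creates /dev/nvidia<minor> nodes while WSL2 exposes only /dev/dxg.

```shell
#!/bin/sh
# Hypothetical helper: report which GPU device nodes exist under a /dev-like
# directory. The optional argument only exists so the check can be pointed
# at another root; it defaults to /dev.
gpu_device_nodes() {
    dev_root="${1:-/dev}"
    found=""
    # WSL2 exposes the GPU through the paravirtualized dxg node.
    if [ -e "$dev_root/dxg" ]; then
        found="$found dxg"
    fi
    # Assumption: native Linux drivers create nvidia0, nvidia1, ...
    # (one node per minor number).
    for node in "$dev_root"/nvidia[0-9]*; do
        if [ -e "$node" ]; then
            found="$found ${node##*/}"
        fi
    done
    # Print "none" when nothing was found, else the space-separated names.
    if [ -z "$found" ]; then
        echo "none"
    else
        echo "${found# }"
    fi
}
```

Running gpu_device_nodes with no argument on the affected node would be expected to list dxg but no nvidia<N> entries, if the assumption above holds.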

Running a GPU test container directly with ctr works fine:

ctr run --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:nbody test-gpu /tmp/nbody -benchmark

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance) 
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 3GB]
9216 bodies, total time for 10 iterations: 7.467 ms
= 113.747 billion interactions per second
= 2274.931 single-precision GFLOP/s at 20 flops per interaction

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
 nvidia-smi -a
 
 ==============NVSMI LOG==============
 
 Timestamp                                 : Tue Sep  6 15:30:06 2022
 Driver Version                            : 516.94
 CUDA Version                              : 11.7
 
 Attached GPUs                             : 1
 GPU 00000000:01:00.0
     Product Name                          : NVIDIA GeForce GTX 1060 3GB
     Product Brand                         : GeForce
     Product Architecture                  : Pascal
     Display Mode                          : Enabled
     Display Active                        : Enabled
     Persistence Mode                      : Enabled
     MIG Mode
         Current                           : N/A
         Pending                           : N/A
     Accounting Mode                       : Disabled
     Accounting Mode Buffer Size           : 4000
     Driver Model
         Current                           : WDDM
         Pending                           : WDDM
     Serial Number                         : N/A
     GPU UUID                              : GPU-9445de88-eb50-477d-ff7c-5e0d77cdb203
     Minor Number                          : N/A
     VBIOS Version                         : 86.06.3c.00.2e
     MultiGPU Board                        : No
     Board ID                              : 0x100
     GPU Part Number                       : N/A
     Module ID                             : 0
     Inforom Version
         Image Version                     : G001.0000.01.04
         OEM Object                        : 1.1
         ECC Object                        : N/A
         Power Management Object           : N/A
     GPU Operation Mode
         Current                           : N/A
         Pending                           : N/A
     GSP Firmware Version                  : N/A
     GPU Virtualization Mode
         Virtualization Mode               : None
         Host VGPU Mode                    : N/A
     IBMNPU
         Relaxed Ordering Mode             : N/A
     PCI
         Bus                               : 0x01
         Device                            : 0x00
         Domain                            : 0x0000
         Device Id                         : 0x1C0210DE
         Bus Id                            : 00000000:01:00.0
         Sub System Id                     : 0x11C210DE
         GPU Link Info
             PCIe Generation
                 Max                       : 3
                 Current                   : 3
             Link Width
                 Max                       : 16x
                 Current                   : 16x
         Bridge Chip
             Type                          : N/A
             Firmware                      : N/A
         Replays Since Reset               : 0
         Replay Number Rollovers           : 0
         Tx Throughput                     : 0 KB/s
         Rx Throughput                     : 8000 KB/s
     Fan Speed                             : 42 %
     Performance State                     : P5
     Clocks Throttle Reasons
         Idle                              : Active
         Applications Clocks Setting       : Not Active
         SW Power Cap                      : Not Active
         HW Slowdown                       : Not Active
             HW Thermal Slowdown           : Not Active
             HW Power Brake Slowdown       : Not Active
         Sync Boost                        : Not Active
         SW Thermal Slowdown               : Not Active
         Display Clock Setting             : Not Active
     FB Memory Usage
         Total                             : 3072 MiB
         Reserved                          : 84 MiB
         Used                              : 2407 MiB
         Free                              : 580 MiB
     BAR1 Memory Usage
         Total                             : 256 MiB
         Used                              : 2 MiB
         Free                              : 254 MiB
     Compute Mode                          : Default
     Utilization
         Gpu                               : 3 %
         Memory                            : 5 %
         Encoder                           : 0 %
         Decoder                           : 0 %
     Encoder Stats
         Active Sessions                   : 0
         Average FPS                       : 0
         Average Latency                   : 0
     FBC Stats
         Active Sessions                   : 0
         Average FPS                       : 0
         Average Latency                   : 0
     Ecc Mode
         Current                           : N/A
         Pending                           : N/A
     ECC Errors
         Volatile
             Single Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
             Double Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
         Aggregate
             Single Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
             Double Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
     Retired Pages
         Single Bit ECC                    : N/A
         Double Bit ECC                    : N/A
         Pending Page Blacklist            : N/A
     Remapped Rows                         : N/A
     Temperature
         GPU Current Temp                  : 45 C
         GPU Shutdown Temp                 : 102 C
         GPU Slowdown Temp                 : 99 C
         GPU Max Operating Temp            : N/A
         GPU Target Temperature            : 83 C
         Memory Current Temp               : N/A
         Memory Max Operating Temp         : N/A
     Power Readings
         Power Management                  : Supported
         Power Draw                        : 12.16 W
         Power Limit                       : 120.00 W
         Default Power Limit               : 120.00 W
         Enforced Power Limit              : 120.00 W
         Min Power Limit                   : 60.00 W
         Max Power Limit                   : 140.00 W
     Clocks
         Graphics                          : 683 MHz
         SM                                : 683 MHz
         Memory                            : 810 MHz
         Video                             : 607 MHz
     Applications Clocks
         Graphics                          : N/A
         Memory                            : N/A
     Default Applications Clocks
         Graphics                          : N/A
         Memory                            : N/A
     Max Clocks
         Graphics                          : 1911 MHz
         SM                                : 1911 MHz
         Memory                            : 4004 MHz
         Video                             : 1708 MHz
     Max Customer Boost Clocks
         Graphics                          : N/A
     Clock Policy
         Auto Boost                        : N/A
         Auto Boost Default                : N/A
     Voltage
         Graphics                          : N/A
     Processes                             : None
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
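
In the nvidia-smi -a output above, "Minor Number" is N/A and the driver model is WDDM, which matches the "Not Supported" error in the plugin log. A small sketch for pulling just that field out of the full report follows; the function name smi_minor_numbers is hypothetical, and it assumes the "Key : Value" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi -a` text on stdin and print
# "<GPU UUID> <Minor Number>" for each GPU in the report.
smi_minor_numbers() {
    awk -F ' *: *' '
        # Remember the UUID of the GPU block currently being read.
        /^ *GPU UUID/ { uuid = $2 }
        # Emit one line per GPU once its minor number field is seen.
        /^ *Minor Number/ {
            minor = $2
            sub(/[ \t\r]+$/, "", minor)   # drop trailing whitespace
            print uuid, minor
        }
    '
}
```

Usage: nvidia-smi -a | smi_minor_numbers. A value of N/A here means NVML cannot report a minor number for that GPU.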

Additional information that might help better understand your environment and reproduce the bug:

  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.10.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.10.0-1     amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.10.0-1     all          NVIDIA container runtime
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.10.0-1     amd64        NVIDIA container runtime hook
  • NVIDIA container library version from nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
nvidia-container-cli list  
/dev/dxg
/usr/lib/wsl/drivers/nv_dispi.inf_amd64_47917a79b8c7fd22/nvidia-smi
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libdxcore.so

containerd config (containerd.toml)

[plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvdia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runtime.v1.linux"

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
        Runtime = "nvidia-container-runtime"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runtime.v1.linux"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false
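
The config above sets default_runtime_name to "nvdia" and defines the runtime table under the same spelling, so the two are consistent here. A minimal sketch for checking that kind of consistency is shown below; the function name check_default_runtime is hypothetical, and it assumes the CRI plugin layout used in this file, with tables named [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.<name>].

```shell
#!/bin/sh
# Hypothetical helper: given a containerd config file, print the CRI
# default_runtime_name and whether a matching runtimes.<name> table exists.
check_default_runtime() {
    config="$1"
    # Take the first default_runtime_name = "..." assignment in the file.
    name="$(sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$config" | head -n 1)"
    if [ -z "$name" ]; then
        echo "default_runtime_name not set"
        return 1
    fi
    # Look for a table header ending in .containerd.runtimes.<name>]
    if grep -Fq ".containerd.runtimes.$name]" "$config"; then
        echo "$name: runtime table found"
    else
        echo "$name: no matching runtime table"
        return 1
    fi
}
```

Usage: check_default_runtime /etc/containerd/config.toml, where the path is an assumption and should be replaced with wherever the node keeps its containerd config.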

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 27 (9 by maintainers)

Most upvoted comments

✔️ registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016

WSL environment

WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.3208

K8S Setup

$ k3s --version
k3s version v1.26.4+k3s1 (8d0255af)
go version go1.19.8

nvidia-smi output in WSL

Tue Jul 25 16:36:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8               3W /  40W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

##deleted processes table##

nvidia-device-plugin daemonset pod log

I0725 06:26:03.108417       1 main.go:154] Starting FS watcher.
I0725 06:26:03.108468       1 main.go:161] Starting OS watcher.
I0725 06:26:03.108974       1 main.go:176] Starting Plugins.
I0725 06:26:03.108995       1 main.go:234] Loading configuration.
I0725 06:26:03.109063       1 main.go:242] Updating config with default resource matching patterns.
I0725 06:26:03.109205       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0725 06:26:03.109219       1 main.go:256] Retrieving plugins.
I0725 06:26:03.113336       1 factory.go:107] Detected NVML platform: found NVML library
I0725 06:26:03.113372       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0725 06:26:03.138677       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0725 06:26:03.139033       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0725 06:26:03.143248       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Test GPU pod output

Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    -fullscreen       (run n-body simulation in fullscreen mode)
    -fp64             (use double precision floating point values for simulation)
    -hostmem          (stores simulation data in host memory)
    -benchmark        (run benchmark to measure performance) 
    -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
    -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    -compare          (compares simulation results running once on the default GPU and once on the CPU)
    -cpu              (run n-body simulation on the CPU)
    -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA RTX A2000 8GB Laptop GPU]
20480 bodies, total time for 10 iterations: 25.066 ms
= 167.327 billion interactions per second
= 3346.542 single-precision GFLOP/s at 20 flops per interaction
Stream closed EOF for default/nbody-gpu-benchmark (cuda-container)

Thank you, @elezar. I hope this commit can be merged into this repo and published ASAP 🚀!

@davidshen84 I can also confirm it works. However, a few additional steps are needed on the WSL node:

$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /dev/dxg /run/nvidia/driver/dev/dxg

Annotate the WSL node:

    nvidia.com/gpu-driver-upgrade-state: pod-restart-required
    nvidia.com/gpu.count: '1'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true'
    nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    nvidia.com/gpu.deploy.device-plugin: 'true'
    nvidia.com/gpu.deploy.driver: 'true'
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    nvidia.com/gpu.deploy.node-status-exporter: 'true'
    nvidia.com/gpu.deploy.nvsm: ''
    nvidia.com/gpu.deploy.operands: 'true'
    nvidia.com/gpu.deploy.operator-validator: 'true'
    nvidia.com/gpu.present: 'true'
    nvidia.com/device-plugin.config: 'RTX-4070-Ti'

Change device plugin in ClusterPolicy:

  devicePlugin:
    config:
      name: time-slicing-config
    enabled: true
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
    version: 8b416016

It should work now:

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9

> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070 Ti]
61440 bodies, total time for 10 iterations: 34.665 ms
= 1088.943 billion interactions per second
= 21778.869 single-precision GFLOP/s at 20 flops per interaction

I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is the right image, correct?

Hi @elezar,

I’m also interested in running the device plugin with WSL2. I have created an MR: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291

It would be great to get those changes in.