gpu-operator: Error when trying to use operator on DGX A100-80GB with microk8s and mixed strategy MIG

1. Issue or feature description

On a DGX A100-80GB, I am trying to install the operator with the mixed MIG strategy. Feature discovery and node labeling work fine while MIG is disabled, but as soon as I set a MIG config label on the node and mig-manager reconfigures the GPUs, GPU discovery and labeling stop working.
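
For context, the step that triggers the failure is applying a MIG profile through the node label that mig-manager watches. A minimal example of that step (all-1g.10gb is just one profile from the default mig-parted config for the A100-80GB; any profile that enables MIG reproduces the problem):

$ kubectl label node/dgxa100 nvidia.com/mig.config=all-1g.10gb --overwrite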

2. Steps to reproduce the issue

System info:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal
$ dpkg -l | grep nvidia
ii  libnvidia-cfg1-470-server:amd64      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-470-server          470.103.01-0ubuntu0.20.04.1             all          Shared files used by the NVIDIA libraries
ii  libnvidia-compute-470-server:amd64   470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA libcompute package
ii  libnvidia-container-tools            1.7.0-1                                 amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.7.0-1                                 amd64        NVIDIA container runtime library
ii  libnvidia-decode-470-server:amd64    470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-470-server:amd64    470.103.01-0ubuntu0.20.04.1             amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-470-server:amd64     470.103.01-0ubuntu0.20.04.1             amd64        Extra libraries for the NVIDIA Server Driver
ii  libnvidia-fbc1-470-server:amd64      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-470-server:amd64        470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-ifr1-470-server:amd64      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  nvidia-acs-disable                   19.12.0                                 amd64        Disables the PCIe ACS capability
ii  nvidia-compute-utils-470-server      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA compute utilities
ii  nvidia-conf-cachefilesd              20.06-1                                 amd64        Systemd settings for cachefilesd
ii  nvidia-container-runtime             3.7.0-1                                 all          NVIDIA container runtime
ii  nvidia-container-toolkit             1.7.0-1                                 amd64        NVIDIA container runtime hook
ii  nvidia-crashdump                     20.12-1                                 amd64        NVIDA crash dump policy
ii  nvidia-dcgm-enable                   21.07-1                                 all          Enable DCGM
ii  nvidia-disable-iscsid                20.06-1                                 all          Disable iscsid on NVIDIA platforms that don't support it
ii  nvidia-dkms-470-server               470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA DKMS package
ii  nvidia-driver-470-server             470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA Server Driver metapackage
ii  nvidia-enable-journaling             20.06-1                                 all          Package that enables journal_data on root file system
ii  nvidia-fabricmanager-470             470.103.01-0ubuntu0.20.04.1             amd64        Fabric Manager for NVSwitch based systems.
ii  nvidia-icmp                          20.06-1                                 amd64        DGX iptable settings
ii  nvidia-ipmisol                       21.01-1                                 amd64        Enable IPMI Serial-over-LAN
ii  nvidia-kernel-common-470-server      470.103.01-0ubuntu0.20.04.1             amd64        Shared files used with the kernel module
ii  nvidia-kernel-defaults               21.05-1                                 all          sysctl default kernel settings for DGX.
ii  nvidia-kernel-source-470-server      470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA kernel source package
ii  nvidia-lldpd-defaults                21.05-1                                 all          lldpd defaults for Nvidia servers
ii  nvidia-logrotate                     21.11-1                                 all          NVIDIA logrotate policy
ii  nvidia-mig-manager                   0.1.2-1                                 amd64        NVIDIA MIG Partition Editor and Systemd Service
ii  nvidia-mlnx-config                   20.10.1                                 amd64        Configure the MLNX devices
ii  nvidia-motd                          21.03-1                                 all          Custom motd files for NVIDIA platforms
ii  nvidia-nvme-core-options             20.06-1                                 amd64        Modify nvme core options
ii  nvidia-nvme-smartd                   20.06-1                                 all          Enable SMART monitoring on NVME devices
ii  nvidia-oem-config-bmc                21.01-2                                 all          Ubiquity plugin to configure BMC on NVIDIA platforms
ii  nvidia-oem-config-crypt-passwd       21.01-2                                 all          Ubiquity plugin to reset crypt password
ii  nvidia-oem-config-eula               21.01-2                                 all          Ubiquity plugin to display EULA
ii  nvidia-oem-config-grub-passwd        21.01-2                                 all          Ubiquity plugin to configure GRUB password on NVIDIA platforms
ii  nvidia-oem-config-postact            21.01-2                                 all          Ubiquity plugin to complete final actions before booting
ii  nvidia-pci-bridge-power              21.11-1                                 amd64        Sets PCI bridge power control to on
ii  nvidia-peer-memory                   1.2-0-nvidia1                           all          nvidia peer memory kernel module.
ii  nvidia-peer-memory-dkms              1.2-0-nvidia1                           all          DKMS support for nvidia-peer-memory kernel modules
ii  nvidia-raid-config                   21.07-1                                 amd64        DGX RAID Configuration
ii  nvidia-redfish-config                20.10-1                                 all          Configure Redfish Host Interface
ii  nvidia-relaxed-ordering-gpu          20.10-1                                 amd64        Configure PCIe Relaxed Ordering
ii  nvidia-relaxed-ordering-nvme         20.10-1                                 amd64        Configure PCIe Relaxed Ordering
ii  nvidia-repo-keys                     20.06-1                                 amd64        Adds keys to apt trusted.gpg database
ii  nvidia-system-tools                  20.11-1                                 amd64        Metapackage for NVIDIA system tools stack
ii  nvidia-utils-470-server              470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA Server Driver support binaries
ii  xserver-xorg-video-nvidia-470-server 470.103.01-0ubuntu0.20.04.1             amd64        NVIDIA binary Xorg driver

Steps:

$ sudo snap install microk8s --classic
$ sudo microk8s.enable dns helm3
$ sudo microk8s.helm3 repo add nvidia https://nvidia.github.io/gpu-operator
$ sudo microk8s.helm3 repo update
$ cat gpu-operator-helm-chart-options.yaml
operator:
  defaultRuntime: containerd

driver:
  enabled: false

mig:
  strategy: mixed

toolkit:
  enabled: false
  env:
  - name: CONTAINERD_CONFIG
    value: /var/snap/microk8s/current/args/containerd-template.toml
  - name: CONTAINERD_SOCKET
    value: /var/snap/microk8s/common/run/containerd.sock
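
For anyone reproducing this on microk8s: the containerd config and socket live under the snap paths given above rather than the usual /etc/containerd locations. A quick sanity check, only to confirm the paths on your node match before installing:

$ ls -l /var/snap/microk8s/current/args/containerd-template.toml
$ ls -l /var/snap/microk8s/common/run/containerd.sock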

$ sudo microk8s.helm3 install --wait gpu-operator -n gpu-operator --create-namespace -f gpu-operator-helm-chart-options.yaml nvidia/gpu-operator
NAME: gpu-operator
LAST DEPLOYED: Mon Feb 14 19:07:44 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Then:

$ kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-master-5f6fb954cf-pzl54   1/1     Running                 0             3m15s
gpu-operator-node-feature-discovery-worker-vxx4w              1/1     Running                 0             3m15s
gpu-operator-7ff85f9c4f-6cggd                                 1/1     Running                 0             3m15s
nvidia-mig-manager-z87d5                                      1/1     Running                 0             2m56s
nvidia-operator-validator-p78bm                               0/1     Init:2/4                0             2m56s
nvidia-dcgm-exporter-fw6mm                                    0/1     CrashLoopBackOff        4 (86s ago)   2m56s
gpu-feature-discovery-rchnt                                   0/1     CrashLoopBackOff        4 (85s ago)   2m56s
nvidia-device-plugin-daemonset-vlmjb                          0/1     CrashLoopBackOff        4 (71s ago)   2m56s
nvidia-cuda-validator-ws464                                   0/1     Init:CrashLoopBackOff   4 (69s ago)   2m46s
$ kubectl logs -n gpu-operator pods/nvidia-device-plugin-daemonset-vlmjb
2022/02/14 19:11:12 Loading NVML
2022/02/14 19:11:12 Starting FS watcher.
2022/02/14 19:11:12 Starting OS watcher.
2022/02/14 19:11:12 Retreiving plugins.
2022/02/14 19:11:12 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-78dba802-c88e-5c2f-f8f1-1d6715d3b565

goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0xe25da8, 0x5, 0xac21c0, 0xe25da8)
	/build/cmd/nvidia-device-plugin/mig-strategy.go:171 +0x865
main.start(0xc0002e3040, 0x0, 0x0)
	/build/cmd/nvidia-device-plugin/main.go:149 +0x5bc
github.com/urfave/cli/v2.(*App).RunContext(0xc000466000, 0xac8e80, 0xc000028038, 0xc0000201d0, 0x1, 0x1, 0x0, 0x0)
	/build/vendor/github.com/urfave/cli/v2/app.go:315 +0x70d
github.com/urfave/cli/v2.(*App).Run(...)
	/build/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
	/build/cmd/nvidia-device-plugin/main.go:91 +0x5c5
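
The panic says that NVML sees at least one GPU with MIG mode enabled but no MIG devices created on it. A quick host-side check with standard nvidia-smi commands can confirm which state the GPUs are actually in (listed here only as a debugging aid, not something the operator runs):

$ nvidia-smi -L
$ sudo nvidia-smi mig -lgi   # list GPU instances
$ sudo nvidia-smi mig -lci   # list compute instances
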
$ kubectl logs -n gpu-operator pod/gpu-feature-discovery-rchnt
gpu-feature-discovery: 2022/02/14 19:13:48 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2022/02/14 19:13:48 Loaded configuration:
gpu-feature-discovery: 2022/02/14 19:13:48 Oneshot: false
gpu-feature-discovery: 2022/02/14 19:13:48 FailOnInitError: true
gpu-feature-discovery: 2022/02/14 19:13:48 SleepInterval: 1m0s
gpu-feature-discovery: 2022/02/14 19:13:48 MigStrategy: mixed
gpu-feature-discovery: 2022/02/14 19:13:48 NoTimestamp: false
gpu-feature-discovery: 2022/02/14 19:13:48 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2022/02/14 19:13:48 Start running
gpu-feature-discovery: 2022/02/14 19:13:48 Warning: Error removing output file: Failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/02/14 19:13:48 Unexpected error: Error generating NVML labels: Error generating common labels: Error getting device: nvml: Insufficient Permissions
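
The "Insufficient Permissions" error from NVML usually means the container was not granted the MIG monitor capability. One way to check whether the gpu-feature-discovery pod was started with NVIDIA_MIG_MONITOR_DEVICES=all (the variable discussed in the comments below) is to inspect the pod spec; the pod name here is the one from the listing above:

$ kubectl get pod -n gpu-operator gpu-feature-discovery-rchnt -o yaml | grep -B1 -A1 NVIDIA_MIG_MONITOR_DEVICES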

But if I then do:

$ kubectl label node/dgxa100 nvidia.com/mig.config=all-disabled --overwrite
$ kubectl logs -f -n gpu-operator pod/nvidia-mig-manager-z87d5
time="2022-02-14T19:15:00Z" level=info msg="Updating to MIG config: all-disabled"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=failed'
Checking if the selected MIG config is currently applied or not
time="2022-02-14T19:15:00Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Persisting all-disabled to /etc/systemd/system/nvidia-mig-manager.service.d/override.conf
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2022-02-14T19:15:01Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/dgxa100 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/dgxa100 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-vlmjb condition met
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Shutting down all GPU clients on the host by stopping their systemd services
Stopping nvsm.service (active, will-restart)
Skipping nvsm-mqtt.service (inactive, will-restart)
Skipping nvsm-core.service (inactive, will-restart)
Skipping nvsm-api-gateway.service (inactive, will-restart)
Skipping nvsm-notifier.service (inactive, will-restart)
Stopping nv_peer_mem.service (active, will-restart)
Stopping nvidia-dcgm.service (active, will-restart)
Skipping dcgm.service (disabled)
Skipping dcgm-exporter.service (no-exist)
Skipping kubelet.service (no-exist)
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-02-14T19:15:44Z" level=debug msg="Parsing config file..."
time="2022-02-14T19:15:44Z" level=debug msg="Selecting specific MIG config..."
time="2022-02-14T19:15:44Z" level=debug msg="Running apply-start hook"
time="2022-02-14T19:15:44Z" level=debug msg="Checking current MIG mode..."
time="2022-02-14T19:15:44Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="Running pre-apply-mode hook"
time="2022-02-14T19:15:44Z" level=debug msg="Applying MIG mode change..."
time="2022-02-14T19:15:44Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg="    Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg="    Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="At least one mode change pending"
time="2022-02-14T19:15:45Z" level=debug msg="Resetting GPUs..."
time="2022-02-14T19:15:45Z" level=debug msg="  NVIDIA kernel module loaded"
time="2022-02-14T19:15:45Z" level=debug msg="  Using nvidia-smi to perform GPU reset"
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-02-14T19:16:05Z" level=debug msg="Parsing config file..."
time="2022-02-14T19:16:05Z" level=debug msg="Selecting specific MIG config..."
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-start hook"
time="2022-02-14T19:16:05Z" level=debug msg="Checking current MIG mode..."
time="2022-02-14T19:16:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="    Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="    MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg="    Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="Checking current MIG device configuration..."
time="2022-02-14T19:16:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 1: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 2: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 3: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 4: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 5: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 6: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="  GPU 7: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting all GPU clients previously shutdown on the host by restarting their systemd services
Starting nvidia-dcgm.service
Starting nv_peer_mem.service
Starting nvsm-notifier.service
Starting nvsm-api-gateway.service
Starting nvsm-core.service
Starting nvsm-mqtt.service
Starting nvsm.service
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/dgxa100 labeled
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-p78bm" deleted
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/dgxa100 labeled
time="2022-02-14T19:17:43Z" level=info msg="Successfuly updated to MIG config: all-disabled"
time="2022-02-14T19:17:43Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"

And then GPUs pop up normally on the node and I can allocate and use them:

$ kubectl describe nodes
[...]
Capacity:
  cpu:                256
  ephemeral-storage:  1843217020Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2113603344Ki
  nvidia.com/gpu:     8
  pods:               110
[...]
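
To confirm the GPUs are actually allocatable at this point, a minimal test pod along the following lines can be scheduled (the pod name and CUDA image tag are just examples; with MIG disabled and the mixed strategy, the resource to request is plain nvidia.com/gpu):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1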

Most upvoted comments

NVIDIA_MIG_MONITOR_DEVICES=all

That works, thanks! Helped me a lot.

I would only expect an "Insufficient Permissions" error from gpu-feature-discovery if NVIDIA_MIG_MONITOR_DEVICES=all was not set as an environment variable when it was launched. As far as I know, this should be set by the operator, though.
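
If the variable really is missing from the daemonsets, one possible stop-gap (not an official fix, and the operator may reconcile the daemonsets back to their original spec) is to inject it directly; the daemonset names below are taken from the pod names earlier in the report:

$ kubectl set env -n gpu-operator daemonset/gpu-feature-discovery NVIDIA_MIG_MONITOR_DEVICES=all
$ kubectl set env -n gpu-operator daemonset/nvidia-device-plugin-daemonset NVIDIA_MIG_MONITOR_DEVICES=all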