gpu-operator: Error when trying to use operator on DGX A100-80GB with microk8s and mixed strategy MIG
1. Issue or feature description
On a DGX A100-80GB, I am trying to install the operator with the mixed MIG strategy. Feature discovery and node labeling work fine while MIG is disabled, but as soon as I set a MIG config label on the node and mig-manager reconfigures the GPUs, GPU discovery and labeling stop working.
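For reference, the failure is triggered by applying a MIG config label to the node; the profile below is just an example, and any MIG-enabled profile from the default mig-parted config reproduces it for me:
$ kubectl label node/dgxa100 nvidia.com/mig.config=all-1g.10gb --overwrite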
2. Steps to reproduce the issue
System info:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
$ dpkg -l | grep nvidia
ii libnvidia-cfg1-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-470-server 470.103.01-0ubuntu0.20.04.1 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.7.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.7.0-1 amd64 NVIDIA container runtime library
ii libnvidia-decode-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVENC Video Encoding runtime library
ii libnvidia-extra-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 Extra libraries for the NVIDIA Server Driver
ii libnvidia-fbc1-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-470-server:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nvidia-acs-disable 19.12.0 amd64 Disables the PCIe ACS capability
ii nvidia-compute-utils-470-server 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA compute utilities
ii nvidia-conf-cachefilesd 20.06-1 amd64 Systemd settings for cachefilesd
ii nvidia-container-runtime 3.7.0-1 all NVIDIA container runtime
ii nvidia-container-toolkit 1.7.0-1 amd64 NVIDIA container runtime hook
ii nvidia-crashdump 20.12-1 amd64 NVIDA crash dump policy
ii nvidia-dcgm-enable 21.07-1 all Enable DCGM
ii nvidia-disable-iscsid 20.06-1 all Disable iscsid on NVIDIA platforms that don't support it
ii nvidia-dkms-470-server 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA DKMS package
ii nvidia-driver-470-server 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver metapackage
ii nvidia-enable-journaling 20.06-1 all Package that enables journal_data on root file system
ii nvidia-fabricmanager-470 470.103.01-0ubuntu0.20.04.1 amd64 Fabric Manager for NVSwitch based systems.
ii nvidia-icmp 20.06-1 amd64 DGX iptable settings
ii nvidia-ipmisol 21.01-1 amd64 Enable IPMI Serial-over-LAN
ii nvidia-kernel-common-470-server 470.103.01-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-kernel-defaults 21.05-1 all sysctl default kernel settings for DGX.
ii nvidia-kernel-source-470-server 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA kernel source package
ii nvidia-lldpd-defaults 21.05-1 all lldpd defaults for Nvidia servers
ii nvidia-logrotate 21.11-1 all NVIDIA logrotate policy
ii nvidia-mig-manager 0.1.2-1 amd64 NVIDIA MIG Partition Editor and Systemd Service
ii nvidia-mlnx-config 20.10.1 amd64 Configure the MLNX devices
ii nvidia-motd 21.03-1 all Custom motd files for NVIDIA platforms
ii nvidia-nvme-core-options 20.06-1 amd64 Modify nvme core options
ii nvidia-nvme-smartd 20.06-1 all Enable SMART monitoring on NVME devices
ii nvidia-oem-config-bmc 21.01-2 all Ubiquity plugin to configure BMC on NVIDIA platforms
ii nvidia-oem-config-crypt-passwd 21.01-2 all Ubiquity plugin to reset crypt password
ii nvidia-oem-config-eula 21.01-2 all Ubiquity plugin to display EULA
ii nvidia-oem-config-grub-passwd 21.01-2 all Ubiquity plugin to configure GRUB password on NVIDIA platforms
ii nvidia-oem-config-postact 21.01-2 all Ubiquity plugin to complete final actions before booting
ii nvidia-pci-bridge-power 21.11-1 amd64 Sets PCI bridge power control to on
ii nvidia-peer-memory 1.2-0-nvidia1 all nvidia peer memory kernel module.
ii nvidia-peer-memory-dkms 1.2-0-nvidia1 all DKMS support for nvidia-peer-memory kernel modules
ii nvidia-raid-config 21.07-1 amd64 DGX RAID Configuration
ii nvidia-redfish-config 20.10-1 all Configure Redfish Host Interface
ii nvidia-relaxed-ordering-gpu 20.10-1 amd64 Configure PCIe Relaxed Ordering
ii nvidia-relaxed-ordering-nvme 20.10-1 amd64 Configure PCIe Relaxed Ordering
ii nvidia-repo-keys 20.06-1 amd64 Adds keys to apt trusted.gpg database
ii nvidia-system-tools 20.11-1 amd64 Metapackage for NVIDIA system tools stack
ii nvidia-utils-470-server 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver support binaries
ii xserver-xorg-video-nvidia-470-server 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA binary Xorg driver
Steps:
$ sudo snap install microk8s --classic
$ sudo microk8s.enable dns helm3
$ sudo microk8s.helm3 repo add nvidia https://nvidia.github.io/gpu-operator
$ sudo microk8s.helm3 repo update
$ cat gpu-operator-helm-chart-options.yaml
operator:
  defaultRuntime: containerd
driver:
  enabled: false
mig:
  strategy: mixed
toolkit:
  enabled: false
  env:
  - name: CONTAINERD_CONFIG
    value: /var/snap/microk8s/current/args/containerd-template.toml
  - name: CONTAINERD_SOCKET
    value: /var/snap/microk8s/common/run/containerd.sock
$ sudo microk8s.helm3 install --wait gpu-operator -n gpu-operator --create-namespace -f gpu-operator-helm-chart-options.yaml nvidia/gpu-operator
NAME: gpu-operator
LAST DEPLOYED: Mon Feb 14 19:07:44 2022
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
Then:
$ kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-master-5f6fb954cf-pzl54   1/1     Running                 0             3m15s
gpu-operator-node-feature-discovery-worker-vxx4w              1/1     Running                 0             3m15s
gpu-operator-7ff85f9c4f-6cggd                                 1/1     Running                 0             3m15s
nvidia-mig-manager-z87d5                                      1/1     Running                 0             2m56s
nvidia-operator-validator-p78bm                               0/1     Init:2/4                0             2m56s
nvidia-dcgm-exporter-fw6mm                                    0/1     CrashLoopBackOff        4 (86s ago)   2m56s
gpu-feature-discovery-rchnt                                   0/1     CrashLoopBackOff        4 (85s ago)   2m56s
nvidia-device-plugin-daemonset-vlmjb                          0/1     CrashLoopBackOff        4 (71s ago)   2m56s
nvidia-cuda-validator-ws464                                   0/1     Init:CrashLoopBackOff   4 (69s ago)   2m46s
$ kubectl logs -n gpu-operator pods/nvidia-device-plugin-daemonset-vlmjb
2022/02/14 19:11:12 Loading NVML
2022/02/14 19:11:12 Starting FS watcher.
2022/02/14 19:11:12 Starting OS watcher.
2022/02/14 19:11:12 Retreiving plugins.
2022/02/14 19:11:12 Shutdown of NVML returned: <nil>
panic: At least one device with migEnabled=true was not configured correctly: No MIG devices associated with /dev/nvidia0: GPU-78dba802-c88e-5c2f-f8f1-1d6715d3b565
goroutine 1 [running]:
main.(*migStrategyMixed).GetPlugins(0xe25da8, 0x5, 0xac21c0, 0xe25da8)
/build/cmd/nvidia-device-plugin/mig-strategy.go:171 +0x865
main.start(0xc0002e3040, 0x0, 0x0)
/build/cmd/nvidia-device-plugin/main.go:149 +0x5bc
github.com/urfave/cli/v2.(*App).RunContext(0xc000466000, 0xac8e80, 0xc000028038, 0xc0000201d0, 0x1, 0x1, 0x0, 0x0)
/build/vendor/github.com/urfave/cli/v2/app.go:315 +0x70d
github.com/urfave/cli/v2.(*App).Run(...)
/build/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
/build/cmd/nvidia-device-plugin/main.go:91 +0x5c5
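The panic suggests the GPUs are in MIG mode but have no MIG devices (GPU instances) created on them, which is exactly what the mixed strategy refuses to tolerate. A quick way to confirm that state on the host (just a sanity check with the stock nvidia-smi tooling, nothing operator-specific) is:
$ nvidia-smi --query-gpu=index,mig.mode.current --format=csv
$ sudo nvidia-smi mig -lgi
If the first command reports Enabled for every GPU while the second lists no GPU instances, the device plugin fails exactly as above.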
$ kubectl logs -n gpu-operator pod/gpu-feature-discovery-rchnt
gpu-feature-discovery: 2022/02/14 19:13:48 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2022/02/14 19:13:48 Loaded configuration:
gpu-feature-discovery: 2022/02/14 19:13:48 Oneshot: false
gpu-feature-discovery: 2022/02/14 19:13:48 FailOnInitError: true
gpu-feature-discovery: 2022/02/14 19:13:48 SleepInterval: 1m0s
gpu-feature-discovery: 2022/02/14 19:13:48 MigStrategy: mixed
gpu-feature-discovery: 2022/02/14 19:13:48 NoTimestamp: false
gpu-feature-discovery: 2022/02/14 19:13:48 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2022/02/14 19:13:48 Start running
gpu-feature-discovery: 2022/02/14 19:13:48 Warning: Error removing output file: Failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/02/14 19:13:48 Unexpected error: Error generating NVML labels: Error generating common labels: Error getting device: nvml: Insufficient Permissions
But if I then do:
$ kubectl label node/dgxa100 nvidia.com/mig.config=all-disabled --overwrite
$ kubectl logs -f -n gpu-operator pod/nvidia-mig-manager-z87d5
time="2022-02-14T19:15:00Z" level=info msg="Updating to MIG config: all-disabled"
Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
Current value of 'nvidia.com/gpu.deploy.dcgm=true'
Asserting that the requested configuration is present in the configuration file
Selected MIG configuration is valid
Getting current value of the 'nvidia.com/mig.config.state' node label
Current value of 'nvidia.com/mig.config.state=failed'
Checking if the selected MIG config is currently applied or not
time="2022-02-14T19:15:00Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Persisting all-disabled to /etc/systemd/system/nvidia-mig-manager.service.d/override.conf
Checking if the MIG mode setting in the selected config is currently applied or not
If the state is 'rebooting', we expect this to always return true
time="2022-02-14T19:15:01Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
Changing the 'nvidia.com/mig.config.state' node label to 'pending'
node/dgxa100 labeled
Shutting down all GPU clients in Kubernetes by disabling their component-specific nodeSelector labels
node/dgxa100 labeled
Waiting for the device-plugin to shutdown
pod/nvidia-device-plugin-daemonset-vlmjb condition met
Waiting for gpu-feature-discovery to shutdown
Waiting for dcgm-exporter to shutdown
Waiting for dcgm to shutdown
Shutting down all GPU clients on the host by stopping their systemd services
Stopping nvsm.service (active, will-restart)
Skipping nvsm-mqtt.service (inactive, will-restart)
Skipping nvsm-core.service (inactive, will-restart)
Skipping nvsm-api-gateway.service (inactive, will-restart)
Skipping nvsm-notifier.service (inactive, will-restart)
Stopping nv_peer_mem.service (active, will-restart)
Stopping nvidia-dcgm.service (active, will-restart)
Skipping dcgm.service (disabled)
Skipping dcgm-exporter.service (no-exist)
Skipping kubelet.service (no-exist)
Applying the MIG mode change from the selected config to the node
If the -r option was passed, the node will be automatically rebooted if this is not successful
time="2022-02-14T19:15:44Z" level=debug msg="Parsing config file..."
time="2022-02-14T19:15:44Z" level=debug msg="Selecting specific MIG config..."
time="2022-02-14T19:15:44Z" level=debug msg="Running apply-start hook"
time="2022-02-14T19:15:44Z" level=debug msg="Checking current MIG mode..."
time="2022-02-14T19:15:44Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:15:44Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg="Running pre-apply-mode hook"
time="2022-02-14T19:15:44Z" level=debug msg="Applying MIG mode change..."
time="2022-02-14T19:15:44Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:15:44Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg=" GPU 1: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg=" GPU 2: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg=" GPU 3: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:44Z" level=debug msg=" GPU 4: 0x20B210DE"
time="2022-02-14T19:15:44Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:44Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:44Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg=" GPU 5: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg=" GPU 6: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg=" GPU 7: 0x20B210DE"
time="2022-02-14T19:15:45Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:15:45Z" level=debug msg=" Current MIG mode: Enabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Updating MIG mode: Disabled"
time="2022-02-14T19:15:45Z" level=debug msg=" Mode change pending: true"
time="2022-02-14T19:15:45Z" level=debug msg="At least one mode change pending"
time="2022-02-14T19:15:45Z" level=debug msg="Resetting GPUs..."
time="2022-02-14T19:15:45Z" level=debug msg=" NVIDIA kernel module loaded"
time="2022-02-14T19:15:45Z" level=debug msg=" Using nvidia-smi to perform GPU reset"
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Applying the selected MIG config to the node
time="2022-02-14T19:16:05Z" level=debug msg="Parsing config file..."
time="2022-02-14T19:16:05Z" level=debug msg="Selecting specific MIG config..."
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-start hook"
time="2022-02-14T19:16:05Z" level=debug msg="Checking current MIG mode..."
time="2022-02-14T19:16:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 1: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 2: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 3: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 4: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 5: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 6: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 7: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg=" MIG capable: true\n"
time="2022-02-14T19:16:05Z" level=debug msg=" Current MIG mode: Disabled"
time="2022-02-14T19:16:05Z" level=debug msg="Checking current MIG device configuration..."
time="2022-02-14T19:16:05Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 0: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 1: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 2: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 3: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 4: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 5: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 6: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg=" GPU 7: 0x20B210DE"
time="2022-02-14T19:16:05Z" level=debug msg="Running apply-exit hook"
MIG configuration applied successfully
Restarting all GPU clients previously shutdown on the host by restarting their systemd services
Starting nvidia-dcgm.service
Starting nv_peer_mem.service
Starting nvsm-notifier.service
Starting nvsm-api-gateway.service
Starting nvsm-core.service
Starting nvsm-mqtt.service
Starting nvsm.service
Restarting all GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/dgxa100 labeled
Restarting validator pod to re-run all validations
pod "nvidia-operator-validator-p78bm" deleted
Changing the 'nvidia.com/mig.config.state' node label to 'success'
node/dgxa100 labeled
time="2022-02-14T19:17:43Z" level=info msg="Successfuly updated to MIG config: all-disabled"
time="2022-02-14T19:17:43Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
And then the GPUs show up normally on the node, and I can allocate and use them:
$ kubectl describe nodes
[...]
Capacity:
  cpu:                256
  ephemeral-storage:  1843217020Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             2113603344Ki
  nvidia.com/gpu:     8
  pods:               110
[...]
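As a further sanity check (this manifest is just a minimal example; the pod name and CUDA image tag are arbitrary), a plain GPU request schedules and runs fine in this all-disabled state:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1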
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (7 by maintainers)
That works! Thanks, it helped me a lot.
I would only expect an "Insufficient Permissions" error from gpu-feature-discovery if NVIDIA_MIG_MONITOR_DEVICES=all was not set as an environment variable when it was launched. As far as I know, this should be set by the operator though.
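For anyone who wants to verify that on their own cluster, one way (assuming the daemonset keeps the default name gpu-feature-discovery in the gpu-operator namespace, as in this report) is to inspect the pod template for that variable:
$ kubectl get ds -n gpu-operator gpu-feature-discovery -o yaml | grep -A1 NVIDIA_MIG_MONITOR_DEVICES
If it is missing, gpu-feature-discovery cannot enumerate MIG devices and NVML returns the permissions error shown above.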