gpu-operator: Failed to initialize NVML: Unknown Error
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn’t apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are you running on an Ubuntu 18.04 node?
- Are you running Kubernetes v1.13+?
- Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
- Do you have i2c_core and ipmi_msghandler loaded on the nodes? (see the quick checks below)
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
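A minimal way to verify the last two checklist items, assuming you can run commands on the GPU node and have kubectl access to the cluster:

lsmod | grep -E 'i2c_core|ipmi_msghandler'          # are the kernel modules loaded on the node?
kubectl describe clusterpolicies --all-namespaces   # is the ClusterPolicy CRD applied?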
1. Issue or feature description
Hi, I’m deploying Kubeflow v1.6.1 along with nvidia/gpu-operator for training DL models. It works great, but after a random amount of time (maybe 1-2 days, I guess) I can no longer use nvidia-smi to check GPU status. When this happens, it raises:
(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error
I’m not sure why this happens: training runs without any problem for several epochs, and when I come back the next day this error appears. Do you have any idea?
2. Steps to reproduce the issue
This is how I deploy nvidia/gpu-operator:
sudo snap install helm --classic
helm repo add nvidia https://nvidia.github.io/gpu-operator \
&& helm repo update \
&& helm install \
--version=v22.9.0 \
--generate-name \
--create-namespace \
--namespace=gpu-operator-resources \
nvidia/gpu-operator \
--set driver.enabled=false \
--set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
--set devicePlugin.env[0].value="volume-mounts" \
--set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
--set-string toolkit.env[0].value=false \
--set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
--set-string toolkit.env[1].value=true
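After the install, something like the following should show the operator and its operand pods coming up (the namespace matches the --namespace flag used above):

kubectl get pods -n gpu-operator-resources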
About this issue
- State: open
- Created 2 years ago
- Comments: 27 (10 by maintainers)
Links to this issue
Commits related to this issue
- Create all /dev/char symlinks in driver validator The existence of these symlinks is required to address the following bug: https://github.com/NVIDIA/gpu-operator/issues/430 This bug impacts containe... — committed to NVIDIA/gpu-operator by deleted user a year ago
I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward.
At least we seem to fully understand the problem now, and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.
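For reference, a minimal sketch of that manual workaround, assuming the NVIDIA character devices already exist under /dev on the node (the /dev/nvidia-caps glob is illustrative and may not exist on every system; newer NVIDIA Container Toolkit releases also ship an nvidia-ctk helper that can create these symlinks for you, if your toolkit version provides it):

# Run as root (bash) on the affected node. For each NVIDIA character device,
# create the /dev/char/<major>:<minor> symlink the container runtime expects,
# e.g. /dev/char/195:0 -> /dev/nvidia0.
for dev in /dev/nvidia* /dev/nvidia-caps/*; do
  [ -c "$dev" ] || continue                    # skip unmatched globs / non-char devices
  major=$((16#$(stat -c '%t' "$dev")))         # stat reports major/minor in hex
  minor=$((16#$(stat -c '%T' "$dev")))
  ln -sf "$dev" "/dev/char/${major}:${minor}"
done

The symlinks live in /dev and do not survive a reboot, so the workaround has to be reapplied (or hooked into boot) until the operator creates them automatically.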