gpu-operator: Failed to initialize NVML: Unknown Error

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRI-O (>= 1.13)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? (Commands for checking these items are sketched right after this list.)
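
For what it's worth, these are the commands I run on the GPU node to go through the checklist (a rough sketch; exact output and tool names vary by distro and container runtime):

grep PRETTY_NAME /etc/os-release                    # node OS and version
kubectl version                                     # client and server Kubernetes versions
docker version --format '{{.Server.Version}}'       # or: crictl version, for CRI-O / containerd
lsmod | grep -E 'i2c_core|ipmi_msghandler'          # kernel modules mentioned in the checklist
kubectl describe clusterpolicies --all-namespaces   # confirms the ClusterPolicy CRD is applied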

2. Issue or feature description

Hi, I’m deploying Kubeflow v1.6.1 along with nvidia/gpu-operator for training DL models. It works well at first, but after a random amount of time (roughly 1-2 days), I can no longer use nvidia-smi to check GPU status. When this happens, it raises:

(base) jovyan@agm-0:~/vol-1$ nvidia-smi
Failed to initialize NVML: Unknown Error

I’m not sure why this happens: training runs without any problem for several epochs, but when I come back the next day the error has appeared. Do you have any idea what might be causing this?
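
In case it is useful, this is a rough check that can be run from inside the affected notebook pod when the error appears (a diagnostic sketch, not a fix; it only shows whether the device nodes are still visible to the container):

# Inside the pod where nvidia-smi fails:
ls -l /dev/nvidia*                 # are the NVIDIA device nodes still listed?
env | grep -i nvidia               # NVIDIA-related environment variables, if any were set
nvidia-smi; echo "exit code: $?"   # reproduces "Failed to initialize NVML: Unknown Error"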

3. Steps to reproduce the issue

This is how I deploy nvidia/gpu-operator:

sudo snap install helm --classic
helm repo add nvidia https://nvidia.github.io/gpu-operator \
  && helm repo update \
  && helm install \
  --version=v22.9.0 \
  --generate-name \
  --create-namespace \
  --namespace=gpu-operator-resources \
  nvidia/gpu-operator \
  --set driver.enabled=false \
  --set devicePlugin.env[0].name=DEVICE_LIST_STRATEGY \
  --set devicePlugin.env[0].value="volume-mounts" \
  --set toolkit.env[0].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED \
  --set-string toolkit.env[0].value=false \
  --set toolkit.env[1].name=ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS \
  --set-string toolkit.env[1].value=true
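
After installing, I check that the operator came up with something like the following (namespace taken from the command above; the pod name in the last line is a placeholder to substitute):

kubectl get pods -n gpu-operator-resources          # operator, toolkit, device-plugin and validator pods
kubectl describe clusterpolicies --all-namespaces   # the ClusterPolicy created by the chart
# Substitute a real GPU pod name scheduled on the affected node:
kubectl -n gpu-operator-resources exec -it <gpu-pod-name> -- nvidia-smi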

Most upvoted comments

I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward.

At least we seem to fully understand the problem now, and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.
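
For reference, a minimal sketch of that manual workaround, assuming the usual /dev/nvidia* and /dev/nvidia-caps/* device nodes (paths and major numbers can differ per driver version); it has to run as root on the host, not inside a pod:

#!/bin/bash
# Create a /dev/char/<major>:<minor> symlink for every NVIDIA character device node.
set -euo pipefail

for dev in /dev/nvidia* /dev/nvidia-caps/*; do
    [ -c "$dev" ] || continue                      # skip directories and unmatched globs
    major=$((16#$(stat -c '%t' "$dev")))           # stat prints the major number in hex
    minor=$((16#$(stat -c '%T' "$dev")))           # ...and the minor number in hex
    ln -sf "$dev" "/dev/char/${major}:${minor}"    # e.g. /dev/char/195:0 -> /dev/nvidia0
done

Newer releases of the NVIDIA Container Toolkit also ship an nvidia-ctk helper that can create these /dev/char symlinks; if your nodes have a recent toolkit, that may be preferable to a hand-rolled script like the one above.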