gpu-operator: Operator breaks on hosts that already have drivers installed
For context, we’re hoping to use the operator with MicroK8s, in order to remove the need to package the nvidia-container-toolkit and template the containerd config file.
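For reference, this is roughly the manual MicroK8s setup we are trying to avoid; the package name and template path below are assumptions based on a typical MicroK8s install, not something taken from this issue:

# Install the container toolkit on the host (assumed package name).
sudo apt-get install -y nvidia-container-toolkit
# MicroK8s renders containerd's config from this template; an "nvidia"
# runtime entry would have to be added to it by hand.
sudo vi /var/snap/microk8s/current/args/containerd-template.toml
sudo snap restart microk8s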
The issue we see is that, when the host already has the NVIDIA drivers installed, the nvidia-driver-daemonset
pod fails to start. The logs contain:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.4.0-1034-aws
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
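To confirm the failure mode on the host, standard driver tooling (not commands from this thread) shows that the pre-installed driver is loaded and in use, which is why the daemonset cannot unload it:

# List the loaded NVIDIA kernel modules and their reference counts.
lsmod | grep ^nvidia
# If the driver is in use by a display or another process, nvidia-smi
# lists the processes holding it.
nvidia-smi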
Would it be possible for the operator to check for the presence of existing drivers and use them instead of trying to re-install them?
About this issue
- State: closed
- Created 3 years ago
- Reactions: 7
- Comments: 16 (7 by maintainers)
@joedborg Yes, we have this planned for the next release of the driver.
Sorry, the release has been pushed out by two weeks due to the addition of other important features required for MIG configurations.
Hello @shivamerla, are there any updates on this? There have been two gpu-operator releases since your comment (v1.5.1 and v1.5.2), but the issue persists. Is there an estimated date for this? Alternatively, is there a workaround or hack we can deploy temporarily until the fix lands? I am blocked by this issue for the same reasons as @joedborg: trying to run gpu-operator with K3s/MicroK8s on my dev machine with the drivers preinstalled, which I need for my display.
@shpwrck @jabstone sorry for the delay. Since there are no official images yet with these changes, you need to use CI builds to try this out. Make sure the nvidia kernel modules are still loaded, then run:

kubectl delete crd clusterpolicies.nvidia.com

helm install gpu-operator deployments/gpu-operator \
  --set operator.registry=registry.gitlab.com/nvidia/kubernetes \
  --set operator.version=1.6.2-31-g2345a5c \
  --set toolkit.registry=registry.gitlab.com/nvidia/container-toolkit/container-config/staging \
  --set toolkit.version=22225e5d-ubuntu18.04 \
  --set driver.enabled=false \
  --set operator.defaultRuntime=containerd

Please let me know if you face any issues.
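As a quick check afterwards, assuming the operator's default gpu-operator-resources namespace (an assumption, not stated in this thread), you can verify that no driver daemonset pods are scheduled and that the remaining operands come up against the host driver:

# Confirm the nvidia-driver-daemonset is absent and the other operands are running.
kubectl get daemonsets -n gpu-operator-resources
kubectl get pods -n gpu-operator-resources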
@jabstone We are currently testing changes for this: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/202
Since this has a dependency on the container-toolkit release as well, I will work with PMs to get an ETA for releasing this.