gpu-operator: Operator breaks on hosts that already have drivers installed
For context, we’re hoping to use the operator with MicroK8s, in order to remove the need to package the nvidia-container-toolkit and template the containerd config file.
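For reference, this is roughly the manual MicroK8s setup we are trying to avoid; the package name and template path below are assumptions based on a typical MicroK8s install, not something taken from this issue:

# Install the container toolkit on the host (assumed package name).
sudo apt-get install -y nvidia-container-toolkit
# MicroK8s renders containerd's config from this template; an "nvidia"
# runtime entry would have to be added to it by hand.
sudo vi /var/snap/microk8s/current/args/containerd-template.toml
sudo snap restart microk8s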
The issue we see is that, when the host already has the NVIDIA drivers installed, the nvidia-driver-daemonset
pod fails to start. The logs contain:
========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 5.4.0-1034-aws
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Could not unload NVIDIA driver kernel modules, driver is in use
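To confirm the failure mode on the host, standard driver tooling (not commands from this thread) shows that the pre-installed driver is loaded and in use, which is why the daemonset cannot unload it:

# List the loaded NVIDIA kernel modules and their reference counts.
lsmod | grep ^nvidia
# If the driver is in use by a display or another process, nvidia-smi
# lists the processes holding it.
nvidia-smi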
Would it be possible for the operator to check for the presence of existing drivers and use them instead of trying to re-install them?
About this issue
- State: closed
- Created 3 years ago
- Reactions: 7
- Comments: 16 (7 by maintainers)
@joedborg Yes, we have this planned for the next release of the driver.
Sorry, the release has been pushed out by two weeks due to the addition of other important features required for MIG configurations.
Hello @shivamerla, are there any updates on this? There have been two gpu-operator releases since your comment (v1.5.1 and v1.5.2), but the issue persists. Is there an estimated date for this? Alternatively, is there a workaround or hack we can deploy temporarily until the fix lands? I am blocked by this issue for the same reasons as @joedborg: trying to run gpu-operator with K3s/MicroK8s on my dev machine with the drivers preinstalled, which I need for my display.
@shpwrck @jabstone sorry for the delay. Since there are no official images yet with these changes, you need to use CI builds to try this out. Make sure the nvidia kernel modules are still loaded, then run:

kubectl delete crd clusterpolicies.nvidia.com

helm install gpu-operator deployments/gpu-operator \
  --set operator.registry=registry.gitlab.com/nvidia/kubernetes \
  --set operator.version=1.6.2-31-g2345a5c \
  --set toolkit.registry=registry.gitlab.com/nvidia/container-toolkit/container-config/staging \
  --set toolkit.version=22225e5d-ubuntu18.04 \
  --set driver.enabled=false \
  --set operator.defaultRuntime=containerd

Please let me know if you face any issues.
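As a quick check afterwards, assuming the operator's default gpu-operator-resources namespace (an assumption, not stated in this thread), you can verify that no driver daemonset pods are scheduled and that the remaining operands come up against the host driver:

# Confirm the nvidia-driver-daemonset is absent and the other operands are running.
kubectl get daemonsets -n gpu-operator-resources
kubectl get pods -n gpu-operator-resources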
@jabstone We are currently testing changes for this: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/202
Since this has a dependency on the container-toolkit release as well, I will work with PMs to get an ETA for releasing this.