minikube: nvidia-driver-installer addon fails to start (driver fails to install in the container)

The exact command to reproduce the issue: Following the instructions on https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/

  • Host: Fedora 31
  • Kernel: 5.5.7-200.fc31.x86_64
  • CUDA (host): 10.2
  • NVIDIA driver (host): 440.64

GPUs:

  • Titan V (assigned to the vfio-pci driver)
  • GeForce 1080 Ti (host GPU, nvidia driver)

Minikube start: minikube start --vm-driver kvm2 --kvm-gpu --cpus=12 --memory=25480

Minikube addons: minikube addons enable nvidia-gpu-device-plugin. This fails at first; edit the nvidia-gpu-device-plugin DaemonSet, raise the memory request to 100Mi, and it starts fine.
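For reference, a sketch of that memory bump (this assumes the addon is deployed as the nvidia-gpu-device-plugin DaemonSet shown in the describe output later in this thread; adjust the path if your manifest differs):

# Interactive: change the memory value under resources.requests to 100Mi
kubectl -n kube-system edit daemonset nvidia-gpu-device-plugin

# Or non-interactively, patching the first (and only) container in the pod template:
kubectl -n kube-system patch daemonset nvidia-gpu-device-plugin --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "100Mi"}]'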

The full output of the command that failed:

minikube addons enable nvidia-driver-installer

The container fails to start, and once you fetch the logs, they show:

Configuring kernel sources... DONE
Running Nvidia installer...
/usr/local/nvidia /
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 390.67................................................

ERROR: An error occurred while performing the step: "Building kernel
       modules". See /var/log/nvidia-installer.log for details.


ERROR: An error occurred while performing the step: "Checking to see
       whether the nvidia-drm kernel module was successfully built". See
       /var/log/nvidia-installer.log for details.


ERROR: The nvidia-drm kernel module was not created.


ERROR: The nvidia-drm kernel module failed to build. This kernel module is
       required for the proper operation of DRM-KMS. If you do not need to
       use DRM-KMS, you can try to install this driver package again with
       the '--no-drm' option.


ERROR: Installation has failed.  Please see the file
       '/usr/local/nvidia/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.
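The errors point at nvidia-installer.log inside the install directory. That directory is a hostPath on the minikube node (/home/kubernetes/bin/nvidia in the default addon manifest, per the pod spec later in this thread), so the log can usually still be read after the init container has crashed. A sketch:

# Read the installer log from inside the minikube VM (the hostPath backing /usr/local/nvidia):
minikube ssh -- sudo cat /home/kubernetes/bin/nvidia/nvidia-installer.log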

The output of the minikube logs command:

The minikube logs don’t say much, but here they are: minikube-logs.txt

Logs from the nvidia-installer itself: nvidia-driver-installer-logs.txt

The operating system version: Fedora 31 (kernel 5.5.7-200.fc31.x86_64), as above.

Most upvoted comments

I’m setting up minikube with the kvm2 driver and GPU passthrough as per https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/. GPU passthrough itself works (the PCIe devices show up as hardware inside the minikube VM), but I’m failing to install the NVIDIA drivers inside the VM via

minikube addons enable nvidia-gpu-device-plugin
minikube addons enable nvidia-driver-installer

The nvidia-gpu-device-plugin pod is listed as Running, whereas the second pod is stuck in Init (the pause container never starts). I’m guessing this is the same issue @Nick-Harvey encountered.
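To rule out passthrough itself, a quick sanity check that the GPU really is exposed inside the VM (a sketch; assumes lspci is available in the minikube ISO):

# List PCI devices seen by the minikube VM and filter for NVIDIA hardware
minikube ssh -- lspci -nn | grep -i nvidia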

kubectl get pods -n kube-system

NAME                               READY   STATUS     RESTARTS   AGE
coredns-66bff467f8-562mh           1/1     Running    0          29m
coredns-66bff467f8-kcxhg           1/1     Running    0          29m
etcd-minikube                      1/1     Running    0          28m
kube-apiserver-minikube            1/1     Running    0          28m
kube-controller-manager-minikube   1/1     Running    0          28m
kube-proxy-mb5lf                   1/1     Running    0          29m
kube-scheduler-minikube            1/1     Running    0          28m
nvidia-driver-installer-c5bvd      0/1     Init:0/1   2          3m22s
nvidia-gpu-device-plugin-57n2l     1/1     Running    0          28m
storage-provisioner                1/1     Running    0          28m

kubectl logs nvidia-gpu-device-plugin-57n2l -n kube-system

2020/05/26 14:26:03 Failed to initialize NVML: could not load NVML library.
2020/05/26 14:26:03 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/05/26 14:26:03 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/05/26 14:26:03 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
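Note that the NVML error above is expected as long as the driver install keeps failing: libnvidia-ml ships with the driver, so there is nothing for the plugin to load until the installer succeeds. The runtime hint from the log can still be checked inside the VM, roughly like this:

# Check which runtimes Docker inside the minikube VM knows about and which is the default
minikube ssh -- docker info | grep -i runtime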

Some additional info on the pods:

`kubectl describe po nvidia-gpu-device-plugin-57n2l -n kube-system`
Name:                 nvidia-gpu-device-plugin-57n2l
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 minikube/192.168.39.201
Start Time:           Tue, 26 May 2020 16:25:57 +0200
Labels:               controller-revision-hash=7f89b4b55b
                      k8s-app=nvidia-gpu-device-plugin
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   172.17.0.4
IPs:
  IP:           172.17.0.4
Controlled By:  DaemonSet/nvidia-gpu-device-plugin
Containers:
  nvidia-gpu-device-plugin:
    Container ID:  docker://4babe01395d832acec7db0c6b449da25b5c8dd1114ed0e91c7d80f1255510f44
    Image:         nvidia/k8s-device-plugin:1.0.0-beta4
    Image ID:      docker-pullable://nvidia/k8s-device-plugin@sha256:94d46bf513cbc43c4d77a364e4bbd409d32d89c8e686e12551cc3eb27c259b90
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/nvidia-device-plugin
      -logtostderr
    State:          Running
      Started:      Tue, 26 May 2020 16:26:03 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        50m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pvcvh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  default-token-pvcvh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pvcvh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoExecute
                 :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  34m   default-scheduler  Successfully assigned kube-system/nvidia-gpu-device-plugin-57n2l to minikube
  Normal  Pulling    34m   kubelet, minikube  Pulling image "nvidia/k8s-device-plugin:1.0.0-beta4"
  Normal  Pulled     34m   kubelet, minikube  Successfully pulled image "nvidia/k8s-device-plugin:1.0.0-beta4"
  Normal  Created    34m   kubelet, minikube  Created container nvidia-gpu-device-plugin
  Normal  Started    34m   kubelet, minikube  Started container nvidia-gpu-device-plugin
`kubectl describe po nvidia-driver-installer-c5bvd -n kube-system`
Name:         nvidia-driver-installer-c5bvd
Namespace:    kube-system
Priority:     0
Node:         minikube/192.168.39.201
Start Time:   Tue, 26 May 2020 16:51:20 +0200
Labels:       controller-revision-hash=db985bcbc
              k8s-app=nvidia-driver-installer
              pod-template-generation=1
Annotations:  <none>
Status:       Pending
IP:           172.17.0.7
IPs:
  IP:           172.17.0.7
Controlled By:  DaemonSet/nvidia-driver-installer
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://6e972a48ef014a50b13a12a60a6d0179512d191c24170a5034a04f213eeb8809
    Image:          k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342
    Image ID:       docker-pullable://k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 May 2020 16:53:47 +0200
      Finished:     Tue, 26 May 2020 16:54:47 +0200
    Ready:          False
    Restart Count:  2
    Requests:
      cpu:  150m
    Environment:
      NVIDIA_INSTALL_DIR_HOST:       /home/kubernetes/bin/nvidia
      NVIDIA_INSTALL_DIR_CONTAINER:  /usr/local/nvidia
      ROOT_MOUNT_DIR:                /root
    Mounts:
      /dev from dev (rw)
      /root from root-mount (rw)
      /usr/local/nvidia from nvidia-install-dir-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pvcvh (ro)
Containers:
  pause:
    Container ID:   
    Image:          k8s.gcr.io/pause:2.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pvcvh (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  nvidia-install-dir-host:
    Type:          HostPath (bare host directory volume)
    Path:          /home/kubernetes/bin/nvidia
    HostPathType:  
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  default-token-pvcvh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pvcvh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  4m12s                default-scheduler  Successfully assigned kube-system/nvidia-driver-installer-c5bvd to minikube
  Normal   Pulling    4m11s                kubelet, minikube  Pulling image "k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342"
  Normal   Pulled     4m4s                 kubelet, minikube  Successfully pulled image "k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342"
  Normal   Created    105s (x3 over 4m4s)  kubelet, minikube  Created container nvidia-driver-installer
  Normal   Started    105s (x3 over 4m4s)  kubelet, minikube  Started container nvidia-driver-installer
  Normal   Pulled     105s (x2 over 3m2s)  kubelet, minikube  Container image "k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342" already present on machine
  Warning  BackOff    42s (x4 over 2m)     kubelet, minikube  Back-off restarting failed container

Any suggestions on how to debug this further?
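A couple of things worth pulling while the installer is crash-looping (pod and container names taken from the describe output above) — a sketch:

# Logs from the previous, failed run of the init container
kubectl -n kube-system logs nvidia-driver-installer-c5bvd -c nvidia-driver-installer --previous

# Kernel the driver is being built against inside the VM
# (the bundled 390.67 installer may simply not build against it)
minikube ssh -- uname -r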

Hello, I’ve got a similar issue: it looks like the addon uses a very old NVIDIA driver (390.67), which doesn’t support the RTX 3090/3080. How can I install the addon with a newer NVIDIA driver?
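The thread doesn’t give a confirmed answer. One possible workaround, purely as a sketch, is to point the installer DaemonSet at a different installer image — the tag below is hypothetical; substitute a build that actually bundles the driver version you need, if one exists:

# <newer-tag> is a PLACEHOLDER, not a real tag; check available installer builds first
kubectl -n kube-system patch daemonset nvidia-driver-installer --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/initContainers/0/image", "value": "k8s.gcr.io/minikube-nvidia-driver-installer:<newer-tag>"}]'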