gpu-operator: CentOS 7. nvidia-driver pod "Could not resolve Linux kernel version"

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node? No, CentOS 7.8
  • Are you running Kubernetes v1.13+? v1.18
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? Docker 20.10.3
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes? (See the check just after this list.)
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
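
A quick way to run the module check from the list above on each node (a minimal sketch; only the two module names come from the checklist):

    # Verify the required kernel modules are loaded; load any that are missing
    for mod in i2c_core ipmi_msghandler; do
        lsmod | grep -q "^${mod} " || sudo modprobe "${mod}"
    done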

1. Issue or feature description

I get an error when the nvidia-driver pod tries to install the driver on CentOS 7. This is the log:

========== NVIDIA Software Installer ==========
Starting installation of NVIDIA driver version 450.80.02 for Linux kernel version 3.10.0-862.el7.x86_64
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Unable to open the file '/lib/modules/3.10.0-862.el7.x86_64/proc/version' (No such file or directory).
Could not resolve Linux kernel version
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

I see the same error in #97, but disabling nouveau as suggested there did not resolve it. I'm using gpu-operator v1.5.2. Please help me resolve this error. Thanks.
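
As the comments below work out, the driver container resolves the kernel version by asking yum for matching kernel-headers, so this failure usually means that no repository visible inside the container ships headers for the running kernel. A quick check to run on the node (a minimal sketch, assuming an x86_64 kernel and the standard CentOS repos):

    # The running kernel, e.g. 3.10.0-862.el7.x86_64
    uname -r
    # An empty result means the nvidia-driver script cannot resolve the version
    yum -q list available --show-duplicates kernel-headers \
        | grep "$(uname -r | sed 's/\.x86_64$//')"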

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 7
  • Comments: 18 (4 by maintainers)

Most upvoted comments

To better understand which script is running, could you tell us which image the driver Pod is running:

kubectl get pods -n gpu-operator-resources -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{", "}{end}{end}' | grep nvidia-driver-daemonset
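
On a stock v1.5.x install this prints something like the following (the pod-name suffix here is made up):

    nvidia-driver-daemonset-4fk2x:	nvcr.io/nvidia/driver:450.80.02-centos7,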

I guess it should be executing this script: https://gitlab.com/nvidia/container-images/driver/-/blob/master/centos7/nvidia-driver

    echo "Resolving Linux kernel version..."
    if [ -z "${version}" ]; then
        echo "Could not resolve Linux kernel version" >&2
        return 1
    fi
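
For context, the version tested by that if comes from a yum query a few lines earlier in the script; reconstructed here from the diff quoted further down (the awk/tail post-processing is an assumption, not verbatim from the script):

    echo "Resolving Linux kernel version..."
    # Newest kernel-headers version any configured repo provides --
    # the exact line the kernel-ml patch below replaces
    local version=$(yum -q list available --show-duplicates kernel-headers |
        awk -v arch=$(uname -m) 'NR>1 {print $2"."arch}' | tail -1)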

but the error message doesn't say so:

Resolving Linux kernel version...
Unable to open the file '/lib/modules/3.10.0-862.el7.x86_64/proc/version' (No such file or directory).
Could not resolve Linux kernel version

My cluster runs on CentOS 7.6 with an upgraded kernel, 4.19.12-1.el7:

# rpm -qa | grep kernel-ml
kernel-ml-4.19.12-1.el7.elrepo.x86_64

By replacing kernel with kernel-ml in the nvidia-driver script and rebuilding the image, then using the modified image nvcr.io/nvidia/mldriver:460.32.03-centos7, I could get the nvidia-driver-daemonset working:

# docker build -t  nvcr.io/nvidia/mldriver:460.32.03-centos7 .
# cat Dockerfile 
FROM nvcr.io/nvidia/driver:460.32.03-centos7
COPY nvidia-driver /usr/local/bin
# diff nvidia-driver nvidia-driver.orig 
27c27
<     local version=$(yum -q list available --show-duplicates kernel-ml-headers |
---
>     local version=$(yum -q list available --show-duplicates kernel-headers |
50,52c50,51
<     echo "Installing Linux kernel ml headers..."
<     rpm -e --nodeps kernel-headers
<     yum -q -y install kernel-ml-headers-${KERNEL_VERSION} kernel-ml-devel-${KERNEL_VERSION} > /dev/null
---
>     echo "Installing Linux kernel headers..."
>     yum -q -y install kernel-headers-${KERNEL_VERSION} kernel-devel-${KERNEL_VERSION} > /dev/null
56c55
<     curl -fsSL $(repoquery --location kernel-ml-${KERNEL_VERSION}) | rpm2cpio | cpio -idm --quiet
---
>     curl -fsSL $(repoquery --location kernel-${KERNEL_VERSION}) | rpm2cpio | cpio -idm --quiet
390a390
> 

After building the image, you have to manually replace the image tags in values.yml.
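
Equivalently, the same override can be applied with --set flags instead of editing values.yml (chart keys per the v1.x gpu-operator Helm chart; the release name is hypothetical):

    # Point the operator at the rebuilt driver image
    helm upgrade gpu-operator nvidia/gpu-operator \
        --set driver.repository=nvcr.io/nvidia \
        --set driver.image=mldriver \
        --set driver.version="460.32.03"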