gpu-operator: error "CUDA driver version is insufficient for CUDA runtime version" in v22.9.0

This issue still reproduces with gpu-operator v22.9.0.

kubectl --kubeconfig /work/k3s.yaml -n hsc-gpu logs cuda-vectoradd

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
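
This error usually means the container never had the driver libraries injected, i.e. it fell back to the plain runc runtime. As a first check (a sketch, assuming the operator and its driver daemonset run in the gpu-operator namespace under the default name nvidia-driver-daemonset; adjust to your install), confirm that the operator-managed driver itself sees the GPU:

# Namespace and daemonset name are assumptions; nvidia-smi here should report
# driver 515.65.01 and the Tesla P100 if the driver container is healthy.
kubectl --kubeconfig /work/k3s.yaml -n gpu-operator exec ds/nvidia-driver-daemonset -- nvidia-smi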

Environment information

OS Version: Red Hat Enterprise Linux release 8.4
kernel: 4.18.0-305.el8.x86_64

K3S Version: v1.24.3+k3s1

GPU Operator Version: v22.9.0
CUDA Version: 11.7.1-base-ubi8

Driver Pre-installed: No
Driver Version: 515.65.01-rhel8.4

Container-Toolkit Pre-installed: No
Container-Toolkit Version: v1.11.0-ubi8

GPU Type: Tesla P100
cuda-sample: cuda-sample:vectoradd-cuda11.7.1-ubi8
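
For completeness, a quick way to check that the operator components themselves came up (namespace again assumed to be gpu-operator):

# Expect driver, container-toolkit, device-plugin and validator pods in
# Running/Completed state before testing the cuda-sample workload.
kubectl --kubeconfig /work/k3s.yaml -n gpu-operator get pods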

config.toml content (cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml):

accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

On the host, the directory /etc/nvidia-container-runtime/host-files-for-container.d does not exist.
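
As far as I understand, that missing directory is expected: the csv mount-spec-path is only consulted in csv mode, and with mode = "auto" on an x86 host the runtime goes through nvidia-container-cli instead. On k3s the more useful check is whether the nvidia runtime was actually registered in the containerd config that k3s generates (a sketch, assuming a default k3s install; paths may differ):

# k3s keeps its own containerd config; the toolkit container should have added
# a "nvidia" runtime entry here.
grep -B 2 -A 5 'nvidia' /var/lib/rancher/k3s/agent/etc/containerd/config.toml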

cuda-vectoradd pod yaml

cat << EOF | kubectl --kubeconfig /work/k3s.yaml create -n hsc-gpu -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia  # <<<<<<<<<
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8"
    resources:
      limits:
         nvidia.com/gpu: 1
EOF

When I add runtimeClassName: nvidia to the Pod spec, it works.
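
That matches what the operator expects on k3s: the toolkit registers an nvidia runtime and RuntimeClass, but plain pods keep using the default runtime unless told otherwise. If adding runtimeClassName to every workload is not practical, one option is to point the toolkit at k3s's containerd and make nvidia the default runtime at install time. This is only a sketch based on the toolkit's CONTAINERD_* environment variables; please verify the exact paths and chart values against the gpu-operator documentation for your version:

# Assumes a default k3s layout for the containerd config template and socket.
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value=true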

issue: https://github.com/NVIDIA/gpu-operator/issues/408

Does the gpu-operator support a k3s cluster environment?

@shivamerla @cdesiniotis Could you please help me out? Thank you very much.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

I had the same error with the GPU Operator example, but with the following example everything works fine:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
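
For reference, runtimeClassName: nvidia in both examples refers to a RuntimeClass object that the operator normally creates. If in doubt, it can be checked, or created by hand, with something like the following sketch (the handler name must match the runtime name registered in containerd):

kubectl get runtimeclass nvidia

# Only needed if the RuntimeClass is missing from the cluster.
cat << EOF | kubectl create -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF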