amazon-eks-ami: Some instance types using incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227

What happened:

I run a p3.2xlarge node group in my 1.29 EKS cluster. I updated the node group to AMI ami-07c8bc6b0bb890e9e (amazon-eks-gpu-node-1.29-v20240227). After the update I was unable to deploy my CUDA containers to the node. I ssh’d into the node and found that nvidia-smi couldn’t communicate with the GPU:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

What you expected to happen:

nvidia-smi should be able to communicate with the Tesla V100 GPU without manual intervention.

How to reproduce it (as minimally and precisely as possible):

Deploy a p3.2xlarge node group on a 1.29 cluster using the latest GPU AMI (amazon-eks-gpu-node-1.29-v20240227).
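For reference, one way to stand up such a node group (a sketch using standard eksctl flags; the cluster and node group names are placeholders):

# Create a single p3.2xlarge node group on an existing 1.29 cluster.
# eksctl should select the GPU (AL2_x86_64_GPU) AMI type for GPU instance types.
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-nodes \
  --node-type p3.2xlarge \
  --nodes 1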

Anything else we need to know?:

Environment:

  • AWS Region: us-east-2
  • Instance Type(s): p3.2xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
  • AMI Version: amazon-eks-gpu-node-1.29-v20240227
  • Kernel (e.g. uname -a): Linux ip-10-20-40-96.us-east-2.compute.internal 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0e3ec26ca86336aea"
BUILD_TIME="Tue Feb 27 23:54:40 UTC 2024"
BUILD_KERNEL="5.10.209-198.858.amzn2.x86_64"
ARCH="x86_64"

Everything should work out of the box, but I can fix it manually by removing the default nvidia DKMS tree and reinstalling the DKMS module for the NVIDIA driver version (535.161.07) that this AMI release is supposed to ship:

sudo rm -r /var/lib/dkms/nvidia
sudo dkms install nvidia/535.161.07 --force

Then if I run nvidia-smi I get:

Fri Mar  1 04:33:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P0              38W / 300W |      0MiB / 16384MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
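For anyone debugging this, a few quick checks show which NVIDIA kernel module (open vs. proprietary) was actually built and loaded before applying the manual fix above (plain diagnostic commands, not AMI-specific tooling):

# List the NVIDIA module versions DKMS has registered/built for this kernel
sudo dkms status

# Show which nvidia module, if any, is currently loaded
lsmod | grep -i nvidia

# Driver details, if a module is loaded at all
cat /proc/driver/nvidia/version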


Most upvoted comments

Both of the issues mentioned here (incorrect NVIDIA kmod being loaded, race condition between configure-nvidia.service and bootstrap.sh) should be resolved in the latest AMI release, v20240409. 👍

This is probably a better workaround for now. Basically, I take the containerd config that nvidia-ctk would have generated (via the configure-nvidia service) and write it to /etc/eks/containerd/containerd-config.toml, since the bootstrap process uses that file.

Note I’m setting discard_unpacked_layers to false for my use case, which helps ensure the ! cmp -s ... check in the bootstrap.sh script runs its block of code. One caveat is the hardcoded account ID in sandbox_image, which I think would need to be updated per region based on these docs.

---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ...
  userData: |
    cat <<EOF > /etc/eks/containerd/containerd-config.toml
    imports = ["/etc/containerd/config.d/*.toml"]
    root = "/var/lib/containerd"
    state = "/run/containerd"
    version = 2

    [grpc]
      address = "/run/containerd/containerd.sock"

    [plugins]

      [plugins."io.containerd.grpc.v1.cri"]
        sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

        [plugins."io.containerd.grpc.v1.cri".cni]
          bin_dir = "/opt/cni/bin"
          conf_dir = "/etc/cni/net.d"

        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
          discard_unpacked_layers = false

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              runtime_type = "io.containerd.runc.v2"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
                SystemdCgroup = true

            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
              runtime_type = "io.containerd.runc.v2"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".registry]
          config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
    EOF

Ideally I’d be able to build a GPU AMI with a modified bootstrap.sh script, but I can’t figure out where the GPU AMIs are coming from. Doesn’t seem like they’re open source?

I did more testing and found that my workaround fixes the issue only by accident, not for the reason I thought.

What I think is really happening is that configure-nvidia.service completes before the bootstrap.sh process reaches this code:

  if ! cmp -s /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml; then
    sudo cp -v /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml
    sudo cp -v /etc/eks/containerd/sandbox-image.service /etc/systemd/system/sandbox-image.service
    sudo chown root:root /etc/systemd/system/sandbox-image.service
    systemctl daemon-reload
    systemctl enable containerd sandbox-image
    systemctl restart sandbox-image containerd
  fi

The configure-nvidia service sets nvidia as the runtime in /etc/containerd/config.toml, but because it finishes before the bootstrap process, the bootstrap process overwrites that file, since it differs from /etc/eks/containerd/containerd-config.toml.

So how does my workaround “fix” this? From what I can tell, the || true actually causes the configure-nvidia service to fail after its first startup (I’m not sure why exactly). configure-nvidia is then started again (not sure how, either), and that happens while the sandbox-image service is being restarted by the bootstrap process. Because sandbox-image takes a while to restart, the new containerd config is in place by then, and the containerd service is restarted right after sandbox-image in the bootstrap process.

I’ve stitched together logs from my observations. The “current.txt” doesn’t have my extra userdata.
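If you’d rather not depend on that side effect, a more direct variant of the same idea (just a sketch, not the official fix; it assumes the stock nvidia-ctk and containerd on this AMI) is to re-apply the nvidia runtime config after bootstrap.sh has overwritten /etc/containerd/config.toml, then restart containerd:

# Run this after /etc/eks/bootstrap.sh in userdata (ordering depends on how
# your node group or Karpenter merges userdata).
# Re-add the 'nvidia' runtime to the config bootstrap.sh just copied into
# place, and set it as the default again.
nvidia-ctk runtime configure --runtime=containerd --set-as-default
# Pick up the regenerated /etc/containerd/config.toml
systemctl restart containerd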

I’ve run into the same issue as @korjek. I’m on EKS 1.29 with AMI amazon-eks-gpu-node-1.29-v20240307.

It appears the containerd config.toml is not being updated to use the nvidia runtime. I found configure-nvidia.service and its corresponding script, then tried to run the script, which gave me this output:

/etc/eks/configure-nvidia.sh
+ gpu-ami-util has-nvidia-devices
true
+ /etc/eks/nvidia-kmod-load.sh
true
0x2237 NVIDIA A10G
Disabling GSP for instance type: g5.xlarge
2024-03-15T21:59:42+0000 [kmod-util] unpacking: nvidia-open
Error! nvidia-open-535.161.07 is already added!
Aborting.
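The “already added” error can be confirmed by looking at what DKMS has registered on the node (standard dkms commands, nothing AMI-specific):

# List the NVIDIA module versions DKMS already knows about; the failing
# kmod-util unpack step above is tripping over an existing nvidia-open entry.
sudo dkms status | grep -i nvidia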

As a workaround, I patched /etc/eks/configure-nvidia.sh in my Karpenter userData like so:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ... bunch of other stuff
  userData: |
    cat <<EOF > /etc/eks/configure-nvidia.sh
    #!/usr/bin/env bash

    set -o errexit
    set -o nounset
    set -o xtrace

    if ! gpu-ami-util has-nvidia-devices; then
      echo >&2 "no NVIDIA devices are present, nothing to do!"
      exit 0
    fi

    # patched with "|| true" to avoid failing on startup
    /etc/eks/nvidia-kmod-load.sh || true

    # add 'nvidia' runtime to containerd config, and set it as the default
    # otherwise, all Pods need to specify the runtimeClassName
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    EOF
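After the node comes up with this patch, a quick sanity check (basic commands, nothing specific to this AMI) that the nvidia runtime really ended up as the containerd default and the driver is reachable:

# Should print: default_runtime_name = "nvidia"
grep default_runtime_name /etc/containerd/config.toml

# Should report the GPU and driver 535.161.07 without the communication error
nvidia-smi --query-gpu=name,driver_version --format=csv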

This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?