amazon-eks-ami: Some instance types using incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227

What happened:

I run a p3.2xlarge node group in my 1.29 EKS cluster. I updated the node group to AMI ami-07c8bc6b0bb890e9e (amazon-eks-gpu-node-1.29-v20240227). After the update I was unable to deploy my CUDA containers to the node. I ssh’d into the node and found that nvidia-smi couldn’t communicate with the GPU:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

What you expected to happen:

nvidia-smi should be able to communicate with the Tesla V100 GPU without manual intervention.

How to reproduce it (as minimally and precisely as possible):

Deploy a p3.2xlarge node group on a 1.29 cluster using the latest GPU AMI (amazon-eks-gpu-node-1.29-v20240227).
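For reference, one way to stand up such a node group (a sketch using standard eksctl flags; the cluster and node group names are placeholders):

# Create a single p3.2xlarge node group on an existing 1.29 cluster.
# eksctl should select the GPU (AL2_x86_64_GPU) AMI type for GPU instance types.
eksctl create nodegroup \
  --cluster my-cluster \
  --name gpu-nodes \
  --node-type p3.2xlarge \
  --nodes 1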

Anything else we need to know?:

Environment:

  • AWS Region: us-east-2
  • Instance Type(s): p3.2xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
  • AMI Version: amazon-eks-gpu-node-1.29-v20240227
  • Kernel (e.g. uname -a): Linux ip-10-20-40-96.us-east-2.compute.internal 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0e3ec26ca86336aea"
BUILD_TIME="Tue Feb 27 23:54:40 UTC 2024"
BUILD_KERNEL="5.10.209-198.858.amzn2.x86_64"
ARCH="x86_64"

Everything should work out of the box, but I can fix it manually by removing the default nvidia DKMS tree and reinstalling the DKMS module for the NVIDIA driver version (535.161.07) that this AMI release is supposed to ship:

sudo rm -r /var/lib/dkms/nvidia
sudo dkms install nvidia/535.161.07 --force

Then if I run nvidia-smi I get:

Fri Mar  1 04:33:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:1E.0 Off |                    0 |
| N/A   24C    P0              38W / 300W |      0MiB / 16384MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
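For anyone debugging this, a few quick checks show which NVIDIA kernel module (open vs. proprietary) was actually built and loaded before applying the manual fix above (plain diagnostic commands, not AMI-specific tooling):

# List the NVIDIA module versions DKMS has registered/built for this kernel
sudo dkms status

# Show which nvidia module, if any, is currently loaded
lsmod | grep -i nvidia

# Driver details, if a module is loaded at all
cat /proc/driver/nvidia/version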


Most upvoted comments

Both of the issues mentioned here (incorrect NVIDIA kmod being loaded, race condition between configure-nvidia.service and bootstrap.sh) should be resolved in the latest AMI release, v20240409. 👍

This is probably a better workaround for now. Basically, I take the containerd config that nvidia-ctk would have generated (via the configure-nvidia service) and write it to /etc/eks/containerd/containerd-config.toml, since the bootstrap process uses that file.

Note I’m setting discard_unpacked_layers to false for my use case, which helps ensure the ! cmp -s ... check in the bootstrap.sh script runs its block of code. One caveat is the hardcoded account ID in sandbox_image, which I think would need to be updated per region based on these docs.

---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ...
  userData: |
    cat <<EOF > /etc/eks/containerd/containerd-config.toml
    imports = ["/etc/containerd/config.d/*.toml"]
    root = "/var/lib/containerd"
    state = "/run/containerd"
    version = 2

    [grpc]
      address = "/run/containerd/containerd.sock"

    [plugins]

      [plugins."io.containerd.grpc.v1.cri"]
        sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"

        [plugins."io.containerd.grpc.v1.cri".cni]
          bin_dir = "/opt/cni/bin"
          conf_dir = "/etc/cni/net.d"

        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
          discard_unpacked_layers = false

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
              runtime_type = "io.containerd.runc.v2"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
                BinaryName = "/usr/bin/nvidia-container-runtime"
                SystemdCgroup = true

            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
              runtime_type = "io.containerd.runc.v2"

              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
                SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".registry]
          config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"
    EOF

Ideally I’d be able to build a GPU AMI with a modified bootstrap.sh script, but I can’t figure out where the GPU AMIs are coming from. Doesn’t seem like they’re open source?

I did more testing and found that my workaround fixes the issue only by accident, not for the reason I thought.

What I think is really happening is that configure-nvidia.service completes before the bootstrap.sh process reaches this code:

  if ! cmp -s /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml; then
    sudo cp -v /etc/eks/containerd/containerd-config.toml /etc/containerd/config.toml
    sudo cp -v /etc/eks/containerd/sandbox-image.service /etc/systemd/system/sandbox-image.service
    sudo chown root:root /etc/systemd/system/sandbox-image.service
    systemctl daemon-reload
    systemctl enable containerd sandbox-image
    systemctl restart sandbox-image containerd
  fi

The configure-nvidia service sets nvidia as the runtime in /etc/containerd/config.toml, but because it finishes before the bootstrap process, the bootstrap process overwrites that file, since it differs from /etc/eks/containerd/containerd-config.toml.

So how does my workaround “fix” this? From what I can tell, the || true actually causes the configure-nvidia service to fail after its first startup (I’m not sure why exactly). configure-nvidia is then started again (not sure how, either), and that happens while the sandbox-image service is being restarted by the bootstrap process. Because sandbox-image takes a while to restart, the new containerd config is in place by then, and the containerd service is restarted right after sandbox-image in the bootstrap process.

I’ve stitched together logs from my observations. The “current.txt” doesn’t have my extra userdata.
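If you’d rather not depend on that side effect, a more direct variant of the same idea (just a sketch, not the official fix; it assumes the stock nvidia-ctk and containerd on this AMI) is to re-apply the nvidia runtime config after bootstrap.sh has overwritten /etc/containerd/config.toml, then restart containerd:

# Run this after /etc/eks/bootstrap.sh in userdata (ordering depends on how
# your node group or Karpenter merges userdata).
# Re-add the 'nvidia' runtime to the config bootstrap.sh just copied into
# place, and set it as the default again.
nvidia-ctk runtime configure --runtime=containerd --set-as-default
# Pick up the regenerated /etc/containerd/config.toml
systemctl restart containerd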

I’ve run into the same issue as @korjek. I’m on EKS 1.29 with AMI amazon-eks-gpu-node-1.29-v20240307.

It appears the containerd config.toml is not being updated to use the nvidia runtime. I found configure-nvidia.service and its corresponding script, then tried to run the script, which gave me this output:

/etc/eks/configure-nvidia.sh
+ gpu-ami-util has-nvidia-devices
true
+ /etc/eks/nvidia-kmod-load.sh
true
0x2237 NVIDIA A10G
Disabling GSP for instance type: g5.xlarge
2024-03-15T21:59:42+0000 [kmod-util] unpacking: nvidia-open
Error! nvidia-open-535.161.07 is already added!
Aborting.
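The “already added” error can be confirmed by looking at what DKMS has registered on the node (standard dkms commands, nothing AMI-specific):

# List the NVIDIA module versions DKMS already knows about; the failing
# kmod-util unpack step above is tripping over an existing nvidia-open entry.
sudo dkms status | grep -i nvidia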

As a workaround, I patched /etc/eks/configure-nvidia.sh in my Karpenter userData like so:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: nvidia-a10g
spec:
  # ... bunch of other stuff
  userData: |
    cat <<EOF > /etc/eks/configure-nvidia.sh
    #!/usr/bin/env bash

    set -o errexit
    set -o nounset
    set -o xtrace

    if ! gpu-ami-util has-nvidia-devices; then
      echo >&2 "no NVIDIA devices are present, nothing to do!"
      exit 0
    fi

    # patched with "|| true" to avoid failing on startup
    /etc/eks/nvidia-kmod-load.sh || true

    # add 'nvidia' runtime to containerd config, and set it as the default
    # otherwise, all Pods need to specify the runtimeClassName
    nvidia-ctk runtime configure --runtime=containerd --set-as-default
    EOF
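After the node comes up with this patch, a quick sanity check (basic commands, nothing specific to this AMI) that the nvidia runtime really ended up as the containerd default and the driver is reachable:

# Should print: default_runtime_name = "nvidia"
grep default_runtime_name /etc/containerd/config.toml

# Should report the GPU and driver 535.161.07 without the communication error
nvidia-smi --query-gpu=name,driver_version --format=csv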

This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?