amazon-eks-ami: Some instance types using incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227
What happened:
I run a p3.2xlarge node group in my 1.29 EKS cluster. I updated the node group's AMI to ami-07c8bc6b0bb890e9e (amazon-eks-gpu-node-1.29-v20240227). After the update I was unable to deploy my CUDA containers to the node. I ssh'd into the node and found that nvidia-smi couldn't communicate with the GPU:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
What you expected to happen:
Should be able to communicate with the Tesla GPU without manual intervention
How to reproduce it (as minimally and precisely as possible):
Deploy a p3.2xlarge node on a 1.29 cluster using the latest AMI image.
Anything else we need to know?:
Environment:
- AWS Region: us-east-2
- Instance Type(s): p3.2xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.29
- AMI Version: amazon-eks-gpu-node-1.29-v20240227
- Kernel (e.g. uname -a): Linux ip-10-20-40-96.us-east-2.compute.internal 5.10.209-198.858.amzn2.x86_64 #1 SMP Tue Feb 13 18:46:41 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0e3ec26ca86336aea"
BUILD_TIME="Tue Feb 27 23:54:40 UTC 2024"
BUILD_KERNEL="5.10.209-198.858.amzn2.x86_64"
ARCH="x86_64"
Everything should work out of the box, but I can fix this manually by removing the default nvidia DKMS files and reinstalling the DKMS module for the NVIDIA driver version this latest AMI release purportedly supports (a scripted version of this fix is sketched after the nvidia-smi output below):
sudo rm -r /var/lib/dkms/nvidia
sudo dkms install nvidia/535.161.07 --force
Then if I run nvidia-smi I get:
Fri Mar 1 04:33:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:1E.0 Off | 0 |
| N/A 24C P0 38W / 300W | 0MiB / 16384MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
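For anyone who needs to automate this until a fixed AMI is available, a minimal userdata sketch of the same manual fix might look like the following; the driver version is the one stated in this report, and everything else is an assumption rather than something confirmed by the AMI maintainers:

#!/usr/bin/env bash
# Sketch only: automates the manual DKMS fix above. Assumes the AMI ships the
# NVIDIA driver sources for 535.161.07; adjust the version if yours differs.
set -euo pipefail

DRIVER_VERSION="535.161.07"

# Remove the stale DKMS state left over from the image build...
rm -rf /var/lib/dkms/nvidia

# ...and rebuild/install the module against the running kernel.
dkms install "nvidia/${DRIVER_VERSION}" --force

# Quick sanity check that the driver can now talk to the GPU.
nvidia-smi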
About this issue
- State: closed
- Created 4 months ago
- Reactions: 3
- Comments: 16 (5 by maintainers)
Both of the issues mentioned here (incorrect NVIDIA kmod being loaded, race condition between configure-nvidia.service and bootstrap.sh) should be resolved in the latest AMI release, v20240409. 👍

This is probably a better workaround for now. Basically I'm taking the would-be nvidia-ctk-generated containerd config (from the configure-nvidia service) and writing it to /etc/eks/containerd/containerd-config.toml, knowing that the bootstrap process uses it. Note I'm setting discard_unpacked_layers to false for my use case, which helps make sure the ! cmp -s ... check in the bootstrap.sh script runs its block of code. One caveat is the hardcoded account ID for the sandbox_image, which I think would need to be updated based on these docs. Ideally I'd be able to build a GPU AMI with a modified bootstrap.sh script, but I can't figure out where the GPU AMIs are coming from. Doesn't seem like they're open source?
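The exact userdata isn't reproduced above, but a rough sketch of that approach, run from userdata before bootstrap.sh, could look like the following. The nvidia-ctk flags (--config, --set-as-default) and the presence of a discard_unpacked_layers = true line in the stock file are assumptions, not something confirmed in this thread:

#!/usr/bin/env bash
# Sketch of the "write the nvidia config into the bootstrap source file" idea.
# Run before bootstrap.sh so the bootstrap process starts from a config that
# already contains the nvidia runtime instead of clobbering it afterwards.
set -euo pipefail

CONFIG=/etc/eks/containerd/containerd-config.toml

# Inject the nvidia runtime into the bootstrap source config and make it the
# default runtime (assumes this toolkit version supports --config/--set-as-default).
nvidia-ctk runtime configure \
  --runtime=containerd \
  --config="$CONFIG" \
  --set-as-default

# Keep unpacked layers, as described above; this is a no-op if the stock file
# doesn't set discard_unpacked_layers at all.
sed -i 's/discard_unpacked_layers = true/discard_unpacked_layers = false/' "$CONFIG"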
I did more testing and found that my workaround accidentally and incorrectly fixes the issue.

What I think is really happening is that configure-nvidia.service is completing before the bootstrap.sh process gets to this code here. The configure-nvidia service sets nvidia as the runtime in /etc/containerd/config.toml, but because it finishes before the bootstrap process, the bootstrap process overwrites this file because it's different from /etc/eks/containerd/containerd-config.toml.

So how does my workaround "fix" this? From what I can tell, the || true actually causes the configure-nvidia service to fail after its first startup (not sure why exactly). Then configure-nvidia is started again (not sure how, either). This happens while the sandbox-image service is being restarted in the bootstrap process. And because the sandbox-image service takes a bit to restart, the new containerd config is in place by then, and finally the containerd service is restarted right after sandbox-image in the bootstrap process.

I've stitched together logs from my observations. The "current.txt" doesn't have my extra userdata.
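Given that race, a less timing-dependent variant (a sketch, not something posted in this thread) would be to re-apply the NVIDIA runtime configuration after bootstrap.sh has written the final /etc/containerd/config.toml, so nothing can overwrite it later; the nvidia-ctk flags used here are again assumed to exist in the toolkit version on the AMI:

#!/usr/bin/env bash
# Sketch: run from userdata *after* the bootstrap.sh invocation.
set -euo pipefail

# Regenerate the nvidia runtime section in the live containerd config and make
# it the default runtime (assumes nvidia-ctk supports --set-as-default).
nvidia-ctk runtime configure \
  --runtime=containerd \
  --config=/etc/containerd/config.toml \
  --set-as-default

# Restart containerd so newly scheduled pods get the nvidia runtime.
systemctl restart containerd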
I've run into the same issue as @korjek. I'm on EKS 1.29 with AMI amazon-eks-gpu-node-1.29-v20240307. It appears the containerd config.toml is not being updated to use the nvidia runtime. I found configure-nvidia.service and its corresponding script, then tried to run it, which gave me this output. As a workaround, I patched /etc/eks/configure-nvidia.sh in my Karpenter userdata like so.

This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?
/etc/eks/configure-nvidia.shin my Karpenter userdata like so.This issue should be fixed in https://github.com/awslabs/amazon-eks-ami/releases/tag/v20240307. What release are you using?