amazon-eks-ami: Problem with NVIDIA GSP and g4dn, g5, and g5g instances
What happened:
We provisioned a g5.* instance and it booted with the latest AMI release, v20231116.
When we try to run any GPU workloads, the container toolkit (nvidia-container-cli) fails to communicate with the GPU devices. When we shell into the node and run nvidia-smi -q, it really struggles to produce output and a bunch of values show Unknown Error.
Attaching lscpu and nvidia-smi output: lscpu+nvidia-smi.log.txt
Workload runc error:
Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=15676 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/reranker-a10/rootfs] nvidia-container-cli: initialization error: driver error: timed out: unknown
I am reporting this because we have seen similar issues in the last few days with A100 + driver 535 + AMD EPYC configurations elsewhere.
How to reproduce it (as minimally and precisely as possible):
Provision a g5 instance with the latest AMI and run nvidia-smi -q on the host.
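For reference, a minimal diagnostic sketch of the check on the host, assuming a standard NVIDIA 5xx driver install (field names and /proc paths may vary by driver version):

```bash
# On the affected host (after SSM/SSH into the node):

# Full query; on affected nodes this hangs for a long time and many
# fields come back as "Unknown Error".
nvidia-smi -q

# Check whether GSP firmware is in use (field exists on recent drivers).
nvidia-smi -q | grep -i "GSP Firmware"

# Inspect the loaded nvidia kmod parameters, including EnableGpuFirmware.
grep -i firmware /proc/driver/nvidia/params
```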
Environment:
- AWS Region: eu-west-1
- Instance Type(s): g5.8xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.7
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.27 (v1.27.7-eks-4f4795d)
- AMI Version: amazon-eks-gpu-node-1.27-v20231116
- AMI ID: ami-04358af1a6af90875
- Kernel (e.g. uname -a): Linux ip-10-2-53-244.eu-west-1.compute.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0fe9073bb890001f8"
BUILD_TIME="Thu Nov 16 03:14:20 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"
About this issue
- State: open
- Created 7 months ago
- Reactions: 2
- Comments: 31 (14 by maintainers)
In my tests, the kmod param EnableGpuFirmware just doesn’t have any effect with the open kmod. It doesn’t seem to be used anywhere in the code, but it’s 100% possible I’m misreading things: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/3bf16b890caa8fd6b5db08b5c2437b51c758ac9d/kernel-open/nvidia/nv.c#L131-L133

We intend to load the proprietary kmod on g5 types for the time being so that EnableGpuFirmware has the intended effect. That will ship in the next AMI release (the one after v20240315).
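For anyone experimenting outside the AMI, a minimal sketch of how that parameter is typically set (not the AMI's actual implementation; it assumes the proprietary kmod, since per the above the open kmod ignores it):

```bash
# Disable GSP firmware offload via a modprobe option for the nvidia kmod.
# Only takes effect when the module is (re)loaded, e.g. after a reboot.
echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee /etc/modprobe.d/nvidia-gsp.conf

# Check which flavor of the kmod is installed: the proprietary module
# reports license "NVIDIA", the open kernel modules report "Dual MIT/GPL".
modinfo nvidia | grep -i ^license

# After a reboot, confirm the parameter took effect.
grep -i EnableGpuFirmware /proc/driver/nvidia/params
```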
I confirm we have the same problem on g5.4xlarge with the new AMI; the GPU is totally dead. Kubelet:
@bhavitsharma the GPU cards on P2s do not support 5xx-series drivers. The 1.28 GPU AMI has always provided the 535 driver, but starting with release v20231116, the 1.25+ GPU AMIs all ship the 535 driver as well. Therefore, P2s will not work with these AMIs.

The latest release (which will complete today) addresses this issue on Kubernetes 1.29 for g4dn instances by disabling GSP automatically. We’re still working on the right solution for g5 instance types: these instances support EFA and as a result require the open-source NVIDIA kmod, but GSP cannot be disabled on the open-source kmod. We’re following up with EC2 and NVIDIA regarding this issue.

Roughly yes, though we’re only ever assigning one GPU, and I’m not sure whether we’re setting those env vars (it’s not in our pod spec, but maybe we set it within the container image).
This image is a health-checking image that constantly performs a bunch of health checks, including those nvidia-smi commands (largely to catch GPU problems that have bitten us in the past). It’s been a few weeks, but I’m also not sure whether it was caused by a pod restarting, the daemonset pod being recreated/updated, or both.
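For context, a trimmed-down, hypothetical sketch of that kind of health-check loop (not the actual image; the interval, timeout, and specific checks are assumptions):

```bash
#!/usr/bin/env bash
# Hypothetical GPU health-check loop: periodically run nvidia-smi and flag
# GPUs that stop responding or report "Unknown Error".
set -uo pipefail

while true; do
  if ! timeout 30 nvidia-smi -q > /tmp/nvidia-smi.out 2>&1; then
    echo "nvidia-smi timed out or failed; GPU may be unhealthy" >&2
  elif grep -q "Unknown Error" /tmp/nvidia-smi.out; then
    echo "nvidia-smi reported Unknown Error values" >&2
  fi
  sleep 60
done
```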
Sorry for the delay; we’re doing a rework of our NVIDIA setup to address #1494 which has taken priority.
Yes, I expect to get a fix out for this in the next few weeks.
@chiragjn thanks! That certainly looks like the smoking gun. Requiring a reboot puts us in a tough position, and I’m not sure we can do something at runtime before systemd-modules-load runs; that happens very early in the boot process. I’ll see what I can come up with 👍

It’s a layer above what’s here in this repo, @chiragjn (cough! check the license of things cough!)
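For anyone needing a stopgap today, one possible approach based on the discussion above (an assumption, not an official fix, and it only helps where the kmod actually honors the parameter) is to drop the modprobe option in user data and reboot once before workloads land on the node:

```bash
#!/usr/bin/env bash
# Hypothetical EC2 user-data stopgap: write the GSP-disable option and reboot
# once so it is in place before the nvidia kmod is loaded for real workloads.
set -euo pipefail

CONF=/etc/modprobe.d/nvidia-disable-gsp.conf
if [ ! -f "$CONF" ]; then
  echo "options nvidia NVreg_EnableGpuFirmware=0" > "$CONF"
  # systemd-modules-load runs very early, so the option cannot be applied
  # at runtime; a one-time reboot is required for it to take effect.
  systemctl reboot
fi
```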