amazon-eks-ami: Problem with NVIDIA GSP and g4dn, g5, and g5g instances

What happened:

We provisioned a g5.* instance booted with the latest AMI release, v20231116. When we try to run any GPU workloads, the container toolkit (CLI) fails to communicate with the GPU devices. When we shell into the node and run nvidia-smi -q, it takes a very long time to produce output and a bunch of values come back as Unknown Error.

Attaching lscpu and nvidia-smi logs: lscpu+nvidia-smi.log.txt

Workload runc errors

Error: failed to create containerd task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 --pid=15676 /run/containerd/io.containerd.runtime.v1.linux/k8s.io/reranker-a10/rootfs] nvidia-container-cli: initialization error: driver error: timed out: unknown

I am reporting this because we have seen similar issues in the last few days with A100 + driver 535 + AMD EPYC configurations elsewhere.

How to reproduce it (as minimally and precisely as possible): Provision a g5 instance with the latest AMI and run nvidia-smi -q on the host.
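
If it helps, here's the quick host-side check we ended up using (just a heuristic, nothing authoritative; assumes you can SSH/SSM into the node):

$ time nvidia-smi -q | grep -c "Unknown Error"
# healthy node: prints 0 and returns almost instantly
# affected node: prints a large count and the command can take a very long time
$ dmesg | grep -iE "nvrm|gsp"
# on affected nodes this may show NVRM messages about GSP RPC timeouts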

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): g5.8xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.7
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.27 (v1.27.7-eks-4f4795d)
  • AMI Version: amazon-eks-gpu-node-1.27-v20231116
  • AMI ID: ami-04358af1a6af90875
  • Kernel (e.g. uname -a): Linux ip-10-2-53-244.eu-west-1.compute.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0fe9073bb890001f8"
BUILD_TIME="Thu Nov 16 03:14:20 UTC 2023"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"

About this issue

  • State: open
  • Created 7 months ago
  • Reactions: 2
  • Comments: 31 (14 by maintainers)

Most upvoted comments

In my tests, the kmod param EnableGpuFirmware just doesn’t have any effect with the open kmod. It doesn’t seem to be used anywhere in the code, but it’s 100% possible I’m misreading things: https://github.com/NVIDIA/open-gpu-kernel-modules/blob/3bf16b890caa8fd6b5db08b5c2437b51c758ac9d/kernel-open/nvidia/nv.c#L131-L133
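
In case it's useful for others checking their own nodes, here's roughly how I was inspecting this (a sketch; the parameter name NVreg_EnableGpuFirmware and the nvidia-smi field may vary by driver version):

$ modinfo nvidia | grep -i license
# "Dual MIT/GPL" => open kmod, "NVIDIA" => proprietary kmod
$ cat /sys/module/nvidia/parameters/NVreg_EnableGpuFirmware
# the value the module was loaded with; with the open kmod it appears to be ignored
$ nvidia-smi -q | grep -i "GSP Firmware"
# a non-N/A GSP Firmware Version means the GPU is actually running GSP firmware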

We intend to load the proprietary kmod on g5 types for the time being so that EnableGpuFirmware has the intended effect. That will ship in the next AMI release (the one after v20240315).
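
For anyone carrying their own workaround until then: with the proprietary kmod loaded, GSP can be turned off via a modprobe option, e.g. (the drop-in path is just an example; a reboot or driver reload is required for it to take effect):

$ echo "options nvidia NVreg_EnableGpuFirmware=0" | sudo tee /etc/modprobe.d/nvidia-disable-gsp.conf
$ sudo reboot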

I can confirm we have the same problem on g5.4xlarge with the new AMI; the GPU is completely unusable:

Kubelet:

Pod Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
$ nvidia-smi -a

=============NVSMI LOG==============

Timestamp                                 : Fri Nov 24 11:07:21 2023
Driver Version                            : 535.54.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : NVIDIA A10G
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Unknown Error
        Pending                           : Unknown Error
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1322321021225
    GPU UUID                              : Unknown Error
    Minor Number                          : 0
    VBIOS Version                         : Unknown Error
    MultiGPU Board                        : No
    Board ID                              : 0x1e
    Board Part Number                     : 900-2G133-A840-000
    GPU Part Number                       : Unknown Error
    FRU Part Number                       : N/A
    Module ID                             : Unknown Error
[...]

@bhavitsharma the GPU cards on P2s do not support the 5xx-series drivers. The 1.28 GPU AMI has always shipped the 535 driver, and starting with release v20231116, the 1.25+ GPU AMIs all ship the 535 driver as well. Therefore, P2s will not work with these AMIs.
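
A quick way to see which GPU/driver combination a node actually ended up with (standard nvidia-smi query flags, nothing AMI-specific; only useful on nodes where the driver loads at all):

$ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
# e.g. "Tesla K80, 470.x" on P2s vs "NVIDIA A10G, 535.54.03" on g5s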

The latest release (which will complete today) addresses this issue for g4dn instances on Kubernetes 1.29 by disabling GSP automatically. We're still working on the right solution for g5 instance types: they support EFA and as a result require the open-source NVIDIA kmod, but GSP cannot be disabled on the open-source kmod. We're following up with EC2 and NVIDIA regarding this issue.
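
For anyone who wants the same behavior on other Kubernetes versions in the meantime, here's a userdata-style sketch. The instance-type detection via IMDSv2 and the drop-in path are assumptions on my part, not what the AMI itself does, and the option only has an effect where the proprietary kmod is loaded (i.e. not on g5 with the open kmod):

# sketch: disable GSP on g4dn nodes at boot (userdata/bootstrap hook, run as root)
TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
ITYPE=$(curl -sS -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-type)
case "$ITYPE" in
  g4dn.*)
    echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-disable-gsp.conf
    ;;
esac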

Roughly yes, though we're only ever assigning 1 GPU, and I'm not sure if we're setting those env vars (it's not in our pod spec, but maybe we set it within the container image).

This is a health-checking image that constantly runs a set of checks, including those nvidia-smi commands (largely to catch GPU issues that have bitten us in the past). It's been a few weeks, but I'm also not sure whether it was triggered by a pod restarting, the DaemonSet pod being recreated/updated, or both.

Sorry for the delay; we're doing a rework of our NVIDIA setup to address #1494, which has taken priority.

Is the EKS team considering disabling GSP

Yes, I expect to get a fix out for this in the next few weeks.

@chiragjn thanks! That certainly looks like the smoking gun. Requiring a reboot puts us in a tough position, and I'm not sure we can do anything at runtime before systemd-modules-load runs; that happens very early in the boot process. I'll see what I can come up with 👍
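
For what it's worth, one pattern worth experimenting with is a tiny oneshot unit ordered before systemd-modules-load.service that writes the modprobe drop-in. Sharing it only as a sketch of the idea (the unit name and ordering are my own guesses, and it still won't help the open kmod):

cat <<'EOF' >/etc/systemd/system/nvidia-gsp-dropin.service
[Unit]
Description=Write NVIDIA GSP modprobe drop-in before kernel modules are loaded
DefaultDependencies=no
After=systemd-remount-fs.service
Before=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-disable-gsp.conf'

[Install]
WantedBy=sysinit.target
EOF
systemctl enable nvidia-gsp-dropin.service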

it’s a layer above what’s here in this repo @chiragjn (cough! check license of things cough!)