karpenter-provider-aws: Race condition with nvidia-device-plugin?

Version

Karpenter: v0.6.4

Kubernetes: v1.21.5-eks-bc4871b

Expected Behavior

I have a pod (NVIDIA Riva) which has a resource limit of:

resources:
  limits:
    nvidia.com/gpu: "1"

When I scale the deployment to 1, Karpenter should provision a new node. The node will eventually go Ready, and the nvidia-device-plugin daemonset will start a pod on it, which adds the nvidia.com/gpu resource to the node.

The Riva pod will then start correctly on the new node.
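Once the device plugin has registered, the resource shows up in the node's capacity; a quick way to verify this (illustrative command, substitute the node name):

kubectl get node <node-name> -o jsonpath='{.status.capacity.nvidia\.com/gpu}'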

Actual Behavior

Karpenter provisions a new GPU node correctly; however, it also immediately binds the Riva pod to this new node.

At this point some sort of timer appears to start counting down, because roughly 60 seconds later the Riva pod transitions to the OutOfnvidia.com/gpu state and is stuck.

This is because the node must start up far enough to run the nvidia-device-plugin daemonset (which adds the nvidia.com/gpu resource to the node), and this usually takes longer than 60 seconds.
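The startup lag is easy to see by watching the device-plugin pod come up on the new node (the label below assumes the stock nvidia-device-plugin daemonset manifest):

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide -w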

These stuck pods remain and seem to keep blocking the nvidia.com/gpu resource, at which point Karpenter attempts to provision yet another new node, leading to a loop in which nodes are repeatedly launched over and over again:

riva-riva-api-768b77d764-74k4m   0/2     OutOfnvidia.com/gpu   0          5m55s
riva-riva-api-768b77d764-9gr5b   0/2     OutOfnvidia.com/gpu   0          4m42s
riva-riva-api-768b77d764-gcngv   0/2     OutOfnvidia.com/gpu   0          102s
riva-riva-api-768b77d764-qhc4k   0/2     OutOfnvidia.com/gpu   0          2m45s
riva-riva-api-768b77d764-thnp9   0/2     OutOfnvidia.com/gpu   0          3m48s
Events:
  Type     Reason               Age                    From               Message
  ----     ------               ----                   ----               -------
  Warning  FailedScheduling     5m41s (x2 over 5m43s)  default-scheduler  0/11 nodes are available: 1 node(s) had taint {karpenter.sh/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {ocrOnly: true}, that the pod didn't tolerate, 1 node(s) had taint {tritonOnly: true}, that the pod didn't tolerate, 4 Insufficient nvidia.com/gpu, 4 node(s) didn't match Pod's node affinity/selector.
  Warning  OutOfnvidia.com/gpu  4m52s                  kubelet            Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0

Steps to Reproduce the Problem

Install the Riva helm chart and scale deployment to 1.
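Roughly (the chart reference is a placeholder; release and namespace names match the specs below):

helm install riva <riva-api-chart> --namespace api-dev
kubectl --namespace api-dev scale deployment riva-riva-api --replicas=1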

Resource Specs and Logs

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-riva
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g4dn.xlarge", "g5.xlarge"]
  taints:
    - key: rivaOnly
      value: "true"
      effect: NoExecute
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  limits:
    resources:
      nvidia.com/gpu: 10
  provider:
    launchTemplate: karpenter-gpu
    subnetSelector:
      Tier: private
  ttlSecondsAfterEmpty: 300
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    meta.helm.sh/release-name: riva
    meta.helm.sh/release-namespace: api-dev
  creationTimestamp: "2022-02-16T08:48:28Z"
  generation: 10
  labels:
    app: riva-api
    app.kubernetes.io/managed-by: Helm
    chart: riva-api-1.8.0-beta
    heritage: Helm
    release: riva
  name: riva-riva-api
  namespace: api-dev
  resourceVersion: "112778449"
  uid: ba419874-da09-4915-bc1f-5310655ca24c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: riva-api
      release: riva
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: riva-api
        release: riva
    spec:
      containers:
      - args:
        - --asr_service=true
        - --nlp_service=false
        - --tts_service=false
        command:
        - /opt/riva/bin/start-riva
        env:
        - name: TRTIS_MODEL_STORE
          value: /data/models
        - name: LD_PRELOAD
        image: nvcr.io/nvidia/riva/riva-speech:1.8.0-beta-server
        imagePullPolicy: IfNotPresent
        livenessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=:50051
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: riva-speech-api
        ports:
        - containerPort: 50051
          name: speech-grpc
          protocol: TCP
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 8001
          name: grpc
          protocol: TCP
        - containerPort: 8002
          name: metrics
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=:50051
          failureThreshold: 3
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            nvidia.com/gpu: "1"
        startupProbe:
          exec:
            command:
            - /bin/grpc_health_probe
            - -addr=:50051
          failureThreshold: 12
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /data/
          name: workdir
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: imagepullsecret
      initContainers:
      - command:
        - download_and_deploy_ngc_models
        - -d
        - nvidia/riva/rmir_asr_citrinet_1024_asrset3p0_streaming:1.8.0-beta
        - nvidia/riva/rmir_asr_citrinet_1024_asrset3p0_offline:1.8.0-beta
        env:
        - name: NGC_CLI_ORG
          value: nvidia
        - name: NGC_CLI_API_KEY
          valueFrom:
            secretKeyRef:
              key: apikey
              name: modelpullsecret
        - name: MODEL_DEPLOY_KEY
          valueFrom:
            secretKeyRef:
              key: key
              name: riva-model-deploy-key
        image: nvcr.io/nvidia/riva/riva-speech:1.8.0-beta-servicemaker
        imagePullPolicy: IfNotPresent
        name: riva-model-init
        resources:
          limits:
            nvidia.com/gpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /rmir
          name: artifact-volume
        - mountPath: /data/
          name: workdir
      nodeSelector:
        karpenter.sh/provisioner-name: gpu-riva
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        key: rivaOnly
        operator: Exists
      volumes:
      - hostPath:
          path: /data/riva/
          type: DirectoryOrCreate
        name: artifact-volume
      - hostPath:
          path: /data/riva
          type: DirectoryOrCreate
        name: workdir

Karpenter logs:

2022-03-03T14:13:51.627Z	INFO	controller.provisioning	Batched 1 pods in 1.00054383s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:13:51.632Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:13:53.629Z	INFO	controller.provisioning	Launched instance: i-0530447a932454fa5, hostname: ip-10-32-28-237.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:13:53.645Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-28-237.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:13:53.645Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:13:54.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-30-147.eu-west-1.compute.internal"}
2022-03-03T14:13:54.021Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-30-147.eu-west-1.compute.internal"}
2022-03-03T14:13:54.189Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-30-147.eu-west-1.compute.internal"}
2022-03-03T14:14:06.944Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-20-129.eu-west-1.compute.internal"}
2022-03-03T14:15:02.754Z	INFO	controller.provisioning	Batched 1 pods in 1.000881401s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:15:03.000Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-17-79.eu-west-1.compute.internal"}
2022-03-03T14:15:03.025Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-17-79.eu-west-1.compute.internal"}
2022-03-03T14:15:03.069Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g4dn.xlarge g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:15:03.236Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-17-79.eu-west-1.compute.internal"}
2022-03-03T14:15:05.330Z	INFO	controller.provisioning	Launched instance: i-0c21c14033f402684, hostname: ip-10-32-21-191.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:15:05.345Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-21-191.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:15:05.345Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:15:19.663Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-28-237.eu-west-1.compute.internal"}
2022-03-03T14:16:09.355Z	INFO	controller.provisioning	Batched 1 pods in 1.000995209s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:16:09.571Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:16:11.563Z	INFO	controller.provisioning	Launched instance: i-0c4bc55b215429d34, hostname: ip-10-32-24-230.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:16:11.581Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-24-230.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:16:11.581Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:16:18.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-21-84.eu-west-1.compute.internal"}
2022-03-03T14:16:18.019Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-21-84.eu-west-1.compute.internal"}
2022-03-03T14:16:18.184Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-21-84.eu-west-1.compute.internal"}
2022-03-03T14:16:25.459Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-21-191.eu-west-1.compute.internal"}
2022-03-03T14:17:18.598Z	INFO	controller.provisioning	Batched 1 pods in 1.001033588s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:17:18.759Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:17:20.674Z	INFO	controller.provisioning	Launched instance: i-06485eeeab5691aa9, hostname: ip-10-32-22-237.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:17:20.690Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-22-237.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:17:20.690Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:17:24.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-19-108.eu-west-1.compute.internal"}
2022-03-03T14:17:24.018Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-19-108.eu-west-1.compute.internal"}
2022-03-03T14:17:24.207Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-19-108.eu-west-1.compute.internal"}
2022-03-03T14:17:34.697Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-24-230.eu-west-1.compute.internal"}
2022-03-03T14:18:07.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-25-189.eu-west-1.compute.internal"}
2022-03-03T14:18:07.018Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-25-189.eu-west-1.compute.internal"}
2022-03-03T14:18:07.247Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-25-189.eu-west-1.compute.internal"}
2022-03-03T14:18:27.593Z	INFO	controller.provisioning	Batched 1 pods in 1.000639619s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:18:27.756Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:18:29.778Z	INFO	controller.provisioning	Launched instance: i-07cc1f656002a5b3f, hostname: ip-10-32-17-13.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:18:29.793Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-17-13.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:18:29.793Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:18:43.750Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-22-237.eu-west-1.compute.internal"}
2022-03-03T14:19:06.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-20-129.eu-west-1.compute.internal"}
2022-03-03T14:19:06.021Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-20-129.eu-west-1.compute.internal"}
2022-03-03T14:19:06.297Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-20-129.eu-west-1.compute.internal"}
2022-03-03T14:19:38.911Z	INFO	controller.provisioning	Batched 1 pods in 1.001071464s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:19:39.106Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g4dn.xlarge g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:19:40.981Z	INFO	controller.provisioning	Launched instance: i-090afc78e782f4a76, hostname: ip-10-32-19-214.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:19:41.012Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-19-214.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:19:41.012Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:19:56.032Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-17-13.eu-west-1.compute.internal"}
2022-03-03T14:20:19.000Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-28-237.eu-west-1.compute.internal"}
2022-03-03T14:20:19.030Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-28-237.eu-west-1.compute.internal"}
2022-03-03T14:20:19.250Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-28-237.eu-west-1.compute.internal"}
2022-03-03T14:20:51.736Z	INFO	controller.provisioning	Batched 1 pods in 1.000044154s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:20:52.130Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:20:54.102Z	INFO	controller.provisioning	Launched instance: i-00ee81c90d469d369, hostname: ip-10-32-20-177.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:20:54.117Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-20-177.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:20:54.118Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:21:08.840Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-19-214.eu-west-1.compute.internal"}
2022-03-03T14:21:25.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-21-191.eu-west-1.compute.internal"}
2022-03-03T14:21:25.020Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-21-191.eu-west-1.compute.internal"}
2022-03-03T14:21:25.304Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-21-191.eu-west-1.compute.internal"}
2022-03-03T14:21:45.696Z	INFO	controller.provisioning	Batched 1 pods in 1.000961501s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:21:45.702Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:21:47.857Z	INFO	controller.provisioning	Launched instance: i-02e9b47ca243fc22d, hostname: ip-10-32-30-183.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:21:47.872Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-30-183.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:21:47.872Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:22:00.966Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-20-177.eu-west-1.compute.internal"}
2022-03-03T14:22:00.984Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-20-177.eu-west-1.compute.internal"}
2022-03-03T14:22:34.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-24-230.eu-west-1.compute.internal"}
2022-03-03T14:22:34.024Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-24-230.eu-west-1.compute.internal"}
2022-03-03T14:22:34.220Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-24-230.eu-west-1.compute.internal"}
2022-03-03T14:22:48.141Z	INFO	controller.provisioning	Batched 1 pods in 1.000828756s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:22:48.147Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g4dn.xlarge g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:22:50.060Z	INFO	controller.provisioning	Launched instance: i-01f2158e05faacc43, hostname: ip-10-32-27-174.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:22:50.084Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-27-174.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:22:50.084Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:23:03.505Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-30-183.eu-west-1.compute.internal"}
2022-03-03T14:23:43.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-22-237.eu-west-1.compute.internal"}
2022-03-03T14:23:43.019Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-22-237.eu-west-1.compute.internal"}
2022-03-03T14:23:43.233Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-22-237.eu-west-1.compute.internal"}
2022-03-03T14:23:51.036Z	INFO	controller.provisioning	Batched 1 pods in 1.000458171s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:23:51.042Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:23:53.060Z	INFO	controller.provisioning	Launched instance: i-09faa781565355cff, hostname: ip-10-32-21-129.eu-west-1.compute.internal, type: g5.xlarge, zone: eu-west-1b, capacityType: spot	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:23:53.077Z	INFO	controller.provisioning	Bound 1 pod(s) to node ip-10-32-21-129.eu-west-1.compute.internal	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:23:53.078Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:24:07.137Z	INFO	controller.node	Added TTL to empty node	{"commit": "82ea63b", "node": "ip-10-32-27-174.eu-west-1.compute.internal"}
2022-03-03T14:24:56.001Z	INFO	controller.node	Triggering termination after 5m0s for empty node	{"commit": "82ea63b", "node": "ip-10-32-17-13.eu-west-1.compute.internal"}
2022-03-03T14:24:56.019Z	INFO	controller.termination	Cordoned node	{"commit": "82ea63b", "node": "ip-10-32-17-13.eu-west-1.compute.internal"}
2022-03-03T14:24:56.210Z	INFO	controller.termination	Deleted node	{"commit": "82ea63b", "node": "ip-10-32-17-13.eu-west-1.compute.internal"}
2022-03-03T14:25:02.152Z	INFO	controller.provisioning	Batched 1 pods in 1.000620896s	{"commit": "82ea63b", "provisioner": "gpu-riva"}
2022-03-03T14:25:02.159Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 1 pod(s) with instance type option(s) [g5.xlarge]	{"commit": "82ea63b", "provisioner": "gpu-riva"}

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 30 (22 by maintainers)

Most upvoted comments

I was able to reproduce this by adding nvidia.com/gpu: "1" under spec.resources.limits.

After that, running kubectl get node -l karpenter.sh/provisioner-name -o json -w | jq -r ".status.capacity" shows that "nvidia.com/gpu" is being set to 0.
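While the node is starting up, that prints something like the following (illustrative; the other capacity fields are elided):

{
  ...
  "nvidia.com/gpu": "0",
  ...
}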

It appears that the kubelet is zeroing out the extended resource, as indicated in the logs:

Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.534878   22005 cpu_manager.go:199] "Starting CPU manager" policy="none"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.534897   22005 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.534918   22005 state_mem.go:36] "Initialized new in-memory state store"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.536846   22005 policy_none.go:44] "None policy: Start"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.556400   22005 kubelet_node_status.go:109] "Node was previously registered" node="ip-192-168-97-57.ec2.internal"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.556425   22005 kubelet_node_status.go:275] "Controller attach-detach setting changed to true; updating existing Node"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.556498   22005 kubelet_node_status.go:182] "Zero out resource capacity in existing node" resourceName="nvidia.com/gpu" node="ip-192-168-97-57.ec2.internal"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.558669   22005 manager.go:242] "Starting Device Plugin manager"
Apr 11 16:46:56 ip-192-168-97-57.ec2.internal kubelet[22005]: I0411 16:46:56.558735   22005 manager.go:600] "Failed to retrieve checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"

Here’s where that happens: https://github.com/kubernetes/kubernetes/blob/39c76ba2edeadb84a115cc3fbd9204a2177f1c28/pkg/kubelet/kubelet_node_status.go#L178. When the device manager reports that the node may have been recreated (note the "Failed to retrieve checkpoint" line above), the kubelet zeroes out all extended-resource capacity until the device plugin re-registers.
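To confirm this on an affected node, grepping the kubelet journal for that message should work (assuming a systemd-managed kubelet, as on the standard EKS AMIs):

journalctl -u kubelet | grep "Zero out resource capacity"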

We are taking a deeper look at this and will update the ticket with next steps.

Sorry, I don’t know when this will be released, but it is a priority. There is a PR up at https://github.com/aws/karpenter/pull/1837 which combines eliminating pod binding with eliminating node creation, which solves another issue as well.

@tzneal This snapshot is working well for us. nvidia.com/gpu is being reset correctly, so nodes are being reused. We’re no longer seeing the intermittent OutOfnvidia.com/gpu when the pod runs the first time on a gpu node. 🎉

EDIT: We ended up just removing the nvidia-device-plugin, since Karpenter sets the nvidia.com/gpu node resource by itself. So far it’s looking fine.

This will sometimes work; the problem is that occasionally the kubelet will zero out the extended resources on startup, setting the number of GPUs to zero. I am actively working on a solution for this.

We are considering the option of not binding pods, and that should help with this and a few other issues.

Any update on this issue?

I can confirm that when the new node initially comes up it does not have any GPU resources listed.

This sounds like the root cause. We know the number of GPUs from the instance type. Are you interested in contributing this fix?
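For illustration, since the GPU count is known from the instance type, Karpenter could populate the extended resource on the node object it creates up front, e.g. for a g5.xlarge (a sketch of the idea, not the actual fix):

status:
  capacity:
    nvidia.com/gpu: "1"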

Yep! Have submitted a PR.