karpenter-provider-aws: Karpenter does not respect volume-attach-limit set for EBS volumes.

Version

Karpenter Version: v0.27.0

Kubernetes Version: v1.24.10

Expected Behavior

Hello. We run Karpenter on EKS and limit the number of attachable EBS volumes per node by setting --volume-attach-limit on the ebs-csi-node DaemonSet. I would expect Karpenter to create a new node once the existing nodes have reached their limit of attached EBS volumes.
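For reference, the flag is set on the node DaemonSet roughly like this (a trimmed sketch; the container name and the value 5 are illustrative, and the real manifest carries additional args, env, and volume mounts that are omitted here):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebs-csi-node
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ebs-csi-node
  template:
    metadata:
      labels:
        app: ebs-csi-node
    spec:
      containers:
        - name: ebs-plugin
          image: public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.16.0
          args:
            - node
            # caps how many EBS volumes the driver advertises as attachable on each node
            - --volume-attach-limit=5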

Actual Behavior

After the attachable EBS volume limit is reached on all nodes, the pods are stuck in the Pending state. No new nodes are created. All pods show the same event:

0/5 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) exceed max volume count. preemption: 0/5 nodes are available: 2 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.

Steps to Reproduce the Problem

On a test cluster, set --volume-attach-limit=5 on the ebs-csi-node and have 3 nodes available to run the following deployment with 20 replicas. I would expect at least 1 additional node to be created so the entire workload can run.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: echoserver
  namespace: default
spec:
  replicas: 20
  selector:
    matchLabels:
      app: echoserver
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
        - image: ealen/echo-server:latest
          imagePullPolicy: IfNotPresent
          name: echoserver
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: PORT
              value: '8080'
          resources:
            requests:
              memory: 64Mi
              cpu: 10m
            limits:
              memory: 128Mi
              cpu: 40m
          securityContext:
            runAsNonRoot: true
            runAsUser: 101
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false
            capabilities:
              drop:
                - ALL
            seccompProfile:
              type: RuntimeDefault
          volumeMounts:
          - mountPath: "/scratch"
            name: scratch-volume
      securityContext:
        fsGroup: 101
      volumes:
        - name: scratch-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: my-test-volume
              spec:
                accessModes: [ "ReadWriteOnce" ]
                resources:
                  requests:
                    storage: 3Gi

Here you can see that the limit enforced by the EBS CSI driver does work:

kubectl get nodes -o json | jq '.items[] | {"nodeName": .metadata.name, "volumesInUse": .status.volumesInUse | length, "volumesAttached": .status.volumesAttached | length }'
{
  "nodeName": "ip-xx-xx-xx-xx.eu-central-1.compute.internal",
  "volumesInUse": 5,
  "volumesAttached": 5
}
{
  "nodeName": "ip-xx-xx-xx-xx.eu-central-1.compute.internal",
  "volumesInUse": 5,
  "volumesAttached": 5
}
{
  "nodeName": "ip-xx-xx-xx-xx.eu-central-1.compute.internal",
  "volumesInUse": 5,
  "volumesAttached": 5
}

The same is visible on the CSINode objects:

kubectl get csinode ip-xx-xx-xx-xx.eu-central-1.compute.internal -o yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  ...
spec:
  drivers:
  - allocatable:
      count: 5
    name: ebs.csi.aws.com
    nodeID: i-xxxxxxxxxxxx
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

Resource Specs and Logs

There is literally nothing in the Karpenter controller logs after the deployment was applied.
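(For anyone reproducing this, the controller logs can be checked with something like the following; the namespace, deployment, and container names assume a default Helm install and may differ.)

kubectl logs deployment/karpenter -c controller -n karpenter --tail=200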

Provisioner spec:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    clusterDNS:
      - xxxxxx
    maxPods: 110
  limits:
    resources:
      cpu: 1k
  providerRef:
    name: private-node
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
        - on-demand
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: karpenter.k8s.aws/instance-hypervisor
      operator: In
      values:
        - nitro
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values:
        - '3'
    - key: karpenter.k8s.aws/instance-cpu
      operator: Lt
      values:
        - '129'
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values:
        - '2'
  startupTaints:
    - effect: NoExecute
      key: node.cilium.io/agent-not-ready
      value: 'true'
  ttlSecondsUntilExpired: 604800
  weight: 95

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 13
  • Comments: 16 (8 by maintainers)

Most upvoted comments

If there are no logs, have you tried restarting the Karpenter pods? Also, could you enable debug-level logging if you haven’t already? It’s difficult to debug this issue or figure out what’s going on without the logs.
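(A hedged aside for anyone following along: on chart versions around v0.27, the controller’s log level is normally raised through the Knative-style config-logging ConfigMap in the namespace Karpenter runs in; the namespace and exact keys may differ for your install.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-logging
  namespace: karpenter
data:
  # raise the zap logger to debug so scheduling decisions show up in the controller logs
  zap-logger-config: |
    {
      "level": "debug",
      "encoding": "console",
      "outputPaths": ["stdout"],
      "errorOutputPaths": ["stderr"]
    }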

I’m not able to reproduce the issue when I deploy a similar configuration on my cluster running the same version of Karpenter. I did a scale-up with 100 StatefulSets of 1 replica each, generating 100 PVCs with a volume limit of 5, and Karpenter scaled me up to 20 nodes.
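(For context, a minimal sketch of the kind of StatefulSet used in that test; the names, image, and storage size are placeholders. Note that these PVCs come from volumeClaimTemplates rather than ephemeral volumes, which turns out to matter below.)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: volume-test
spec:
  replicas: 1
  serviceName: volume-test
  selector:
    matchLabels:
      app: volume-test
  template:
    metadata:
      labels:
        app: volume-test
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi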

@jonathan-innis Cool. So we finally nailed down the problem. Setting the storageClassName did indeed help. Looking forward to having the fix generally available. Many thanks. 👍🏻

@sdomme Yep, you’re right. This problem is specific to volumeClaimTemplates defined as ephemeral volumes on pods. It looks like we aren’t discovering the StorageClass name when the pod relies on the default StorageClass. Can you try specifying the storageClassName as a workaround for now and see if that fixes your issue?
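(Applied to the deployment from the reproduction steps, the workaround is just an explicit storageClassName in the ephemeral volumeClaimTemplate; gp2 below is a placeholder for whatever class your cluster actually defaults to.)

      volumes:
        - name: scratch-volume
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: [ "ReadWriteOnce" ]
                # naming the class explicitly works around Karpenter not resolving the default StorageClass
                storageClassName: gp2
                resources:
                  requests:
                    storage: 3Gi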

@jonathan-innis Please find the attachments requested.

final-output.zip

Can you share the full -o yaml output from all of the above? Also, getting the full -o yaml output from the nodes that Karpenter scheduled these PVs onto would be good as well. You can attach them as files in your response if they get too long.

Can you share the events from the pods that aren’t scheduling? Ideally, I’d like to see whether Karpenter thinks the pods should schedule on your current set of nodes.
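(For reference, those events can be pulled with something like the following; the pod name is a placeholder.)

kubectl describe pod <echoserver-pod-name> -n default
kubectl get events -n default --field-selector involvedObject.kind=Pod,reason=FailedScheduling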

This looks to be a potential duplicate of https://github.com/aws/karpenter-core/issues/260. Can you confirm whether you might be hitting the edge case described there?