aws-ebs-csi-driver: instance volume limits: workloads no longer attach ebs volumes

/kind bug

What happened? Workloads stop attaching EBS volumes after the instance volume limit is reached; the expected number of replicas for our requirement isn't met and pods are left in a Pending state.

Nodes have the appropriate limit set to 25, but the scheduler still sends more than 25 pods with volumes to a node.

kubelet Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition

attachdetach-controller AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

ebs-csi-controller driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching
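A quick way to confirm how many volumes are actually attached to a given node is to count its VolumeAttachment objects (the node name below is a placeholder):

kubectl get volumeattachments \
  -o custom-columns=NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached \
  | grep -c '<node-name>'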

How to reproduce it (as minimally and precisely as possible)?

Deploying the test below should be sufficient to simulate the problem:

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi
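For reference, applying the manifest and watching the failure can be done with something like this (the file name is illustrative):

kubectl apply -f vols-statefulset.yaml
kubectl get pods -n vols -w                            # pods beyond the node's attach limit stay Pending or stuck in ContainerCreating
kubectl get events -n vols --sort-by=.lastTimestamp    # surfaces the attach/mount failure events quoted above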

Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale, and the replica count is reached with all volumes attached.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 10          
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Environment

  • Kubernetes version: Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version: v1.5.0 (Helm chart v2.6.2)

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 38
  • Comments: 56 (20 by maintainers)

Most upvoted comments

@ryanpxyz looking at the code I think the CSI just reports how many attachments it can make. Until the PR to make this dynamic is merged and released this is a fixed value by instance type or arg. This means there are two related but distinct issues.

The first is the incorrect max value that doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (only the 5 series is detected as Nitro) with no arg will have a limit of 25, which is correct as long as you only have 3 extra attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work but ones with one get stuck.
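As a rough illustration of that arithmetic (the 28-slot figure is the documented Nitro behaviour; the attachment mix below is just an example):

28 shared attachment slots (most Nitro instances)
 - 1 root EBS volume
 - 2 ENIs (primary + one extra, e.g. from custom networking)
 = 25 slots left for CSI-managed volumes, so the fixed limit of 25 only holds when exactly 3 slots are taken by other attachments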

The second problem, which is what this issue is tracking, is that when meeting the criteria for a correctly reported max it is still possible that too many pods will be scheduled on a node.

Unless I’m mistaken, this still seems to be an issue in Kubernetes v1.28 (on EKS) with version v1.23.1 of the EBS CSI Driver. The following (albeit unrealistic) example reproduces the problem, by trying to send 26 pods in a statefulset with one PVC each to the same node. I would hope the scheduler wouldn’t do this, but instead on my nodes it gets to 24 pods and the 24th gets stuck in a pending state complaining that it can’t attach the volume. Is this likely to be fixed in an upcoming release? It’s causing us major problems. Obviously we’re not trying to send 26 pods with PVCs to the same node, but intermittently in our application the scheduler tries to schedule a pod with a PVC that won’t attach because the attachment quota has been breached, causing downtime and instability. Is there any workaround for this? Thanks in advance.

apiVersion: v1
kind: Namespace
metadata:
  name: vols-test
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols-test
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 26
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - vols-pv-test
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ebs-sc" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Hello,

… update from our side:

Our first simple workaround, from when we first observed the problem yesterday (it might help others who are stuck and looking for a 'quick fix'):

  • Cordon the node that the pod is stuck in 'Init …' on.
  • Delete the pod.
  • Verify that the pod starts successfully on an alternative node.
  • If it doesn't, repeat the cordoning until the pod is successfully deployed.
  • Uncordon the node(s) once the pod is successfully deployed.
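In kubectl terms the workaround is roughly (names are placeholders):

kubectl cordon <node-with-stuck-pod>
kubectl delete pod <stuck-pod> -n <namespace>
kubectl get pod <stuck-pod> -n <namespace> -o wide   # confirm it was rescheduled onto another node and is Running
# if it lands on another node that is also at its limit, repeat the cordon/delete, then:
kubectl uncordon <node-with-stuck-pod>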

Then, following a dive into the CSI EBS driver code, we passed the option '--volume-attach-limit=50' to the node driver. I haven't tested this explicitly yet, however.
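For anyone installing via the Helm chart, the flag can also be set through the chart values rather than editing the DaemonSet by hand; to the best of my knowledge the relevant key is node.volumeAttachLimit (please verify against your chart version's values.yaml):

node:
  volumeAttachLimit: 50   # rendered as --volume-attach-limit=50 on the node plugin container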

The problem to me seems to be a missing feedback loop between the ‘node driver’ and the scheduler.

The scheduler says, “Hey, there’s a node that satisfies my scheduling criteria … I’ll schedule the workload to run there …” and the node driver says, “OK, I have a workload but I’ve reached this ‘25 attached volumes’ limit so I’m done here …”.

This is just my perhaps primitive view of the situation.

Thanks,

Phil.

PS … following a re-deployment of the 'csi ebs node driver', we are still seeing the attribute 'attachable-volumes-aws-ebs' set to 25 on a 'describe node':

[screenshot: kubectl describe node output showing attachable-volumes-aws-ebs: 25]

… we weren’t expecting this.

@idanshaby The add-on schema has already been updated to include this parameter!

$ eksctl utils describe-addon-configuration --name aws-ebs-csi-driver --version v1.25.0-eksbuild.1 | yq

    "additionalDaemonSets": {
      "default": {},
      "description": "Additional DaemonSets of the node pod",
      "patternProperties": {
        "^.*$": {
          "$ref": "#/properties/node",
          "type": "object"
        }
      },
      "type": "object"
    },
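For completeness, applying an override through the managed add-on should look something like the following; the node.volumeAttachLimit key is assumed to mirror the Helm chart values, so validate it against the schema output above:

$ aws eks update-addon --cluster-name <cluster> --addon-name aws-ebs-csi-driver \
    --configuration-values '{"node":{"volumeAttachLimit":25}}'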

@ryanpxyz you are looking in the wrong place for the attachable limits of the CSI driver. The attach limit of the CSI driver is reported via CSINode objects. If we are not rebuilding CSINode objects during a redeploy of the driver, that sounds like a bug. So setting --volume-attach-limit and redeploying the driver should set the correct limits.
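For example, the limit the scheduler actually consumes can be read directly from the CSINode object (node name is a placeholder):

$ kubectl get csinode <node-name> \
    -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'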

As for a bug in the scheduler - here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210 . It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by CSINode, then that would be a k/k bug (and we would need to open one).

@stevehipwell thank you for your prompt and detailed reply and explanation. I have a clearer scope now.

Just to provide some clarity, I am trying to stress test my cluster with some workloads deployed through Helm. Sadly the majority of those workloads require an EBS volume attachment, and this is when I ran into the error message AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-xxxxxxx" to node "i-xxxxxxx": attachment of disk "vol-xxxxxx" failed, expected device to be attached but was attaching, which is what led me here in the first place.

From reading both your replies and also searching the docs you recommended, I realised the following:

  • One ENI is immediately attached to one Node and
  • (if the behavior is not changed explicitly) a second ENI is pre-emptively attached as soon as the first one starts being used
  • those ENIs and the root volume of the Node consume the number of max EBS attachments you can have on a Node
  • the max EBS volume attachments you can have on a Node is pre-defined and unchangeable

So within the context of my use case: when I run a lot of small workloads on an m6a Node, which is otherwise capable of supporting hundreds of Pods, I am inevitably going to run into the issue of "running out" of available attachments if all my Pods require their own volume.

To make matters worse a large number of small Pods all requiring IP addresses increases the amount of ENI attachments on my Node which further lowers my available EBS attachments.

So I could try bumping the version of the driver per your suggestion, but if I understood everything correctly, that wouldn't help with what I am trying to do. The "sensible" thing is to use either larger workloads to fill up my Node, or workloads that don't require EBS attachments at all.

@sotiriougeorge I might be completely missing the point of your question here, and you really need to provide the actual version you’re using, but I’ll go on.

Firstly, as the m6a instance is a Nitro instance, you're physically limited to 28 attachments (see docs), some of which might also be used for non-volume attachments. So I don't see any value in you raising the limit to 70, as it's not physically possible to attach more than 27 volumes to a Nitro instance (even if attaching nothing else, since every instance needs at least one ENI).

Secondly, this thread (and others) covers the limitations in earlier versions of this CSI driver, specifically around only detecting 5th-generation instances as Nitro instances and not taking existing attachments into account when calculating available attachments. All of these are fixed in recent versions of the CSI driver. I've seen you've commented on https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1258 so you're aware that you should discount the attachable-volumes-aws-ebs value and look at the CSINode instead.

My recommendation would be to update to the latest version of the driver and see how that works for you.

@jmhwang7

We initially thought this was due to a race condition in the Kubernetes scheduler, and so we lowered the manually set volume-attach-limit. We are still getting paged/running into this issue despite the fact that the node has significantly less volumes attached than the ec2 instance can support (24 volumes + 1 eni = 25, 28 is the limit for the nitro instance, 24th volume gets stuck in attaching).

What I'm seeing is that the calculation is much more complex than "28 for Nitro", so I suggest trying to lower the number of volumes to something like 23 (as naturally all our CSINode objects report a max of 25 right now on Nitro).

@jmhwang7 nothing came of it; they provided the exact "calculation" for Nitro volumes that is already used in the driver. See their answer below:

Please allow me to inform you that most of the Nitro instances support a maximum of 28 attachments. Attachments include network interfaces, EBS volumes, and NVMe instance store volumes.

However, there are exceptions for few Nitro instances. For these instances, the following limits apply:

  • d3.8xlarge and d3en.12xlarge instances support a maximum of 3 EBS volumes.
  • inf1.xlarge and inf1.2xlarge instances support a maximum of 26 EBS volumes.
  • inf1.6xlarge instances support a maximum of 23 EBS volumes.
  • inf1.24xlarge instances support a maximum of 11 EBS volumes.
  • Most bare metal instances support a maximum of 31 EBS volumes.
  • mac1.metal instances support a maximum of 16 EBS volumes.
  • High memory virtualized instances support a maximum of 27 EBS volumes.
  • High memory bare metal instances support a maximum of 19 EBS volumes. If you launched a u-6tb1.metal, u-9tb1.metal, or u-12tb1.metal high memory bare metal instance before March 12, 2020, it supports a maximum of 14 EBS volumes. To attach up to 19 EBS volumes to these instances, contact your account team to upgrade the instance at no additional cost.

The same is mentioned in the below AWS documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html#instance-type-volume-limits

All other AWS Nitro instances (excluding the above mentioned exceptions) support maximum of 28 attachments.

We run and manage our own k8s clusters on top of EC2 instances, but we run the aws-ebs-csi-driver to manage ebs volumes.

We discovered a pretty gnarly bug in 1.10: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1361 . After setting the volume-attach-limit as suggested in that issue, we started seeing exactly what @jortkoopmans detailed in this comment: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1163#issuecomment-1237132913 . Even for older nodes (multiple days/weeks old), when 2 pods get scheduled around the same time, only the N-1 pod, where N is the volume-attach-limit set, has its volume attached successfully, and the Nth pod's volume gets stuck in an "attaching" state.

We initially thought this was due to a race condition in the Kubernetes scheduler, and so we lowered the manually set volume-attach-limit. We are still getting paged/running into this issue despite the fact that the node has significantly less volumes attached than the ec2 instance can support (24 volumes + 1 eni = 25, 28 is the limit for the nitro instance, 24th volume gets stuck in attaching).

@pkit did you have any luck with the support ticket you filed with AWS?

@jortkoopmans it looks like it's exactly #1278. I've tested it too. It's always the last two pods that fail if scheduled dynamically one after another. Essentially it's a deal breaker for any dynamic allocation of pods.

@jrsdav thanks for looking out, but that functionality sets a kubelet arg (incorrectly in most cases) and isn't related to storage attachments. This issue was never about the correct max value being set for attachments (that's a separate issue, with a fix coming in the next minor version); it was about a scheduling issue that didn't make much sense.