aws-ebs-csi-driver: instance volume limits: workloads no longer attach ebs volumes
/kind bug
What happened? Workloads stop attaching EBS volumes once the instance volume limit is reached; the expected number of replicas is never met and pods remain in a Pending state.
Nodes have the appropriate limit of 25 set, but the scheduler still sends more than 25 pods with volumes to a single node.
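A quick way to verify this is to count the EBS CSI `VolumeAttachment` objects per node; the snippet below is a sketch that assumes `jq` is installed and the driver is registered as `ebs.csi.aws.com`:

```bash
# Counts EBS CSI volume attachments per node; compare against the node's reported limit.
kubectl get volumeattachments -o json \
  | jq -r '.items[] | select(.spec.attacher=="ebs.csi.aws.com") | .spec.nodeName' \
  | sort | uniq -c
```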
```
kubelet                   Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition
attachdetach-controller   AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching
ebs-csi-controller        driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching
```
How to reproduce it (as minimally and precisely as possible)?
Deploying the manifest below should be sufficient to simulate the problem:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
        - name: nginx
          image: k8s.gcr.io/nginx-slim:0.8
          volumeMounts:
            - name: test-volume
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: test-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "******" # something with a reclaim policy of delete
        resources:
          requests:
            storage: 1Gi
```
Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale and the replica count is reached with all volumes attached.
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
        - name: nginx
          image: k8s.gcr.io/nginx-slim:0.8
          ports:
            - containerPort: 80
              name: web
          livenessProbe:
            tcpSocket:
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 10
          volumeMounts:
            - name: test-volume
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: test-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "******" # something with a reclaim policy of delete
        resources:
          requests:
            storage: 1Gi
```
Environment
- Kubernetes version: Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
- Driver version: Helm chart v2.6.2, driver v1.5.0
About this issue
- State: open
- Created 2 years ago
- Reactions: 38
- Comments: 56 (20 by maintainers)
@ryanpxyz looking at the code, I think the CSI driver just reports how many attachments it can make. Until the PR to make this dynamic is merged and released, this is a fixed value determined by instance type or by the `--volume-attach-limit` arg. This means there are two related but distinct issues.
The first is the incorrect max value, which doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (only the 5 series) with no arg will have a limit of 25, which is correct as long as you only have 3 extra attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work, but ones with one get stuck.
The second problem, which is what this issue is tracking, is that even when the criteria for a correctly reported max are met, it is still possible for too many pods to be scheduled onto a node.
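To make the first point concrete, here is a minimal sketch of the shared-slot arithmetic, assuming the usual ~28 shared attachment slots on non-metal Nitro instances; the exact breakdown varies by instance family, so treat the numbers as placeholders:

```bash
# Sketch only: on most non-metal Nitro instances, roughly 28 attachment slots
# are shared between ENIs, instance-store NVMe devices and EBS volumes.
TOTAL_SLOTS=28
ENIS=2                 # ENIs attached to the instance (primary + extras from custom networking)
INSTANCE_STORE_NVME=0  # non-zero on "d" instance variants
ROOT_VOLUME=1
echo "usable EBS data-volume attachments: $(( TOTAL_SLOTS - ENIS - INSTANCE_STORE_NVME - ROOT_VOLUME ))"
```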
Unless I'm mistaken, this still seems to be an issue in Kubernetes v1.28 (on EKS) with version v1.23.1 of the EBS CSI driver. The following (albeit unrealistic) example reproduces the problem by trying to schedule 26 pods, each with one PVC, in a StatefulSet onto the same node. I would hope the scheduler wouldn't do this, but instead on my nodes it gets to 24 pods and the 24th gets stuck in a Pending state complaining that it can't attach the volume. Is this likely to be fixed in an upcoming release? It's causing us major problems. Obviously we're not really trying to send 26 pods with PVCs to the same node, but intermittently in our application the scheduler tries to schedule a pod with a PVC that won't attach because the attachment quota has been breached, causing downtime and instability. Is there any workaround for this? Thanks in advance.
Hello,
… update from our side:
Our first simple workaround, as we only observed the problem yesterday (it might help others who are stuck and looking for a 'quick fix'):
- cordon the node that the pod is stuck in 'Init …' on
- delete the pod
- verify that the pod starts successfully on an alternative node; if not, repeat the cordoning until the pod is successfully deployed
- uncordon (all) node(s) upon successful deployment
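A rough `kubectl` equivalent of those steps (node, pod and namespace names are placeholders):

```bash
# Placeholders: substitute the stuck pod's node, name and namespace.
kubectl cordon <node-with-stuck-pod>
kubectl delete pod <stuck-pod> -n <namespace>
kubectl get pod <stuck-pod> -n <namespace> -o wide   # confirm it came up on a different node
kubectl uncordon <node-with-stuck-pod>
```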
Then, following a dive into the CSI EBS driver code, we passed the option `--volume-attach-limit=50` to the node driver. I haven't tested this explicitly yet, however.
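For anyone installing via the Helm chart, the flag can also be passed through chart values instead of patching the DaemonSet; this is a sketch that assumes the upstream chart's `node.volumeAttachLimit` value, so check the name against your chart version:

```bash
# Assumes the upstream aws-ebs-csi-driver Helm chart; verify that
# node.volumeAttachLimit is the correct value name for your chart version.
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=50
```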
The problem to me seems to be a missing feedback loop between the ‘node driver’ and the scheduler.
The scheduler says, “Hey, there’s a node that satisfies my scheduling criteria … I’ll schedule the workload to run there …” and the node driver says, “OK, I have a workload but I’ve reached this ‘25 attached volumes’ limit so I’m done here …”.
This is just my perhaps primitive view of the situation.
Thanks,
Phil.
PS … following a re-deployment of the CSI EBS node driver, we are still seeing the attribute `attachable-volumes-aws-ebs` set to 25 on a `kubectl describe node`:
… we weren’t expecting this.
@idanshaby The add-on schema has already been updated to include this parameter!
@ryanpxyz you are looking at the wrong place for the attach limits of the CSI driver. The attach limit of the CSI driver is reported via `CSINode` objects. If we are not rebuilding `CSINode` objects during a redeploy of the driver, that sounds like a bug. So setting `--volume-attach-limit` and redeploying the driver should set the correct limits.
As for a bug in the scheduler, here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210 . It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by `CSINode`, then that would be a k/k bug (and we are going to need one).
@stevehipwell thank you for your prompt and detailed reply and explanation. I have a clearer scope now.
Just to provide some clarity, I am trying to stress test my cluster with some workloads deployed through Helm. Sadly, the majority of those workloads require an EBS volume attachment, and this is when I ran into the error message
`AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-xxxxxxx" to node "i-xxxxxxx": attachment of disk "vol-xxxxxx" failed, expected device to be attached but was attaching`
which is what led me here in the first place. From reading both your replies and searching the docs you recommended, I realised the following:
So within the context of my use case: when I run a lot of small workloads on an `m6a` node which is otherwise capable of supporting hundreds of Pods, I am inevitably going to run into the issue of "running out" of available attachments if all my Pods require their own volume. To make matters worse, a large number of small Pods all requiring IP addresses increases the number of ENI attachments on my node, which further lowers my available EBS attachments.
So I could try bumping the version of the driver per your suggestion, but if I understood everything correctly, that wouldn't help with what I am trying to do. The "sensible" thing is to use either larger workloads to fill up my node, or workloads that don't require EBS attachments at all.
@sotiriougeorge I might be completely missing the point of your question here, and you really need to provide the actual version you're using, but I'll go on.
Firstly, as the `m6a` instance is a Nitro instance, you're physically limited to 28 attachments (see docs), of which some might also be used for non-volume attachments. So I don't see any value in you raising the limit to 70, as it's not physically possible to attach more than 27 volumes (if attaching nothing else, as they all need an ENI) to a Nitro instance.
Secondly, this thread (and others) covers the limitations in earlier versions of this CSI driver, specifically around not detecting anything other than a 5th generation instance as a Nitro instance and not calculating available attachments from the existing attachments. All of these are fixed in recent versions of the CSI driver. I've seen you've commented on https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1258 so you're aware that you should discount the `attachable-volumes-aws-ebs` value and look at the `CSINode` object instead.
My recommendation would be to update to the latest version of the driver and see how that works for you.
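For reference, a quick way to see the limit the driver actually reports via `CSINode` (assuming the driver is registered under the name `ebs.csi.aws.com`):

```bash
# Reads the allocatable volume count the EBS CSI driver reports for a node;
# this is what the scheduler's CSI volume-limits plugin consumes, not the
# Node's attachable-volumes-aws-ebs field.
kubectl get csinode <node-name> \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
```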
@jmhwang7
What I'm seeing is that the calculation is much more complex than "28 for Nitro", so I suggest trying to lower the number of volumes to something like 23 (as naturally all our `csinode` objects are at 25 max right now on Nitro).
@jmhwang7 nothing came out of there; they provided the exact "calculation" for Nitro volumes that is already used in the driver. See their answer below:
We run and manage our own k8s clusters on top of EC2 instances, but we run the aws-ebs-csi-driver to manage EBS volumes.
We discovered a pretty gnarly bug in 1.10: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1361 . After setting the `volume-attach-limit` as suggested in that issue, we started seeing exactly what @jortkoopmans detailed in this comment: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1163#issuecomment-1237132913 . Even for older nodes (multiple days/weeks old), when 2 pods get scheduled around the same time, only the N-1 pod, where N is the `volume-attach-limit` set, has its volume attached successfully, and the Nth pod's volume gets stuck in an "attaching" state.
We initially thought this was due to a race condition in the Kubernetes scheduler, and so we lowered the manually set `volume-attach-limit`. We are still getting paged/running into this issue despite the fact that the node has significantly fewer volumes attached than the EC2 instance can support (24 volumes + 1 ENI = 25; 28 is the limit for the Nitro instance; the 24th volume gets stuck in attaching).
@pkit did you have any luck with the support ticket you filed with AWS?
@jortkoopmans it looks like it's exactly #1278; I've tested it too. It's always the last two pods that fail if scheduled dynamically one after another. Essentially it's a deal breaker for any dynamic allocation of pods.
@jrsdav thanks for looking out, but that functionality sets a kubelet arg (incorrectly in most cases) and isn't related to storage attachments. This issue was never about setting the correct max value for attachments (that's a separate issue with a fix coming in the next minor version); it was about a scheduling issue that didn't make much sense.