kubernetes: When a new node joins the cluster - scheduler doesn't respect CSI volume limit
What happened: I’m using the aws-ebs-csi-driver to attach pre-provisioned EBS volumes to my pods. I’ve customized the max attach limit to 1 (for testing purposes); in production I intend to use a low value (around 5–8). The scheduler respects the attach limit, and pods remain Pending until volumes are detached.
The issue I’m having is when I add a new node (via cluster-autoscaler or manually): as soon as the CSI driver's node pod is live on the new node, all pending pods are scheduled to it, ignoring and exceeding the attach limit.
In my testing I got up to 10 pods scheduled to a single node, far exceeding the limit of 1.
After reading up on the CSI driver and the attach process, I believe the issue is in either the scheduler or the controller: they start scheduling pods to the new node before the CSI driver has finished publishing its configuration (including the attach limit) to the cluster.
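One way to observe this window directly (my suggestion, not part of the original report) is to watch CSINode objects while the new node joins; the scheduler's volume-limit check depends on the driver's allocatable count appearing there:

```shell
# Watch CSINode objects as nodes join. A new node's entry initially lists no
# ebs.csi.aws.com driver (and no allocatable count); until it appears, the
# scheduler has no attach limit to enforce for that node.
kubectl get csinode -w -o custom-columns=\
'NAME:.metadata.name,DRIVERS:.spec.drivers[*].name,COUNT:.spec.drivers[*].allocatable.count'
```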
My current workaround: all my pods carry a node affinity on a foo label, and 30 seconds after the CSI node pod comes up I patch the new node with that label. That neutralizes the race condition, but obviously this isn’t a very good solution.
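The workaround above can be sketched as a small script (a hedged sketch: the `app=ebs-csi-node` pod label, the `kube-system` namespace, and the `foo=ready` node label are assumptions, not taken from the original report):

```shell
#!/usr/bin/env bash
# Sketch of the workaround: once the CSI node pod on a freshly joined node
# becomes Ready, wait 30 seconds, then label the node so pods with a
# matching nodeAffinity can schedule there.
set -euo pipefail

NODE="$1"  # name of the newly joined node

# Find the EBS CSI node DaemonSet pod running on this node
# (pod label and namespace assumed; adjust to your driver deployment)
POD=$(kubectl get pods -n kube-system -l app=ebs-csi-node \
  --field-selector spec.nodeName="${NODE}" -o name)

# Block until that pod reports Ready
kubectl wait -n kube-system --for=condition=Ready "${POD}" --timeout=120s

# Grace period for the driver to publish its CSINode allocatable count
sleep 30

# Unblock scheduling: the pods require this label via nodeAffinity
kubectl label node "${NODE}" foo=ready
```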
What you expected to happen: The scheduler should not schedule pods to a node until it knows the node’s max attach limit for the relevant CSI volumes.
How to reproduce it (as minimally and precisely as possible):
- Install the EBS CSI driver with the flag `--volume-attach-limit=1`
- Example Pod + PV + PVC manifests:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pod1-pv-new
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 20Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pod1-pvc
    namespace: default
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-003ee7a245e6b2cd2 # note that I'm using a pre-provisioned volume
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - ca-central-1a
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-sc
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod1-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: csi-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # This is the label I'm adding to new nodes after CSI node pod is up
          # - matchExpressions:
          #     - key: yarin-test
          #       operator: Exists
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - ca-central-1a
  containers:
    - image: ubuntu:latest
      # Just spin & wait forever
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      name: test-container
      volumeMounts:
        - mountPath: /test-ebs
          name: persistent-storage
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: pod1-pvc
```
Storage class:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```
- Apply the podX.yaml several times with different volumes
- Verify that all pods except one are in the “Pending” state due to attach-limit enforcement
- Add a new node to the cluster
- Once the CSI node DaemonSet pod is up on the new node, you can see all the other pending pods get assigned to that node
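After the new node joins, you can check what attach limit the scheduler actually sees for it (a hedged check of my own; the jsonpath filters on the `ebs.csi.aws.com` driver name used in the manifests above):

```shell
NODE="new-node-name"   # placeholder: the newly added node's name

# The scheduler reads the per-driver attach limit from the CSINode object;
# if this prints nothing, the limit has not been published yet and pods can
# over-schedule onto the node.
kubectl get csinode "${NODE}" \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'
# With --volume-attach-limit=1 this should print 1 once the driver has registered.
```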
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.17
- Cloud provider or hardware configuration: EKS on AWS
- I’m using the latest csi-provisioner / csi-attacher / node-registrar