kubernetes: When a new node joins the cluster - scheduler doesn't respect CSI volume limit

What happened: I’m using the aws-ebs-csi-driver to attach pre-provisioned EBS volumes to my pods. I’ve set the max attach limit to 1 for testing purposes, but I intend to use a low value (around 5-8) in production. The scheduler normally respects the attach limit, and pods stay Pending until volumes are detached.

The issue appears when I add a new node (using cluster-autoscaler or manually): as soon as the CSI driver’s node pod is live on the new node, all pending pods are scheduled to that node, ignoring the attach limit and exceeding it.

In my testing, up to 10 pods were scheduled onto the new node, well over the limit of 1.

After learning more about the CSI driver and how the process actually works, I believe the issue is in either the scheduler or the controller. They probably start scheduling pods to the new node before the CSI driver has finished publishing all the relevant values to the cluster.
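To make the suspected race concrete, here is a toy Python model. It is not the scheduler's real code: the function names are hypothetical, and it assumes (my guess at the behavior, not something I have confirmed) that the volume-limit check passes whenever the node has not yet published a limit for the driver.

```python
from typing import Optional


def fits_attach_limit(published_limit: Optional[int], attached: int, requested: int) -> bool:
    """Toy volume-limit check for one node and one CSI driver.

    published_limit is None while the driver has not yet reported its
    limit for the node.
    """
    if published_limit is None:
        # Assumed behavior: with no known limit there is nothing to enforce.
        return True
    return attached + requested <= published_limit


def schedule(pending: int, published_limit: Optional[int]) -> int:
    """Try to place `pending` single-volume pods on one empty node.

    Returns how many of them pass the check and land on the node.
    """
    attached = 0
    for _ in range(pending):
        if fits_attach_limit(published_limit, attached, 1):
            attached += 1
    return attached
```

In this model, `schedule(10, None)` places all 10 pods on the node, while `schedule(10, 1)` places only one, which matches what I see before and after the limit is published.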

My current workaround is to give all my pods an affinity to a foo label and, 30 seconds after the CSI node pod is up, patch the node with the foo label. That neutralizes the race condition, but obviously this isn’t a very good solution.
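A possible refinement of this workaround is to replace the fixed 30-second delay with a check on the node's CSINode object, and label the node only once the driver has published its limit. The sketch below is a minimal Python helper under stated assumptions: it takes the CSINode already parsed from JSON (for example the output of `kubectl get csinode <node-name> -o json`), the function name is hypothetical, and it relies on the `spec.drivers[].allocatable.count` field of the `storage.k8s.io/v1` CSINode API.

```python
from typing import Optional


def published_attach_limit(csinode: dict, driver: str = "ebs.csi.aws.com") -> Optional[int]:
    """Return the attach limit `driver` has published on this CSINode.

    Returns None if the driver is not registered on the node yet, or if
    it is registered but has not reported an allocatable count.
    """
    spec = csinode.get("spec") or {}
    for entry in spec.get("drivers") or []:
        if entry.get("name") == driver:
            allocatable = entry.get("allocatable") or {}
            return allocatable.get("count")
    return None
```

A polling script could apply the foo label only after this returns a number. That still does not guarantee the scheduler has observed the same update, so it narrows the race rather than removing it.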

What you expected to happen: The scheduler should not start scheduling pods onto a new node until it knows that node’s max attach limit for the relevant CSI volumes.

How to reproduce it (as minimally and precisely as possible):

  1. Install the EBS CSI driver with the flag --volume-attach-limit=1
  2. Example PV + PVC + pod manifests:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pod1-pv-new
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 20Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pod1-pvc
    namespace: default
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-003ee7a245e6b2cd2 # note that I'm using a pre-provisioned volume
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - ca-central-1a
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-sc
  volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pod1-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: csi-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # This is the label I'm adding to new nodes after the CSI node pod is up
              # - key: yarin-test
              #   operator: Exists
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - ca-central-1a
  containers:
    - image: ubuntu:latest
      # Just spin & wait forever
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      name: test-container
      volumeMounts:
        - mountPath: /test-ebs
          name: persistent-storage
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: pod1-pvc

Storage class:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: csi-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
  3. Apply podX.yaml several times, each time with a different volume
  4. Make sure all pods except one are in the “Pending” state due to attach limit enforcement
  5. Add a new node to the cluster
  6. Once the CSI driver daemonset pod is up on the new node, you can see that all the other pending pods are assigned to the new node.
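To confirm the result of the last step, the pod list can be checked for nodes that carry more volumes than the limit. The helper below is a rough Python sketch, not part of the original reproduction: the function name is hypothetical, it expects the `items` list from `kubectl get pods -o json`, and it assumes every PVC-backed volume in the test pods is a distinct EBS volume handled by the driver.

```python
from collections import Counter


def nodes_over_limit(pods: list, limit: int = 1) -> dict:
    """Map node name -> PVC-backed volume count, for nodes above `limit`."""
    counts = Counter()
    for pod in pods:
        spec = pod.get("spec") or {}
        node = spec.get("nodeName")
        if not node:
            continue  # pod is still Pending and not assigned to a node
        pvc_volumes = [
            v for v in (spec.get("volumes") or [])
            if "persistentVolumeClaim" in v
        ]
        counts[node] += len(pvc_volumes)
    return {node: n for node, n in counts.items() if n > limit}
```

With the limit set to 1, the new node should show up in the result if the scheduler ignored the limit, and the result should be empty if the limit was enforced.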

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.17
  • Cloud provider or hardware configuration: EKS on AWS
  • I’m using the latest csi-provisioner/csi-attacher/node-registrar

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 5
  • Comments: 27 (17 by maintainers)
