autoscaler: Scale-down causes downtime while pods are moved to another node (AWS EKS)

I have set up a K8s cluster using EKS. CA has been configured to increase/decrease the number of nodes based on resource availability for pods. During scale-down, CA terminates a node before the pods running on it have been moved to another node, so those pods only get scheduled onto another node after the original node has been terminated. Hence, there is some downtime until the rescheduled pods become healthy on the new node.

How can I avoid this downtime by ensuring that the pods get scheduled on another node before the old node is terminated?

Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.12.3
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
            - --balance-similar-node-groups=true
          env:
            - name: AWS_REGION
              value: eu-central-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/kubernetes/pki/ca.crt"

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 49 (14 by maintainers)

Most upvoted comments

Please, can someone explain how CA can drain a specific node up front, before the AWS ASG decides which node exactly to terminate on a scale-down event?

E.g. when CA changes the desired node count from 3 to 2 and the ASG then starts terminating a random node, how does CA know which node to drain?

I had a few days off and have only just seen the message. As Alexa said, “CA always requests deleting a specific node.” The ASG won’t randomly terminate a node. Instead, CA drains a specific node and the ASG terminates that node by its instance ID.
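For illustration, here is a minimal Go sketch of that kind of request using the aws-sdk-go v1 autoscaling client (the region and instance ID below are placeholders, not values taken from CA's code): terminating a named instance while decrementing the desired capacity in a single call means the ASG never has to pick a victim on its own.

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
    sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-central-1")))
    svc := autoscaling.New(sess)

    // Terminate one specific instance and shrink the ASG's desired capacity
    // in the same call, so the ASG does not get to choose a random node.
    out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
        InstanceId:                     aws.String("i-0123456789abcdef0"), // hypothetical instance ID of the drained node
        ShouldDecrementDesiredCapacity: aws.Bool(true),
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out.Activity)
}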

The root problem behind this issue is that CA uses the evict API rather than the drain API, but the service controller uses the NotSchedulable marker to remove a node from load balancer endpoints. Only the drain API marks the node as such; the evict API does not.
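To make that distinction concrete, here is a hedged client-go sketch (not CA's actual code; the node name is a placeholder) of what a drain does beyond bare eviction: it first cordons the node, i.e. marks it unschedulable, which per the comment above is the signal the service controller uses to pull the node out of load balancer endpoints, and only then evicts the pods.

package main

import (
    "context"
    "log"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)
    ctx := context.Background()
    nodeName := "ip-10-0-1-23.eu-central-1.compute.internal" // hypothetical node

    // Step 1 - what a drain adds: cordon the node (mark it unschedulable),
    // which is what the service controller reacts to.
    node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        log.Fatal(err)
    }
    node.Spec.Unschedulable = true
    if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
        log.Fatal(err)
    }

    // Step 2 - evict the pods on that node (this part respects PDBs).
    // Doing only this step, without the cordon, matches the behaviour the
    // comment above describes: the node keeps receiving LB traffic until it is gone.
    pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
        FieldSelector: "spec.nodeName=" + nodeName,
    })
    if err != nil {
        log.Fatal(err)
    }
    for _, p := range pods.Items {
        eviction := &policyv1.Eviction{
            ObjectMeta: metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
        }
        if err := client.CoreV1().Pods(p.Namespace).EvictV1(ctx, eviction); err != nil {
            log.Printf("evict %s/%s: %v", p.Namespace, p.Name, err)
        }
    }
}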

Are there any plans to address this issue, or any workarounds that don’t involve downtime?

I have an EKS cluster (v1.22) with cluster-autoscaler installed, where the most common/important workload is an app that needs to run as a single-replica Deployment and has a startup sequence that takes about a minute. This essentially means that any rebalancing/scale-down action by CA leads to downtime of the said app (roughly equal to the readiness probe delay used). Unfortunately, PDBs cannot be used to solve this particular case, as they just lead to unremovable nodes.

There is no workaround I can think of, and I don’t think there will be one anytime soon: you can’t possibly remove a node without stopping the pods running on it. So your options are really either to prevent scale-down of nodes that run single-replica pods (using a PDB, an annotation, etc.), which will hurt utilization, or to run multiple replicas of each app (along with a PDB) to allow scale-down without disruption.
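For the multi-replica route, a minimal client-go sketch (the app name, namespace, and selector are placeholders): with replicas >= 2 and a PDB like this, CA can evict one pod at a time during scale-down without ever taking the app to zero.

package main

import (
    "context"
    "log"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        log.Fatal(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Keep at least one "myapp" pod available at all times; with two or more
    // replicas this lets CA drain a node without taking the app fully down.
    minAvailable := intstr.FromInt(1)
    pdb := &policyv1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{Name: "myapp", Namespace: "default"}, // hypothetical app
        Spec: policyv1.PodDisruptionBudgetSpec{
            MinAvailable: &minAvailable,
            Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "myapp"}},
        },
    }
    if _, err := client.PolicyV1().PodDisruptionBudgets("default").Create(
        context.Background(), pdb, metav1.CreateOptions{}); err != nil {
        log.Fatal(err)
    }
}

The other route (preventing scale-down of nodes running such pods) is typically done with the cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation on the pod template, at the cost of nodes CA will never remove.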

CA doesn’t have functionality to add a replacement pod before deleting a node, and I don’t think we’re going to add it in the foreseeable future (probably never). Fundamentally, CA operates on pods and nodes; it doesn’t really have an abstraction for a Deployment or any other collection of pods. It would probably require huge changes to add something like this, and it may be very hard to do without crippling CA performance in very large clusters.

Relevant API call: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/auto_scaling_groups.go#L242

@Jeffwan can you explain what’s going on here? This doesn’t sound like a Cluster Autoscaler issue, right?