autoscaler: Scale up from 0 does not work with existing AWS EBS CSI PersistentVolume

Which component are you using?:

  • cluster-autoscaler

What version of the component are you using?:

  • v1.18.3 (also happened with v1.18.2)

Cluster-Autoscaler Deployment YAML
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::AWS_ACCOUNT_ID_OMITTED:role/mycompany-iam-k8s-cluster-autoscaler-test
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses", "csinodes"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["batch", "extensions"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "patch"]
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
  - apiGroups: ["coordination.k8s.io"]
    resourceNames: ["cluster-autoscaler"]
    resources: ["leases"]
    verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      serviceAccountName: cluster-autoscaler
      priorityClassName: cluster-critical
      containers:
        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.3 #Major & Minor should match cluster version: https://docs.aws.amazon.com/de_de/eks/latest/userguide/cluster-autoscaler.html#ca-deploy
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycompany-test-eks
            - --ignore-daemonsets-utilization=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --balance-similar-node-groups=false
            - --min-replica-count=0
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
          imagePullPolicy: "Always"
      volumes:
        - name: ssl-certs
          hostPath:
            path: "/etc/ssl/certs/ca-bundle.crt"

What k8s version are you using (kubectl version)?:

kubectl version Output
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS EKS (eu-central-1).

What did you expect to happen?: I have an ASG dedicated to a single CronJob that gets triggered 6 times a day. That ASG is pinned to a specific AWS AZ by its assigned subnet, and the CronJob is pinned to that specific ASG via node affinity and a toleration (see the sketch below). The job uses a PV that is provisioned (AWS EBS) on the first ever run and then reused on each subsequent run. I expect the ASG to be scaled up to 1 after the Pod is created, and scaled back down shortly after the Pod/Job has finished.
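
A minimal sketch of that pinning; the node label, taint, image, and PVC names here are hypothetical placeholders, not the real values from my cluster:

apiVersion: batch/v1beta1   # CronJob API group/version on Kubernetes 1.18
kind: CronJob
metadata:
  name: masterdata-import-cronjob
  namespace: myapp-masterdata
spec:
  schedule: "0 */4 * * *"                     # 6 runs per day
  jobTemplate:
    spec:
      template:
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: node-group        # hypothetical label carried by the dedicated ASG's nodes
                        operator: In
                        values: ["masterdata-import"]
          tolerations:
            - key: dedicated                   # hypothetical taint keeping other workloads off that ASG
              operator: Equal
              value: masterdata-import
              effect: NoSchedule
          containers:
            - name: import
              image: mycompany/masterdata-import:latest   # placeholder image
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: masterdata-import-pvc          # placeholder; bound to the EBS-backed PV
          restartPolicy: Never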

What happened instead?:

The ASG will not be scaled up by the cluster-autoscaler.

cluster-autoscaler log output after the Job is created and the Pod is pending
2021-01-25T05:19:22.523Z : Starting main loop			
2021-01-25T05:19:22.524Z : "Found multiple availability zones for ASG "mycompany-test-eks-myapp-elastic-group-1-20210108154118845300000003"	 using eu-central-1a"		
2021-01-25T05:19:22.525Z : "Found multiple availability zones for ASG "mycompany-test-eks-myapp-worker-group-2-20201029130225136800000004"	 using eu-central-1a"		
2021-01-25T05:19:22.525Z : "Found multiple availability zones for ASG "mycompany-test-eks-worker-group-1-20201029130715836900000005"	 using eu-central-1a"		
2021-01-25T05:19:22.526Z : Filtering out schedulables			
2021-01-25T05:19:22.526Z : 0 pods marked as unschedulable can be scheduled.			
2021-01-25T05:19:22.526Z : No schedulable pods			
2021-01-25T05:19:22.526Z : Pod myapp-masterdata/masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw is unschedulable			
2021-01-25T05:19:22.526Z : Upcoming 0 nodes			
2021-01-25T05:19:22.526Z : Skipping node group mycompany-test-eks-myapp-elastic-group-1-20210108154118845300000003 - max size reached			
2021-01-25T05:19:22.526Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-elastic-group-2-20201029130715759300000004, predicate checking error: node(s) didn't match node selector	 predicateName=NodeAffinity	 reasons: node(s) didn't match node selector	 debugInfo="
2021-01-25T05:19:22.526Z : No pod can fit to mycompany-test-eks-myapp-elastic-group-2-20201029130715759300000004			
2021-01-25T05:19:22.526Z : "Could not get a CSINode object for the node "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836": csinode.storage.k8s.io "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836" not found"			
2021-01-25T05:19:22.527Z : "PersistentVolume "pvc-ef85dcce-e63e-42da-b869-c3389bbd948d", Node "template-node-for-mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003-8426967936887117836" mismatch for Pod "myapp-masterdata/masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw": No matching NodeSelectorTerms"			
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003, predicate checking error: node(s) had volume node affinity conflict	 predicateName=VolumeBinding	 reasons: node(s) had volume node affinity conflict	 debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-myapp-masterdata-import-20210120105639236000000003			
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-myapp-worker-group-120200916154409048800000006, predicate checking error: node(s) didn't match node selector	 predicateName=NodeAffinity	 reasons: node(s) didn't match node selector	 debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-myapp-worker-group-120200916154409048800000006			
2021-01-25T05:19:22.527Z : Skipping node group mycompany-test-eks-myapp-worker-group-2-20201029130225136800000004 - max size reached			
2021-01-25T05:19:22.527Z : Skipping node group mycompany-test-eks-worker-group-1-20201029130715836900000005 - max size reached			
2021-01-25T05:19:22.527Z : "Pod masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw can't be scheduled on mycompany-test-eks-worker-group-220200916162252020100000006, predicate checking error: node(s) didn't match node selector	 predicateName=NodeAffinity	 reasons: node(s) didn't match node selector	 debugInfo="
2021-01-25T05:19:22.527Z : No pod can fit to mycompany-test-eks-worker-group-220200916162252020100000006			
2021-01-25T05:19:22.527Z : No expansion options			
2021-01-25T05:19:22.527Z : Calculating unneeded nodes			
[...]
2021-01-25T05:19:22.528Z : Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s			
2021-01-25T05:19:22.528Z : Scale down status: unneededOnly=false lastScaleUpTime=2021-01-25 05:00:14.980160831 +0000 UTC m=+6970.760701246 lastScaleDownDeleteTime=2021-01-25 03:04:22.928996296 +0000 UTC m=+18.709536671 lastScaleDownFailTime=2021-01-25 03:04:22.928996376 +0000 UTC m=+18.709536751 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false			
2021-01-25T05:19:22.528Z : Starting scale down			
2021-01-25T05:19:22.528Z : No candidates for scale down			
2021-01-25T05:19:22.528Z : "Event(v1.ObjectReference{Kind:"Pod", Namespace:"myapp-masterdata", Name:"masterdata-import-cronjob-lambda-d0ad7add-e9b0-424e-94dc-0wbrzw", UID:"97956c38-55f3-4749-ab74-7e7fc674e832", APIVersion:"v1", ResourceVersion:"217276797", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 max node group size reached, 3 node(s) didn't match node selector, 1 node(s) had volume node affinity conflict"			
2021-01-25T05:19:22.946Z : k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309: Watch close - *v1beta1.PodDisruptionBudget total 0 items received			
2021-01-25T05:19:32.542Z : Starting main loop			

Anything else we need to know?: This works fine without the volume. With the volume, it works when the volume has not been provisioned yet, but fails once it has already been provisioned. The Job also gets scheduled right away when I manually scale up the ASG.

I noticed the node affinity on the PV:

Node Affinity:
  Required Terms:
    Term 0:        topology.ebs.csi.aws.com/zone in [eu-central-1b]

That label is probably set on the node by the “ebs-csi-node” DaemonSet and is therefore unknown to the cluster-autoscaler.
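
For reference, this is what the equivalent field looks like in the PV spec (reconstructed from the describe output above):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - eu-central-1b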

Am I expected to tag the ASG with k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone? If so, how am I supposed to set it on a multi-AZ ASG?

Possibly related: https://github.com/kubernetes/autoscaler/issues/3230

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 13
  • Comments: 26 (2 by maintainers)

Most upvoted comments

Yes, but when the ASG is at 0, there are no nodes. cluster-autoscaler needs the labels tagged on the ASG to know what labels a node would have if it scaled the ASG up from 0.

k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone is the approach I am taking, and it works like a charm.

I can do some footwork in Terraform to get the tags set up. Not sure what you’re using to provision your cluster.

It would be nice, though, to have the labels generated from the list of AZs assigned to an ASG.
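
For anyone not on Terraform, here is a minimal sketch of the same tagging as a CloudFormation fragment. The resource name, subnet ID, and launch template ID are placeholders; only the tag keys and the cluster name are taken from this issue:

Resources:
  MasterdataImportAsgEuCentral1b:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "0"
      MaxSize: "1"
      VPCZoneIdentifier:
        - subnet-0123456789abcdef0            # placeholder: a subnet in eu-central-1b only
      LaunchTemplate:
        LaunchTemplateId: lt-0123456789abcdef0 # placeholder
        Version: "1"
      Tags:
        - Key: k8s.io/cluster-autoscaler/enabled
          Value: "true"
          PropagateAtLaunch: false
        - Key: k8s.io/cluster-autoscaler/mycompany-test-eks
          Value: owned
          PropagateAtLaunch: false
        - Key: k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone
          Value: eu-central-1b                # must match the AZ of the subnet above
          PropagateAtLaunch: false

cluster-autoscaler reads the node-template tags from the ASG itself, so they do not need to propagate to the instances.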

Also, from your comment, what do you mean by “when your ASG is at 0”? Do you mean setting the desired count to 0?

@FarhanSajid1 you should have one node group (and thus one ASG) for each AZ. The above tag then needs to be applied to each ASG.
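
For example (ASG names are made up), each single-AZ ASG gets the zone tag matching its own subnet, so the scale-from-0 simulation picks the right zone per group:

# one ASG per AZ; each carries its own zone in the node-template tag
mycompany-test-eks-masterdata-import-1a:
  k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone: eu-central-1a
mycompany-test-eks-masterdata-import-1b:
  k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone: eu-central-1b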