instance-manager: Cluster Autoscaler unable to scale up nodes from EKS autoscaling group warm pool when pods request ephemeral storage

Is this a BUG REPORT or FEATURE REQUEST?:

A bug report.

What happened:

Cluster Autoscaler is unable to scale up nodes from the warm pool of an InstanceGroup with the eks provisioner when unschedulable pods request ephemeral storage:

I0215 16:21:32.873817       1 klogx.go:86] Pod nodegroup-runners/iris-kdr5j-mhntj is unschedulable
I0215 16:21:32.873855       1 scale_up.go:376] Upcoming 1 nodes
I0215 16:21:32.873910       1 scale_up.go:300] Pod iris-kdr5j-mhntj can't be scheduled on eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0215 16:21:32.873939       1 scale_up.go:449] No pod can fit to eks-EksClusterNodegroupDefaultM-lckreA32Rf3D-a6bf610c-e30e-48d5-e342-47ed2155eac7
I0215 16:21:32.873989       1 scale_up.go:300] Pod iris-kdr5j-mhntj can't be scheduled on tm-github-runners-cluster-nodegroup-runners-iris, predicate checking error: Insufficient ephemeral-storage; predicateName=NodeResourcesFit; reasons: Insufficient ephemeral-storage; debugInfo=
I0215 16:21:32.874023       1 scale_up.go:449] No pod can fit to tm-github-runners-cluster-nodegroup-runners-iris
I0215 16:21:32.874050       1 scale_up.go:453] No expansion options

This is the configured InstanceGroup in question:

apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: iris
  namespace: nodegroup-runners
  annotations:
    instancemgr.keikoproj.io/cluster-autoscaler-enabled: 'true'
spec:
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxUnavailable: 5
  provisioner: eks
  eks:
    minSize: 1
    maxSize: 10
    warmPool:
      minSize: 0
      maxSize: 10
    configuration:
      labels:
        workload: runners
      keyPairName: iris-github-runner-keypair
      clusterName: tm-github-runners-cluster
      image: ami-0778893a848813e52
      instanceType: c6i.2xlarge

I’m attempting to deploy pods via this RunnerDeployment from the actions-runner-controller. If I remove the ephemeral-storage requests & limits, then Cluster Autoscaler is able to scale up nodes from the warm pool as expected.

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: iris
  namespace: nodegroup-runners
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload
                operator: In
                values:
                - runners
      ephemeral: true
      repository: MyOrg/iris
      labels:
        - self-hosted
      dockerEnabled: false
      image: ghcr.io/my-org/self-hosted-runners/iris:v7
      imagePullSecrets:
        - name: github-container-registry
      containers:
        - name: runner
          imagePullPolicy: IfNotPresent
          env:
            - name: RUNNER_FEATURE_FLAG_EPHEMERAL
              value: "true"
          resources:
            requests:
              cpu: "1.0"
              memory: "2Gi"
              ephemeral-storage: "10Gi"
            limits:
              cpu: "2.0"
              memory: "4Gi"
              ephemeral-storage: "10Gi"

What you expected to happen:

Cluster Autoscaler should be able to scale up nodes from the warm pool when there are unschedulable pods that request ephemeral storage.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy the above instance group (using your own subnet, cluster, and security groups) into a cluster running Cluster Autoscaler

  2. Deploy any pods with the appropriate node affinity and requests & limits for ephemeral storage (a minimal example follows this list)

  3. Check the Cluster Autoscaler logs and note the failure to scale up
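
For step 2, a minimal Pod along these lines should reproduce the same behaviour (the pod name and pause image are placeholders; the workload: runners node affinity matches the InstanceGroup above):

apiVersion: v1
kind: Pod
metadata:
  # hypothetical test pod - the name and image are placeholders
  name: ephemeral-storage-test
  namespace: nodegroup-runners
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: workload
            operator: In
            values:
            - runners
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
        ephemeral-storage: 10Gi
      limits:
        ephemeral-storage: 10Gi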

Environment:

  • Kubernetes version: 1.21
$ kubectl version -o yaml
clientVersion:
  buildDate: "2021-08-19T15:45:37Z"
  compiler: gc
  gitCommit: 632ed300f2c34f6d6d15ca4cef3d3c7073412212
  gitTreeState: clean
  gitVersion: v1.22.1
  goVersion: go1.16.7
  major: "1"
  minor: "22"
  platform: darwin/amd64
serverVersion:
  buildDate: "2021-10-29T23:32:16Z"
  compiler: gc
  gitCommit: 5236faf39f1b7a7dabea8df12726f25608131aa9
  gitTreeState: clean
  gitVersion: v1.21.5-eks-bc4871b
  goVersion: go1.16.8
  major: "1"
  minor: 21+
  platform: linux/amd64

Other debugging information (if applicable):

None available at the moment, as we’ve deprovisioned our test setup, but if needed we can reproduce the issue and post additional logs here.

Most upvoted comments

I think this is specifically a ‘scale from zero’ problem - where CA uses only the tags on the ASG to determine the capacity of a potential new node. I believe this is possible to implement in Instance-Manager - we’d have to inspect the volumes attached to the IG, determine which volume is used for ephemeral storage, and tag the ASG accordingly.
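
For context, these are the kinds of node-template tags the Cluster Autoscaler AWS provider reads off an ASG when building a template node for scale-from-zero (the values below are placeholders to illustrate the format):

# hypothetical ASG tags, written as key: value
k8s.io/cluster-autoscaler/node-template/label/workload: runners
k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: "100Gi"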

@backjo Do we want to implement some fix for this? I think this would be slightly problematic considering you don’t know which volume the application will use… e.g. someone could be attaching EBS volumes or using the instance store, so the storage size would be dynamic from the perspective of the controller. The workaround of adding a tag to the IG spec is actually pretty good:

k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: ?

We could have an annotation to add this tag, but that might be a bit redundant.

Any thoughts?

I’m not sure that we’ll be able to be correct enough to implement this well. As you mention, we generally will not know exactly which volume the Node has configured for ephemeral storage. I think the workaround here is suitable for folks. Maybe we could update the documentation around the cluster autoscaler annotation, though, to help awareness of the limitation here.

Great, sounds good - thanks for the pointer. I’ll dig through CloudWatch and let you know what I find.

In this case you need to understand why the nodes are not joining the cluster; the instance-manager controller was also showing this in its logs:

controllers.instancegroup.eks   desired nodes are not ready     {"instancegroup": "nodegroup-runners/iris", "instances": "i-0616a2e0e0d595ddc,i-09d55075b02cbc6b6"}

I would look at the nodes’ kubelet logs & control plane logs to see why they are not joining (you can probably find the control plane logs in CloudWatch Logs, and the kubelet logs by SSHing to the nodes)

Went over the pictures you added again - are you saying the instances that are spun up in the live pool are not joining the cluster? The instances in the live pool should definitely join the cluster, so there could be another issue here.

The warm pool instances are not supposed to be part of the cluster; they are spun up and shut down while in the warm pool. Later, when autoscaling happens, instead of spinning up new instances, instances from the warm pool simply move to the ‘live’ pool and are powered on. The ‘faster’ scaling here is because we only spend time waiting for the instance to boot instead of boot + provision.

As long as you see the instances moving from the warm pool to the main pool when scaling happens, this part should be fine… you can also test the same thing without warm pools to rule them out.

Great - if nodes scale out now, we have an approach for adding support for the autoscaler as part of the cluster-autoscaler-enabled annotation (thanks @backjo 😀).

If you now have a scheduling issue, can you compare the affinity to make sure it’s matching, e.g. the pod’s node affinity vs. the node labels?

It would also be good to experiment with tagging the ASG via configuration.tags with k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage: "xGi" (replace x with the volume size on your nodes).
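
For example, a sketch of what that might look like in the InstanceGroup spec above (the 100Gi value is a placeholder - use the actual volume size on your nodes, and double-check the configuration.tags shape against the instance-manager docs for your version):

spec:
  eks:
    configuration:
      tags:
        # hypothetical value - set this to the node's actual ephemeral storage capacity
        - key: k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage
          value: "100Gi"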