autoscaler: Cluster autoscaler version 1.16.0 doesn't notice pending pods
We’ve been using cluster autoscaler version 1.15.0, patched with #2008, in AWS for some time, to good effect. Today we attempted to put the new version 1.16.0 into service with all of the same configuration, and found that it no longer seems to notice pending pods.
The cluster autoscaler starts fine, and the logs don’t indicate anything failing. It goes through its periodic “main loop” and the “Regenerating instance to ASG map for ASGs” step regularly, again without any obvious problems. However, when we create pods that should prompt the cluster autoscaler to adjust a suitable ASG’s size, its logs show no evidence that it notices those pods. In prior versions, we would see messages to the following effect:
- Pod ns/name is unschedulable
- Pod name can't be scheduled on node, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient nvidia.com/gpu
- Event(v1.ObjectReference{Kind:"Pod", Namespace:"ns", Name:"name", UID:"ae32b9cc-081e-4323-ba33-7810457a0ddf", APIVersion:"v1", ResourceVersion:"58735432", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{asg 0->13 (max: 125)}]
Instead, the new cluster autoscaler exhibits no reaction to these pending pods.
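For reference, the pods in question sit in the Pending phase, unbound to any node, with a PodScheduled condition of False and reason Unschedulable, which is what we understand the autoscaler keys off of. Below is a minimal sketch (assuming a recent client-go release whose List method takes a context, and the default kubeconfig path; the program is illustrative, not part of our setup) of how one might list such pods:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List pods that are still Pending and not yet bound to a node.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=,status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}

	// Report the pods whose PodScheduled condition is False with reason
	// "Unschedulable", i.e. the ones we expect the autoscaler to react to.
	for _, p := range pods.Items {
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse && c.Reason == corev1.PodReasonUnschedulable {
				fmt.Printf("%s/%s is unschedulable: %s\n", p.Namespace, p.Name, c.Message)
			}
		}
	}
}
```

If something like this shows the pods as unschedulable while the autoscaler stays silent, the problem would seem to lie on the autoscaler’s side rather than with the pods themselves.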
Here are the flag values reported at start time:
flags.go:52] FLAG: --add-dir-header="false"
flags.go:52] FLAG: --address=":8085"
flags.go:52] FLAG: --alsologtostderr="false"
flags.go:52] FLAG: --balance-similar-node-groups="false"
flags.go:52] FLAG: --cloud-config=""
flags.go:52] FLAG: --cloud-provider="aws"
flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
flags.go:52] FLAG: --cluster-name=""
flags.go:52] FLAG: --cores-total="0:320000"
flags.go:52] FLAG: --estimator="binpacking"
flags.go:52] FLAG: --expander="least-waste"
flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
flags.go:52] FLAG: --gpu-total="[]"
flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
flags.go:52] FLAG: --ignore-taint="[]"
flags.go:52] FLAG: --kubeconfig=""
flags.go:52] FLAG: --kubernetes=""
flags.go:52] FLAG: --leader-elect="true"
flags.go:52] FLAG: --leader-elect-lease-duration="15s"
flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
flags.go:52] FLAG: --leader-elect-resource-name=""
flags.go:52] FLAG: --leader-elect-resource-namespace=""
flags.go:52] FLAG: --leader-elect-retry-period="2s"
flags.go:52] FLAG: --log-backtrace-at=":0"
flags.go:52] FLAG: --log-dir=""
flags.go:52] FLAG: --log-file=""
flags.go:52] FLAG: --log-file-max-size="1800"
flags.go:52] FLAG: --logtostderr="true"
flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
flags.go:52] FLAG: --max-empty-bulk-delete="10"
flags.go:52] FLAG: --max-failing-time="15m0s"
flags.go:52] FLAG: --max-graceful-termination-sec="600"
flags.go:52] FLAG: --max-inactivity="10m0s"
flags.go:52] FLAG: --max-node-provision-time="3m0s"
flags.go:52] FLAG: --max-nodes-total="0"
flags.go:52] FLAG: --max-total-unready-percentage="45"
flags.go:52] FLAG: --memory-total="0:6400000"
flags.go:52] FLAG: --min-replica-count="0"
flags.go:52] FLAG: --namespace="our-system"
flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=kubernetes.io/cluster-autoscaler/enabled,kubernetes.io/cluster/redacted]"
flags.go:52] FLAG: --nodes="[]"
flags.go:52] FLAG: --ok-total-unready-count="3"
flags.go:52] FLAG: --regional="false"
flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
flags.go:52] FLAG: --scale-down-delay-after-add="3m0s"
flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
flags.go:52] FLAG: --scale-down-enabled="true"
flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
flags.go:52] FLAG: --scale-down-non-empty-candidates-count="50"
flags.go:52] FLAG: --scale-down-unneeded-time="13m0s"
flags.go:52] FLAG: --scale-down-unready-time="7m0s"
flags.go:52] FLAG: --scale-down-utilization-threshold="0.5"
flags.go:52] FLAG: --scale-up-from-zero="true"
flags.go:52] FLAG: --scan-interval="10s"
flags.go:52] FLAG: --skip-headers="false"
flags.go:52] FLAG: --skip-log-headers="false"
flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
flags.go:52] FLAG: --skip-nodes-with-system-pods="true"
flags.go:52] FLAG: --stderrthreshold="0"
flags.go:52] FLAG: --test.bench=""
flags.go:52] FLAG: --test.benchmem="false"
flags.go:52] FLAG: --test.benchtime="1s"
flags.go:52] FLAG: --test.blockprofile=""
flags.go:52] FLAG: --test.blockprofilerate="1"
flags.go:52] FLAG: --test.count="1"
flags.go:52] FLAG: --test.coverprofile=""
flags.go:52] FLAG: --test.cpu=""
flags.go:52] FLAG: --test.cpuprofile=""
flags.go:52] FLAG: --test.failfast="false"
flags.go:52] FLAG: --test.list=""
flags.go:52] FLAG: --test.memprofile=""
flags.go:52] FLAG: --test.memprofilerate="0"
flags.go:52] FLAG: --test.mutexprofile=""
flags.go:52] FLAG: --test.mutexprofilefraction="1"
flags.go:52] FLAG: --test.outputdir=""
flags.go:52] FLAG: --test.parallel="8"
flags.go:52] FLAG: --test.run=""
flags.go:52] FLAG: --test.short="false"
flags.go:52] FLAG: --test.testlogfile=""
flags.go:52] FLAG: --test.timeout="0s"
flags.go:52] FLAG: --test.trace=""
flags.go:52] FLAG: --test.v="false"
flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
flags.go:52] FLAG: --v="4"
flags.go:52] FLAG: --vmodule=""
flags.go:52] FLAG: --write-status-configmap="true"
main.go:363] Cluster Autoscaler 1.16.0
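As for ASG discovery, the “Regenerating instance to ASG map for ASGs” messages suggest our groups are still being found, and our ASGs carry the tag keys named in --node-group-auto-discovery above (as we understand it, auto-discovery requires an ASG to bear all of the listed keys). For what it’s worth, here is a minimal sketch (assuming aws-sdk-go v1 and its default credential and region resolution; “redacted” stands in for our real cluster name, as in the flag value) of a DescribeTags query one could use to double-check which ASGs carry those keys:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	// Use the default credential and region resolution chain
	// (environment variables, shared config, or instance profile).
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Ask for ASG tags whose keys match the two auto-discovery tag keys
	// from --node-group-auto-discovery; "redacted" is a placeholder, as in
	// the flag value above.
	input := &autoscaling.DescribeTagsInput{
		Filters: []*autoscaling.Filter{
			{
				Name: aws.String("key"),
				Values: []*string{
					aws.String("kubernetes.io/cluster-autoscaler/enabled"),
					aws.String("kubernetes.io/cluster/redacted"),
				},
			},
		},
	}

	err := svc.DescribeTagsPages(input, func(page *autoscaling.DescribeTagsOutput, lastPage bool) bool {
		for _, t := range page.Tags {
			// ResourceId is the name of the ASG carrying the tag.
			fmt.Printf("ASG %s has tag %s\n", aws.StringValue(t.ResourceId), aws.StringValue(t.Key))
		}
		return true // keep paging
	})
	if err != nil {
		panic(err)
	}
}
```

Note that this query matches ASGs bearing either key, so each returned group still needs to be checked for both.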
I didn’t find any other issues reporting this problem, which surprises me: this release has been out for seven days now, so I would have expected someone else to have run into it by then.
Reverting to our previous patched container image worked fine, but we’d like to move forward. Is there some new configuration that we need to adjust in order to restore the previous behavior of the cluster autoscaler?
Hello @seh @losipiuk, I am again facing an issue with the liveness probe while using the image k8s.gcr.io/autoscaling/cluster-autoscaler:v1.16.7. Thanks for checking it again.