autoscaler: CA failed to load Instance Type list unless configured with hostNetworking
Which component are you using?: cluster-autoscaler
What version of the component are you using?: Helm chart 9.10.8, cluster-autoscaler v1.21.1
Component version:
What k8s version are you using (kubectl version)?:
v1.21
kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?: AWS EKS
What did you expect to happen?: The autoscaler loads the instance type list normally and keeps running.
What happened instead?: It keeps entering CrashLoopBackOff and exits with error code 255.
How to reproduce it (as minimally and precisely as possible):
Set the environment variable:
AWS_REGION: ap-northeast-3
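(For reference, a minimal sketch of the chart values this maps to; the keys cloudProvider, awsRegion and autoDiscovery.clusterName are assumed from the standard cluster-autoscaler chart, and the cluster name is a placeholder:)
cloudProvider: aws
awsRegion: ap-northeast-3        # assumed to be rendered into the AWS_REGION env var by the chart
autoDiscovery:
  clusterName: my-eks-cluster    # placeholder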
Anything else we need to know?:
Part of logs:
I1112 07:23:25.974866 1 main.go:391] Cluster Autoscaler 1.21.1
I1112 07:23:25.996783 1 leaderelection.go:243] attempting to acquire leader lease kube-system/cluster-autoscaler...
I1112 07:23:26.016572 1 leaderelection.go:253] successfully acquired lease kube-system/cluster-autoscaler
I1112 07:23:26.016842 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"04f7e024-313b-4cd3-9e47-1bd8ab89d128", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"14162", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' hub-c-a-aws-cluster-autoscaler-fdb7d96d4-b9rg9 became leader
I1112 07:23:26.019206 1 reflector.go:219] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188
I1112 07:23:26.019328 1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188
I1112 07:23:26.020108 1 reflector.go:219] Starting reflector *v1.DaemonSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:320
I1112 07:23:26.020220 1 reflector.go:255] Listing and watching *v1.DaemonSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:320
I1112 07:23:26.020557 1 reflector.go:219] Starting reflector *v1.ReplicationController (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329
I1112 07:23:26.020573 1 reflector.go:255] Listing and watching *v1.ReplicationController from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:329
I1112 07:23:26.020868 1 reflector.go:219] Starting reflector *v1.Job (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338
I1112 07:23:26.020883 1 reflector.go:255] Listing and watching *v1.Job from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:338
I1112 07:23:26.021148 1 reflector.go:219] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1112 07:23:26.021242 1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212
I1112 07:23:26.021155 1 reflector.go:219] Starting reflector *v1.ReplicaSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:347
I1112 07:23:26.021494 1 reflector.go:255] Listing and watching *v1.ReplicaSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:347
I1112 07:23:26.021216 1 reflector.go:219] Starting reflector *v1.StatefulSet (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356
I1112 07:23:26.021667 1 reflector.go:255] Listing and watching *v1.StatefulSet from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:356
I1112 07:23:26.021267 1 reflector.go:219] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I1112 07:23:26.021770 1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I1112 07:23:26.021279 1 reflector.go:219] Starting reflector *v1beta1.PodDisruptionBudget (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309
I1112 07:23:26.021938 1 reflector.go:255] Listing and watching *v1beta1.PodDisruptionBudget from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:309
I1112 07:23:26.021232 1 reflector.go:219] Starting reflector *v1.Node (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
I1112 07:23:26.022155 1 reflector.go:255] Listing and watching *v1.Node from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:246
W1112 07:23:26.040478 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
W1112 07:23:26.061120 1 warnings.go:70] policy/v1beta1 PodDisruptionBudget is deprecated in v1.21+, unavailable in v1.25+; use policy/v1 PodDisruptionBudget
I1112 07:23:26.067058 1 cloud_provider_builder.go:29] Building aws cloud provider.
F1112 07:23:26.067164 1 aws_cloud_provider.go:365] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list
goroutine 61 [running]:
k8s.io/klog/v2.stacks(0xc0000c2001, 0xc0009fe000, 0x8a, 0xee)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1021 +0xb8
k8s.io/klog/v2.(*loggingT).output(0x629d5a0, 0xc000000003, 0x0, 0x0, 0xc00004c230, 0x61ad5f1, 0x15, 0x16d, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:970 +0x1a3
k8s.io/klog/v2.(*loggingT).printf(0x629d5a0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x3e68953, 0x2d, 0xc001044900, 0x1, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:751 +0x18b
k8s.io/klog/v2.Fatalf(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1509
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws.BuildAWS(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go:365 +0x290
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.buildCloudProvider(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/builder_all.go:69 +0x18f
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.NewCloudProvider(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/cloud_provider_builder.go:45 +0x1e6
k8s.io/autoscaler/cluster-autoscaler/core.initializeDefaultOptions(0xc0010076e0, 0x4530301, 0x8)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler.go:101 +0x2fd
k8s.io/autoscaler/cluster-autoscaler/core.NewAutoscaler(0x3fe0000000000000, 0x3fe0000000000000, 0x8bb2c97000, 0x1176592e000, 0xa, 0x0, 0x4e200, 0x0, 0x186a0000000000, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler.go:65 +0x43
main.buildAutoscaler(0x972073, 0xc000634f50, 0x457dc20, 0xc00039d500)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:337 +0x368
main.run(0xc00007efa0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:343 +0x39
main.main.func2(0x453c8a0, 0xc0000c9b00)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:447 +0x2a
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:207 +0x113
goroutine 1 [select]:
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000e77c00, 0x44cea80, 0xc000311620, 0xc0000c9b01, 0xc000056c00)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:167 +0x13f
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0008bfc00, 0x77359400, 0x0, 0xc0000c9b01, 0xc000056c00)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90
k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew(0xc0001bf320, 0x453c8a0, 0xc0000c9b40)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:263 +0x107
k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run(0xc0001bf320, 0x453c8a0, 0xc0000c9b00)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:208 +0x13b
k8s.io/client-go/tools/leaderelection.RunOrDie(0x453c8e0, 0xc0000ae008, 0x4571bc0, 0xc00092eb40, 0x37e11d600, 0x2540be400, 0x77359400, 0xc00069d8e0, 0x3f40d28, 0x0, ...)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:222 +0x96
main.main()
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:438 +0x829
goroutine 18 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x629d5a0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/klog/v2/klog.go:418 +0xdd
goroutine 48 [runnable]:
sync.runtime_SemacquireMutex(0xc0000a1a44, 0xc000966c00, 0x1)
/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0000a1a40)
/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
sync.(*Map).Load(0xc0000a1a40, 0x339f5a0, 0xc000966d38, 0xc000c442f8, 0x5a9fc18f48a93701, 0x5a0000000040c8f4)
/usr/local/go/src/sync/map.go:106 +0x2c4
github.com/modern-go/reflect2.(*frozenConfig).Type2(0xc00009d180, 0x45acfa0, 0xc000e3a540, 0x3711f40, 0xc000966f00)
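As the title says, the crash stopped once the pod was configured with host networking. A minimal sketch of that pod-spec change, using standard Kubernetes fields (the exact chart settings used to wire this in are omitted):
spec:
  template:
    spec:
      hostNetwork: true                    # pod shares the node's network namespace
      dnsPolicy: ClusterFirstWithHostNet   # keeps cluster DNS working together with hostNetwork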
What’s the solution? I’m facing the same issue with EKS 1.24: the cluster is public, yet the CA times out while trying to access the public STS endpoint.
In my case, the cluster-autoscaler pod fails to reach the public AWS STS service endpoint via its public IP:
My EKS cluster is private, with a VPC interface endpoint for STS configured, like this:
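(Roughly, a generic CloudFormation-style sketch of such an STS interface endpoint; the region and all IDs below are placeholders, not the actual configuration:)
StsEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcEndpointType: Interface
    ServiceName: com.amazonaws.REGION.sts   # substitute the cluster's region
    PrivateDnsEnabled: true                 # so the default sts hostname resolves to the endpoint
    VpcId: vpc-0123456789abcdef0            # placeholder
    SubnetIds:
      - subnet-0123456789abcdef0            # placeholder
    SecurityGroupIds:
      - sg-0123456789abcdef0                # placeholder; must allow 443 from the nodes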
We have the same issue in the Ireland (eu-west-1) region.
Which component are you using?: cluster-autoscaler
What version of the component are you using?: cluster-autoscaler v1.21.1
Component version:
What k8s version are you using (kubectl version)?: v1.21
kubectl version Output
Client Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-13+d2965f0db10712", GitCommit:"d2965f0db1071203c6f5bc662c2827c71fc8b20d", GitTreeState:"clean", BuildDate:"2021-06-26T01:02:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
What happened instead?: It keeps entering CrashLoopBackOff:
kube-system pod/cluster-autoscaler-79475c6789-tnljd 0/1 CrashLoopBackOff 9
Logs:
W1129 1 aws_util.go:84] Error fetching https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-1/index.json skipping... Get "https://pricing.us-east-1.amazonaws.com/offers/v1.0/aws/AmazonEC2/current/eu-west-1/index.json": dial tcp: i/o timeout
F1129 aws_cloud_provider.go:365] Failed to generate AWS EC2 Instance Types: unable to load EC2 Instance Type list
goroutine 32
Troubleshooting: If I add --aws-use-static-instance-list=true to CA, it runs for a while,
kube-system pod/cluster-autoscaler-cc975695c-rwlzv 1/1 Running 2 5m3s
but crashes again later with this log:
E1129 17:59:44.241301 1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: RequestError: send request failed caused by: Post "https://autoscaling.eu-west-1.amazonaws.com/": dial tcp: i/o timeout
F1129 17:59:44.241348 1 aws_cloud_provider.go:389] Failed to create AWS Manager: cannot autodiscover ASGs: RequestError: send request failed caused by: Post "https://autoscaling.eu-west-1.amazonaws.com/": dial tcp: i/o timeout
goroutine 71 [running]:
Yeah, I think that is a reasonable change, although I’m not sure it solves the specific issue: in my case, falling back to that static list still resulted in fatal crashing, because it attempted to access resources outside the cluster elsewhere.
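For anyone trying the same workaround, here is a sketch of how the flag from the logs above can be passed, assuming the chart’s extraArgs map (rendered into --key=value container flags). Note that ASG autodiscovery still calls the regional autoscaling endpoint, so the pod still needs a route to it (NAT, proxy, or an autoscaling interface endpoint).
extraArgs:
  aws-use-static-instance-list: true   # skips fetching the pricing index over the internet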
What I might propose is an obvious check (since external access seems to be a requirement of the cluster autoscaler here; I’m not sure whether it is AWS-specific or not) that the pod the cluster autoscaler runs in can reach resources outside the cluster (e.g. the internet), and if it can’t, fail with an explicit message that is less cryptic than the ones noted above.
E.g.
… Hope that makes sense 😃
To add some more context: when I was attempting to debug the issue I had, seeing messages about ‘timeout’ left me unsure whether the context deadline was being hit because of latency, because the endpoint data was so big that it timed out anyway, or because the timeout was permission-related and the client kept retrying until the context deadline was exceeded. (It’s not a normal assumption that your thing in the cloud can’t reach the cloud 😃 )
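One way to get an explicit, early signal today (and to tell “no route at all” apart from “slow or oversized response”) is a user-side probe added as an init container. A rough sketch; the image, URL and timeout are arbitrary illustrative choices, not something the chart provides:
initContainers:
  - name: egress-preflight             # fails the pod early, with a readable message, if egress is broken
    image: curlimages/curl:8.5.0       # arbitrary image/tag choice
    command:
      - sh
      - -c
      - |
        # Probe one of the endpoints the autoscaler needs; the 10s timeout is arbitrary.
        curl -sS --max-time 10 -o /dev/null https://pricing.us-east-1.amazonaws.com/ \
          || { echo "cluster-autoscaler preflight: cannot reach AWS endpoints from this pod" >&2; exit 1; }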