kops: Problem Deploying Autoscaler with v1.5.1

Using Kops v1.5.0-beta2, if I deploy the Cluster Autoscaler as described here on AWS, it appears to fail. Here’s exactly what I ran:

CLOUD_PROVIDER=aws
IMAGE=gcr.io/google_containers/cluster-autoscaler:v0.4.0
MIN_NODES=3
MAX_NODES=24
AWS_REGION=us-east-1
GROUP_NAME="k8s-worker"
SSL_CERT_PATH="/etc/ssl/certs/ca-certificates.crt" # (/etc/ssl/certs for gce)

addon=cluster-autoscaler.yml
wget -O ${addon} https://raw.githubusercontent.com/kubernetes/kops/master/addons/cluster-autoscaler/v1.4.0.yaml

sed -i -e "s@{{CLOUD_PROVIDER}}@${CLOUD_PROVIDER}@g" "${addon}"
sed -i -e "s@{{IMAGE}}@${IMAGE}@g" "${addon}"
sed -i -e "s@{{MIN_NODES}}@${MIN_NODES}@g" "${addon}"
sed -i -e "s@{{MAX_NODES}}@${MAX_NODES}@g" "${addon}"
sed -i -e "s@{{GROUP_NAME}}@${GROUP_NAME}@g" "${addon}"
sed -i -e "s@{{AWS_REGION}}@${AWS_REGION}@g" "${addon}"
sed -i -e "s@{{SSL_CERT_PATH}}@${SSL_CERT_PATH}@g" "${addon}"

kubectl apply -f ${addon}
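
Before applying a manifest rendered this way, it can help to confirm that every placeholder was actually substituted. The following is only a sketch, assuming all placeholders follow the `{{UPPER_CASE}}` pattern used in the sed commands above; the function name `unresolved_placeholders` is hypothetical and not part of kops.

```shell
# Hypothetical helper: list any {{NAME}} placeholders still present in a file.
unresolved_placeholders() {
  # -n prints the line number, -o prints only the matching placeholder.
  # grep exits non-zero when nothing matches (or the file is unreadable),
  # so "|| true" keeps the function from failing in either case.
  grep -n -o -E '\{\{[A-Z_]+\}\}' "$1" || true
}
```

For example, `unresolved_placeholders "${addon}"` would print nothing if all seven substitutions took effect, and one `line:placeholder` entry for each one that did not.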

Here is the log from the pod itself:

2017-02-06T21:53:09.516651243Z I0206 21:53:09.516516       1 cluster_autoscaler.go:353] Cluster Autoscaler 0.4.0
2017-02-06T21:53:09.833039609Z E0206 21:53:09.832856       1 event.go:257] Could not construct reference to: '&api.Endpoints{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"cluster-autoscaler", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, Subsets:[]api.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' '%v became leader' 'cluster-autoscaler-362589257-qvjps'
2017-02-06T21:53:09.833083521Z I0206 21:53:09.832940       1 leaderelection.go:215] sucessfully acquired lease kube-system/cluster-autoscaler
2017-02-06T21:55:10.236022480Z E0206 21:55:10.235891       1 aws_manager.go:81] Error while regenerating Asg cache: RequestError: send request failed
2017-02-06T21:55:10.236074713Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T21:57:10.654286671Z W0206 21:57:10.654164       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-04dab8e1abd2eeadf}, error: RequestError: send request failed
2017-02-06T21:57:10.654433665Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T21:59:20.945923291Z W0206 21:59:20.945820       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-07cb35848b7303a8c}, error: RequestError: send request failed
2017-02-06T21:59:20.945959787Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:01:31.312714356Z W0206 22:01:31.312578       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-045dfa57161b3f669}, error: RequestError: send request failed
2017-02-06T22:01:31.312762096Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:03:41.648172881Z W0206 22:03:41.648044       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-0922d3fd192fa3708}, error: RequestError: send request failed
2017-02-06T22:03:41.648207274Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:05:51.955455693Z W0206 22:05:51.955355       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-0922d3fd192fa3708}, error: RequestError: send request failed
2017-02-06T22:05:51.955490454Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:08:02.268974568Z W0206 22:08:02.268861       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-0922d3fd192fa3708}, error: RequestError: send request failed
2017-02-06T22:08:02.269019618Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: lookup autoscaling.us-east-1.amazonaws.com on 100.64.0.10:53: dial udp 100.64.0.10:53: i/o timeout
2017-02-06T22:10:12.709737594Z W0206 22:10:12.709623       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-07cb35848b7303a8c}, error: RequestError: send request failed
2017-02-06T22:10:12.709795552Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:12:22.947293286Z W0206 22:12:22.947191       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-04dab8e1abd2eeadf}, error: RequestError: send request failed
2017-02-06T22:12:22.947324194Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T22:14:33.216621196Z W0206 22:14:33.216505       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-07cb35848b7303a8c}, error: RequestError: send request failed
2017-02-06T22:14:33.216661881Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 27 (10 by maintainers)

Most upvoted comments

I figured this out: no kube-dns pod was running on the master node. To get one running there, I had to add the master toleration to the kube-dns deployment (the same toleration as in the cluster-autoscaler deployment above). Once kube-dns was running on the master, the autoscaler was able to use it to get ASG info from AWS and scale up from 0 nodes.

The problem might also come and go, or not be triggered until there’s a scaling event.

In a cluster that had previously not reported any errors, I intentionally deployed an exorbitant number of replicas to trigger a scaling event, but the autoscaler failed while trying to scale up.

I0208 00:37:08.293007       1 scale_down.go:163] No candidates for scale down
I0208 00:37:18.540502       1 scale_down.go:163] No candidates for scale down
I0208 00:37:28.882205       1 scale_down.go:163] No candidates for scale down
I0208 00:37:39.139642       1 scale_down.go:163] No candidates for scale down
I0208 00:37:49.430254       1 scale_down.go:163] No candidates for scale down
I0208 00:37:59.891498       1 scale_down.go:163] No candidates for scale down
W0208 00:40:10.266085       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-091092f5959263783}, error: RequestError: send request failed
caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout

@pluttrell From a quick look, this seems to be a problem with your DNS.

@yissachar Thanks for the suggestion. I deleted the old deployment and recreated it, but this time using the name of my Nodes ASG, as follows:

GROUP_NAME="nodes.${NAME}"

But I still see the same problem:

2017-02-06T23:14:01.681430960Z I0206 23:14:01.681296       1 cluster_autoscaler.go:353] Cluster Autoscaler 0.4.0
2017-02-06T23:14:01.979278933Z I0206 23:14:01.979177       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:05.432683960Z I0206 23:14:05.432589       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:09.776690773Z I0206 23:14:09.776600       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:13.373972813Z I0206 23:14:13.373852       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:16.476140474Z I0206 23:14:16.476027       1 leaderelection.go:295] lock is held by cluster-autoscaler-362589257-en53e and has not yet expired
2017-02-06T23:14:19.620142570Z E0206 23:14:19.619954       1 event.go:257] Could not construct reference to: '&api.Endpoints{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"cluster-autoscaler", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, Subsets:[]api.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' '%v became leader' 'cluster-autoscaler-3428038825-s49yu'
2017-02-06T23:14:19.620201936Z I0206 23:14:19.620108       1 leaderelection.go:215] sucessfully acquired lease kube-system/cluster-autoscaler
2017-02-06T23:16:20.435300340Z E0206 23:16:20.435030       1 aws_manager.go:81] Error while regenerating Asg cache: RequestError: send request failed
2017-02-06T23:16:20.435344746Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout
2017-02-06T23:18:20.772820200Z W0206 23:18:20.772702       1 cluster_autoscaler.go:202] Cluster is not ready for autoscaling: Error while looking for ASG for instance {Name:i-06f447adf256f4e82}, error: RequestError: send request failed
2017-02-06T23:18:20.772863488Z caused by: Post https://autoscaling.us-east-1.amazonaws.com/: dial tcp: i/o timeout

In my case, the cause was specifying the AZ (us-east-1a) instead of the region (us-east-1). The URL in the error message showed my mistake, but I overlooked it.
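
One way to guard against this mistake is to normalize the value before substituting it into the manifest. This is a minimal sketch, assuming the usual AWS naming where an availability zone is the region name plus a single trailing letter; the function name `region_from_az` is hypothetical.

```shell
# Hypothetical helper: turn "us-east-1a" into "us-east-1".
# A value that is already a region (no trailing letter) is printed unchanged.
region_from_az() {
  printf '%s\n' "$1" | sed -E 's/^([a-z]+(-[a-z]+)+-[0-9]+)[a-z]$/\1/'
}
```

With that in place, `AWS_REGION=$(region_from_az "us-east-1a")` would yield `us-east-1`, which is the form the `autoscaling.<region>.amazonaws.com` endpoint in the logs expects.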

I get this error in my autoscaler log:

2017-06-28T17:15:18.034832321Z I0628 17:15:18.034555       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:20.110544607Z I0628 17:15:20.110304       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:22.116750413Z I0628 17:15:22.116519       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:24.123511893Z I0628 17:15:24.123260       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:26.210659327Z I0628 17:15:26.210364       1 leaderelection.go:204] succesfully renewed lease kube-system/cluster-autoscaler
2017-06-28T17:15:26.714239374Z E0628 17:15:26.713979       1 static_autoscaler.go:108] Failed to update node registry: RequestError: send request failed
2017-06-28T17:15:26.714267753Z caused by: Post https://autoscaling.ap-southeast-1a.amazonaws.com/: dial tcp: lookup autoscaling.ap-southeast-1a.amazonaws.com on 100.64.0.10:53: no such host

What does this error mean? It’s trying to connect to some unknown host, 100.64.0.10:53.
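
In the log above, the `lookup <host> on 100.64.0.10:53` part indicates a DNS query (port 53) sent to the resolver at 100.64.0.10; the name being resolved is `autoscaling.ap-southeast-1a.amazonaws.com`, which contains an AZ suffix like the one in the earlier comment. To make the endpoint easier to spot, the hosts can be pulled out of the error lines. This is only a sketch, assuming the `Post https://<host>/` format seen in these logs; the function name `endpoint_hosts` is hypothetical.

```shell
# Hypothetical helper: read log text on stdin and print each distinct
# host that appears in a "Post https://<host>/" error line.
endpoint_hosts() {
  sed -n -E 's#.*Post https://([^/ ]+)/.*#\1#p' | sort -u
}
```

It could be fed from `kubectl logs` for the autoscaler pod, for instance `kubectl logs -n kube-system <autoscaler-pod> | endpoint_hosts`, where `<autoscaler-pod>` stands for the actual pod name.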

Finally figured it out. Following the template at https://github.com/kubernetes/contrib/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#1-asg-setup-min-1-max-10-asg-name-k8s-worker-asg-1 works, but the template at https://github.com/kubernetes/kops/tree/master/addons/cluster-autoscaler doesn’t.

Correcting for indentation, the diff is as follows (the working version is ‘<’):

7c7
<     app: cluster-autoscaler
---
>     k8s-app: cluster-autoscaler
12c12
<       app: cluster-autoscaler
---
>       k8s-app: cluster-autoscaler
16c16,18
<         app: cluster-autoscaler
---
>         k8s-app: cluster-autoscaler
>       annotations:
>         scheduler.alpha.kubernetes.io/tolerations: '[{"key":"dedicated", "value":"master"}]'
19,20c21,22
<         - image: gcr.io/google_containers/cluster-autoscaler:v0.4.0
<           name: cluster-autoscaler
---
>         - name: cluster-autoscaler
>           image: gcr.io/google_containers/cluster-autoscaler:v0.4.0
30d31
<             - --v=4
32d32
<             - --skip-nodes-with-local-storage=false
41d40
<           imagePullPolicy: "Always"
45c44,46
<             path: "/etc/ssl/certs/ca-certificates.crt"
---
>             path: /etc/ssl/certs/ca-certificates.crt
>       nodeSelector:
>         kubernetes.io/role: master

I don’t know which of those is the crucial difference.