autoscaler: nil pointer dereference while fetching node instances for AWS ASG
I am seeing this error:
I0703 17:25:41.236830 1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size to 3
I0703 17:25:41.281175 1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-misc-9c55878cc-vj7nr", UID:"91b85597-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074967", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281214 1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size set to 3
I0703 17:25:41.281226 1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"proxy-6678dc85f6-q2cdc", UID:"8fa7bd40-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074922", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281272 1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-replication-dc7596ccf-6294t", UID:"930a91d7-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074999", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281303 1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-salesforce-6fd96cfddd-zqlhj", UID:"9420fd2c-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5075027", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281380 1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"hub-5fb7c84ccb-85ns8", UID:"8ea23011-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074887", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:43.503908 1 node_instances_cache.go:155] Start refreshing cloud provider node instances cache
I0703 17:25:43.503948 1 node_instances_cache.go:167] Refresh cloud provider node instances cache finished, refresh took 16.505µs
I0703 17:25:51.293772 1 static_autoscaler.go:187] Starting main loop
I0703 17:25:51.336472 1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [K3-EKS-mixedspotinstm52xlargeasgsubnet02af43b02922e710f-QR5SWRK0N1Q0 K3-EKS-mixedspotinstm52xlargeasgsubnet09df9044a965c5907-78SRQM8MXOR6 K3-EKS-mixedspotinstm52xlargeasgsubnet0d22e2495433092d1-7ZH6F3I2Y90L K3-EKS-mixedspotinstt32xlargeasgsubnet02af43b02922e710f-10SJB4NDVE9VX K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O K3-EKS-mixedspotinstt32xlargeasgsubnet0d22e2495433092d1-8EG7GH4K6BDQ K3-EKS-ondemandasgsubnet02af43b02922e710f-YFGY7WRUEUQ1]
I0703 17:25:51.394755 1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-07-03 17:26:51.394747295 +0000 UTC m=+788.208984766
E0703 17:25:51.455558 1 node_instances_cache.go:106] Failed to fetch cloud provider node instances for K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O, error <nil>
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2222355]
goroutine 65 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc001796160, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc001796160, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc000c28820, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0xc000e36cc0, 0x4)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000c28820, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0000da000)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff4fc0, 0xc0004be180)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec
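My read of the crash: right before the panic the cache logged "Failed to fetch cloud provider node instances for K3-EKS-mixedspotinst... error <nil>", and updateReadinessStats then appears to dereference the missing or nil entry for that group. A minimal sketch of that failure mode (hypothetical types and names, not the autoscaler's actual code):

package main

import "fmt"

// instanceData is a stand-in for whatever per-group instance record the
// registry caches; the name is made up for this sketch.
type instanceData struct {
    instances []string
}

func main() {
    cache := map[string]*instanceData{
        "K3-EKS-ondemandasg":              {instances: []string{"i-0abc123"}},
        "K3-EKS-mixedspotinst-t3-2xlarge": nil, // the group whose fetch failed
    }

    for group, data := range cache {
        // When data is nil, data.instances panics with
        // "invalid memory address or nil pointer dereference",
        // matching the SIGSEGV in updateReadinessStats above.
        fmt.Println(group, len(data.instances))
    }
}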
I have been aggressively scaling up my workloads in this new Kubernetes cluster. I am using the cluster-autoscaler Helm chart with these values:
autoDiscovery:
  clusterName: K3
image:
  repository: k8s.gcr.io/cluster-autoscaler
  tag: v1.15.0
rbac:
  create: true
# must override: https://github.com/kubernetes/autoscaler/issues/2139
sslCertHostPath: /etc/ssl/certs/ca-bundle.crt
podAnnotations:
  iam.amazonaws.com/role: k8s-autoscaler
extraArgs:
  v: 4
  stderrthreshold: info
  logtostderr: true
  # write-status-configmap: true
  # leader-elect: true
  skip-nodes-with-local-storage: "false"
  expander: most-pods
  # scale-down-enabled: true
  # balance-similar-node-groups: true
  # min-replica-count: 2
  scale-down-utilization-threshold: 0.75
  # scale-down-non-empty-candidates-count: 5
  # max-node-provision-time: 15m0s
  # scan-interval: 10s
  # scale-down-delay: 10m
  # scale-down-unneeded-time: 10m
  skip-nodes-with-system-pods: "false"
awsRegion: eu-central-1
and I am running cluster-autoscaler 1.15.0 (I know that's mixing versions with EKS 1.13.7, but it's the only way to get MixedInstancesPolicy ASGs to work).
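For reference, this is how I understand the chart's extraArgs map becomes the CLI flags visible in the startup log below (a hedged sketch of the mapping, not the chart's actual template logic):

package main

import (
    "fmt"
    "sort"
)

// flagsFromExtraArgs renders a flat extraArgs map (as in the Helm values
// above) as --key=value flags. Hypothetical helper for illustration only.
func flagsFromExtraArgs(extra map[string]string) []string {
    keys := make([]string, 0, len(extra))
    for k := range extra {
        keys = append(keys, k)
    }
    sort.Strings(keys) // deterministic order for readability

    flags := make([]string, 0, len(keys))
    for _, k := range keys {
        flags = append(flags, fmt.Sprintf("--%s=%s", k, extra[k]))
    }
    return flags
}

func main() {
    fmt.Println(flagsFromExtraArgs(map[string]string{
        "v":                                "4",
        "expander":                         "most-pods",
        "scale-down-utilization-threshold": "0.75",
        "skip-nodes-with-system-pods":      "false",
    }))
    // [--expander=most-pods --scale-down-utilization-threshold=0.75 --skip-nodes-with-system-pods=false --v=4]
}

The full flag dump from the container startup: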
I0703 17:25:52.680391 1 flags.go:52] FLAG: --address=":8085"
I0703 17:25:52.681144 1 flags.go:52] FLAG: --alsologtostderr="false"
I0703 17:25:52.681163 1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I0703 17:25:52.681169 1 flags.go:52] FLAG: --cloud-config=""
I0703 17:25:52.681174 1 flags.go:52] FLAG: --cloud-provider="aws"
I0703 17:25:52.681187 1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0703 17:25:52.681197 1 flags.go:52] FLAG: --cluster-name=""
I0703 17:25:52.681202 1 flags.go:52] FLAG: --cores-total="0:320000"
I0703 17:25:52.681211 1 flags.go:52] FLAG: --estimator="binpacking"
I0703 17:25:52.681217 1 flags.go:52] FLAG: --expander="most-pods"
I0703 17:25:52.681229 1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0703 17:25:52.681234 1 flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
I0703 17:25:52.681240 1 flags.go:52] FLAG: --gpu-total="[]"
I0703 17:25:52.681246 1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0703 17:25:52.681251 1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0703 17:25:52.681261 1 flags.go:52] FLAG: --ignore-taint="[]"
I0703 17:25:52.681270 1 flags.go:52] FLAG: --kubeconfig=""
I0703 17:25:52.681275 1 flags.go:52] FLAG: --kubernetes=""
I0703 17:25:52.681279 1 flags.go:52] FLAG: --leader-elect="true"
I0703 17:25:52.681288 1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0703 17:25:52.681297 1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0703 17:25:52.681307 1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0703 17:25:52.681314 1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0703 17:25:52.681319 1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0703 17:25:52.681326 1 flags.go:52] FLAG: --log-dir=""
I0703 17:25:52.681331 1 flags.go:52] FLAG: --log-file=""
I0703 17:25:52.681342 1 flags.go:52] FLAG: --log-file-max-size="1800"
I0703 17:25:52.681347 1 flags.go:52] FLAG: --logtostderr="true"
I0703 17:25:52.681352 1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0703 17:25:52.681357 1 flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
I0703 17:25:52.681362 1 flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
I0703 17:25:52.681367 1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0703 17:25:52.681377 1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0703 17:25:52.681386 1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0703 17:25:52.681391 1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0703 17:25:52.681396 1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0703 17:25:52.681401 1 flags.go:52] FLAG: --max-nodes-total="0"
I0703 17:25:52.681405 1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0703 17:25:52.681433 1 flags.go:52] FLAG: --memory-total="0:6400000"
I0703 17:25:52.681439 1 flags.go:52] FLAG: --min-replica-count="0"
I0703 17:25:52.681443 1 flags.go:52] FLAG: --namespace="kube-system"
I0703 17:25:52.681448 1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0703 17:25:52.681453 1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0703 17:25:52.681457 1 flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
I0703 17:25:52.681466 1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/K3]"
I0703 17:25:52.681479 1 flags.go:52] FLAG: --nodes="[]"
I0703 17:25:52.681484 1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0703 17:25:52.681489 1 flags.go:52] FLAG: --regional="false"
I0703 17:25:52.681494 1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0703 17:25:52.681608 1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0703 17:25:52.681638 1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0703 17:25:52.681645 1 flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
I0703 17:25:52.681649 1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0703 17:25:52.681653 1 flags.go:52] FLAG: --scale-down-enabled="true"
I0703 17:25:52.681657 1 flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
I0703 17:25:52.681665 1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0703 17:25:52.681669 1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0703 17:25:52.681672 1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0703 17:25:52.681676 1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.75"
I0703 17:25:52.681679 1 flags.go:52] FLAG: --scan-interval="10s"
I0703 17:25:52.681684 1 flags.go:52] FLAG: --skip-headers="false"
I0703 17:25:52.681694 1 flags.go:52] FLAG: --skip-log-headers="false"
I0703 17:25:52.681699 1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0703 17:25:52.681704 1 flags.go:52] FLAG: --skip-nodes-with-system-pods="false"
I0703 17:25:52.681709 1 flags.go:52] FLAG: --stderrthreshold="0"
I0703 17:25:52.681713 1 flags.go:52] FLAG: --test.bench=""
I0703 17:25:52.681717 1 flags.go:52] FLAG: --test.benchmem="false"
I0703 17:25:52.681731 1 flags.go:52] FLAG: --test.benchtime="1s"
I0703 17:25:52.681735 1 flags.go:52] FLAG: --test.blockprofile=""
I0703 17:25:52.681738 1 flags.go:52] FLAG: --test.blockprofilerate="1"
I0703 17:25:52.681742 1 flags.go:52] FLAG: --test.count="1"
I0703 17:25:52.681746 1 flags.go:52] FLAG: --test.coverprofile=""
I0703 17:25:52.681752 1 flags.go:52] FLAG: --test.cpu=""
I0703 17:25:52.681758 1 flags.go:52] FLAG: --test.cpuprofile=""
I0703 17:25:52.681762 1 flags.go:52] FLAG: --test.failfast="false"
I0703 17:25:52.681765 1 flags.go:52] FLAG: --test.list=""
I0703 17:25:52.681770 1 flags.go:52] FLAG: --test.memprofile=""
I0703 17:25:52.681776 1 flags.go:52] FLAG: --test.memprofilerate="0"
I0703 17:25:52.681786 1 flags.go:52] FLAG: --test.mutexprofile=""
I0703 17:25:52.681790 1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I0703 17:25:52.681796 1 flags.go:52] FLAG: --test.outputdir=""
I0703 17:25:52.681809 1 flags.go:52] FLAG: --test.parallel="8"
I0703 17:25:52.681812 1 flags.go:52] FLAG: --test.run=""
I0703 17:25:52.681816 1 flags.go:52] FLAG: --test.short="false"
I0703 17:25:52.681822 1 flags.go:52] FLAG: --test.testlogfile=""
I0703 17:25:52.681825 1 flags.go:52] FLAG: --test.timeout="0s"
I0703 17:25:52.681829 1 flags.go:52] FLAG: --test.trace=""
I0703 17:25:52.681832 1 flags.go:52] FLAG: --test.v="false"
I0703 17:25:52.681835 1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0703 17:25:52.681839 1 flags.go:52] FLAG: --v="4"
I0703 17:25:52.681845 1 flags.go:52] FLAG: --vmodule=""
I0703 17:25:52.681849 1 flags.go:52] FLAG: --write-status-configmap="true"
I0703 17:25:52.681869 1 main.go:354] Cluster Autoscaler 1.15.0
Here is the CloudFormation definition for one of my Auto Scaling groups:
mixedspotinstm52xlargeasgsubnet02af43b02922e710f:
  DependsOn:
    - K3Cluster
    - mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate
  Properties:
    MaxSize: 7
    MinSize: 0
    MixedInstancesPolicy:
      InstancesDistribution:
        OnDemandBaseCapacity: 0
        OnDemandPercentageAboveBaseCapacity: 0
        SpotAllocationStrategy: lowest-price
        SpotInstancePools: 2
      LaunchTemplate:
        LaunchTemplateSpecification:
          LaunchTemplateId: !Ref 'mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate'
          Version: '1'
        Overrides:
          - InstanceType: t3.2xlarge
          - InstanceType: m5.2xlarge
    Tags:
      - Key: Name
        PropagateAtLaunch: true
        Value: K3 Cluster Node
      - Key: kubernetes.io/cluster/K3
        PropagateAtLaunch: true
        Value: owned
      - Key: k8s.io/cluster-autoscaler/enabled
        PropagateAtLaunch: true
        Value: you know it
      - Key: k8s.io/cluster-autoscaler/K3
        PropagateAtLaunch: true
        Value: you know it
    VPCZoneIdentifier:
      - subnet-02af43b02922e710f
  Type: AWS::AutoScaling::AutoScalingGroup
  UpdatePolicy: {}
mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate:
  Properties:
    LaunchTemplateData:
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 200
            VolumeType: gp2
      EbsOptimized: 'true'
      IamInstanceProfile:
        Arn: !GetAtt 'k3instanceprofile.Arn'
      ImageId: ami-02d5e7ca7bc498ef9
      InstanceType: t3.2xlarge
      SecurityGroupIds:
        - !GetAtt 'k3NodeSecurityGroup.GroupId'
      UserData: !Base64
        Fn::Sub: "#!/bin/bash\n set -o xtrace\n echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCo/nsu1N9kZ9oYC7bkuj4wz0CnmLhif+zv8Tuf7kPq/hzN4QBAgWT7XUMqG2sJlx+6SHS2zz1kQsklm9SAX+bNC25A2Iqe9J5TS8uiOdaZd2dKXQSeQsAt9TB1bqg2ZbjcJHol/qvfb1KkceMk1Kvhi7jimztbEwZaWyHRQRBUbl0AWnYUjwBo1RPaXu9mejkYSP8OOoYIjOhHAeL3pmJ+58dSLCN3kgXQuBfdb9Ap4R9YjxvVpXDmh6E1KmyeFXLq8Vm7GUePjHYPsie98oHArrvic7wsm3xAHg6IT0l0CRY53yb3gnVJZmUicEkmUj01xxpE2uI0H8kaezP89t/5'\
          \ >> /home/ec2-user/.ssh/authorized_keys\n /etc/eks/bootstrap.sh\
          \ K3 --kubelet-extra-args '--node-labels=asg=mixedspotinstm52xlarge'\n\
          \ /opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName}\
          \ --resource mixedspotinstm52xlargeasgsubnet02af43b02922e710f\
          \ --region ${AWS::Region}\n "
    LaunchTemplateName: mixedspotinstm52xlargesubnet02af43b02922e710fkubenode
  Type: AWS::EC2::LaunchTemplate
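For completeness, the tag-based auto-discovery I'm relying on (--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/K3 in the flags above) works, as I understand it, by picking up any ASG that carries every listed tag key, with the tag values ignored (which is why "you know it" still matches). A hypothetical sketch of that matching, not the AWS provider's actual code:

package main

import "fmt"

// matchesAutoDiscovery reports whether an ASG's tags contain every key
// listed in the auto-discovery spec. Illustrative helper only.
func matchesAutoDiscovery(asgTags map[string]string, requiredKeys []string) bool {
    for _, key := range requiredKeys {
        if _, ok := asgTags[key]; !ok {
            return false
        }
    }
    return true
}

func main() {
    tags := map[string]string{
        "kubernetes.io/cluster/K3":          "owned",
        "k8s.io/cluster-autoscaler/enabled": "you know it",
        "k8s.io/cluster-autoscaler/K3":      "you know it",
    }
    fmt.Println(matchesAutoDiscovery(tags, []string{
        "k8s.io/cluster-autoscaler/enabled",
        "kubernetes.io/cluster/K3",
    })) // true
}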
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 18 (11 by maintainers)
This was indeed fixed, I’m looking at a long uptime!
Thanks so much, this is awesome!