autoscaler: nil pointer dereference while fetching node instances for aws asg

I am seeing this error:

I0703 17:25:41.236830       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size to 3
I0703 17:25:41.281175       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-misc-9c55878cc-vj7nr", UID:"91b85597-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074967", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281214       1 event.go:258] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"4e7022af-9c7f-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074894", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O size set to 3
I0703 17:25:41.281226       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"proxy-6678dc85f6-q2cdc", UID:"8fa7bd40-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074922", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281272       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-replication-dc7596ccf-6294t", UID:"930a91d7-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074999", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281303       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"staging", Name:"merckgroup-dj-salesforce-6fd96cfddd-zqlhj", UID:"9420fd2c-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5075027", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:41.281380       1 event.go:258] Event(v1.ObjectReference{Kind:"Pod", Namespace:"datascience", Name:"hub-5fb7c84ccb-85ns8", UID:"8ea23011-9db7-11e9-a1d4-06346ba69dd2", APIVersion:"v1", ResourceVersion:"5074887", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O 2->3 (max: 7)}]
I0703 17:25:43.503908       1 node_instances_cache.go:155] Start refreshing cloud provider node instances cache
I0703 17:25:43.503948       1 node_instances_cache.go:167] Refresh cloud provider node instances cache finished, refresh took 16.505µs
I0703 17:25:51.293772       1 static_autoscaler.go:187] Starting main loop
I0703 17:25:51.336472       1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [K3-EKS-mixedspotinstm52xlargeasgsubnet02af43b02922e710f-QR5SWRK0N1Q0 K3-EKS-mixedspotinstm52xlargeasgsubnet09df9044a965c5907-78SRQM8MXOR6 K3-EKS-mixedspotinstm52xlargeasgsubnet0d22e2495433092d1-7ZH6F3I2Y90L K3-EKS-mixedspotinstt32xlargeasgsubnet02af43b02922e710f-10SJB4NDVE9VX K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O K3-EKS-mixedspotinstt32xlargeasgsubnet0d22e2495433092d1-8EG7GH4K6BDQ K3-EKS-ondemandasgsubnet02af43b02922e710f-YFGY7WRUEUQ1]
I0703 17:25:51.394755       1 aws_manager.go:255] Refreshed ASG list, next refresh after 2019-07-03 17:26:51.394747295 +0000 UTC m=+788.208984766
E0703 17:25:51.455558       1 node_instances_cache.go:106] Failed to fetch cloud provider node instances for K3-EKS-mixedspotinstt32xlargeasgsubnet09df9044a965c5907-1F165IWS4480O, error <nil>
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x2222355]

goroutine 65 [running]:
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).updateReadinessStats(0xc001796160, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:576 +0x9a5
k8s.io/autoscaler/cluster-autoscaler/clusterstate.(*ClusterStateRegistry).UpdateNodes(0xc001796160, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/clusterstate/clusterstate.go:310 +0x227
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).updateClusterState(0xc000c28820, 0xc001ca2e40, 0x7, 0x8, 0xc000e32090, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0xc000e36cc0, 0x4)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:568 +0x94
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000c28820, 0xbf3f5727d1822665, 0xa9869e13b3, 0x4cb2be0, 0x0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:217 +0x5e6
main.run(0xc0000da000)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:331 +0x296
main.main.func2(0x2ff4fc0, 0xc0004be180)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:403 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:200 +0xec

I have been aggressively scaling up my workloads in this new kube cluster. I am using the autoscaler helm chart with these values:

autoDiscovery:
  clusterName: K3
image:
  repository: k8s.gcr.io/cluster-autoscaler
  tag: v1.15.0
rbac:
  create: true
# must override: https://github.com/kubernetes/autoscaler/issues/2139
sslCertHostPath: /etc/ssl/certs/ca-bundle.crt
podAnnotations:
  iam.amazonaws.com/role: k8s-autoscaler
extraArgs:
  v: 4
  stderrthreshold: info
  logtostderr: true
  # write-status-configmap: true
  # leader-elect: true
  skip-nodes-with-local-storage: "false"
  expander: most-pods
  # scale-down-enabled: true
  # balance-similar-node-groups: true
  # min-replica-count: 2
  scale-down-utilization-threshold: 0.75
  # scale-down-non-empty-candidates-count: 5
  # max-node-provision-time: 15m0s
  # scan-interval: 10s
  # scale-down-delay: 10m
  # scale-down-unneeded-time: 10m
  skip-nodes-with-system-pods: "false"
awsRegion: eu-central-1

and I am running autoscaler 1.15.0 (I know that's mixing versions with EKS 1.13.7, but it's the only way to get MixedInstancesPolicy ASGs to work).

I0703 17:25:52.680391       1 flags.go:52] FLAG: --address=":8085"
I0703 17:25:52.681144       1 flags.go:52] FLAG: --alsologtostderr="false"
I0703 17:25:52.681163       1 flags.go:52] FLAG: --balance-similar-node-groups="false"
I0703 17:25:52.681169       1 flags.go:52] FLAG: --cloud-config=""
I0703 17:25:52.681174       1 flags.go:52] FLAG: --cloud-provider="aws"
I0703 17:25:52.681187       1 flags.go:52] FLAG: --cloud-provider-gce-lb-src-cidrs="130.211.0.0/22,209.85.152.0/22,209.85.204.0/22,35.191.0.0/16"
I0703 17:25:52.681197       1 flags.go:52] FLAG: --cluster-name=""
I0703 17:25:52.681202       1 flags.go:52] FLAG: --cores-total="0:320000"
I0703 17:25:52.681211       1 flags.go:52] FLAG: --estimator="binpacking"
I0703 17:25:52.681217       1 flags.go:52] FLAG: --expander="most-pods"
I0703 17:25:52.681229       1 flags.go:52] FLAG: --expendable-pods-priority-cutoff="-10"
I0703 17:25:52.681234       1 flags.go:52] FLAG: --filter-out-schedulable-pods-uses-packing="true"
I0703 17:25:52.681240       1 flags.go:52] FLAG: --gpu-total="[]"
I0703 17:25:52.681246       1 flags.go:52] FLAG: --ignore-daemonsets-utilization="false"
I0703 17:25:52.681251       1 flags.go:52] FLAG: --ignore-mirror-pods-utilization="false"
I0703 17:25:52.681261       1 flags.go:52] FLAG: --ignore-taint="[]"
I0703 17:25:52.681270       1 flags.go:52] FLAG: --kubeconfig=""
I0703 17:25:52.681275       1 flags.go:52] FLAG: --kubernetes=""
I0703 17:25:52.681279       1 flags.go:52] FLAG: --leader-elect="true"
I0703 17:25:52.681288       1 flags.go:52] FLAG: --leader-elect-lease-duration="15s"
I0703 17:25:52.681297       1 flags.go:52] FLAG: --leader-elect-renew-deadline="10s"
I0703 17:25:52.681307       1 flags.go:52] FLAG: --leader-elect-resource-lock="endpoints"
I0703 17:25:52.681314       1 flags.go:52] FLAG: --leader-elect-retry-period="2s"
I0703 17:25:52.681319       1 flags.go:52] FLAG: --log-backtrace-at=":0"
I0703 17:25:52.681326       1 flags.go:52] FLAG: --log-dir=""
I0703 17:25:52.681331       1 flags.go:52] FLAG: --log-file=""
I0703 17:25:52.681342       1 flags.go:52] FLAG: --log-file-max-size="1800"
I0703 17:25:52.681347       1 flags.go:52] FLAG: --logtostderr="true"
I0703 17:25:52.681352       1 flags.go:52] FLAG: --max-autoprovisioned-node-group-count="15"
I0703 17:25:52.681357       1 flags.go:52] FLAG: --max-bulk-soft-taint-count="10"
I0703 17:25:52.681362       1 flags.go:52] FLAG: --max-bulk-soft-taint-time="3s"
I0703 17:25:52.681367       1 flags.go:52] FLAG: --max-empty-bulk-delete="10"
I0703 17:25:52.681377       1 flags.go:52] FLAG: --max-failing-time="15m0s"
I0703 17:25:52.681386       1 flags.go:52] FLAG: --max-graceful-termination-sec="600"
I0703 17:25:52.681391       1 flags.go:52] FLAG: --max-inactivity="10m0s"
I0703 17:25:52.681396       1 flags.go:52] FLAG: --max-node-provision-time="15m0s"
I0703 17:25:52.681401       1 flags.go:52] FLAG: --max-nodes-total="0"
I0703 17:25:52.681405       1 flags.go:52] FLAG: --max-total-unready-percentage="45"
I0703 17:25:52.681433       1 flags.go:52] FLAG: --memory-total="0:6400000"
I0703 17:25:52.681439       1 flags.go:52] FLAG: --min-replica-count="0"
I0703 17:25:52.681443       1 flags.go:52] FLAG: --namespace="kube-system"
I0703 17:25:52.681448       1 flags.go:52] FLAG: --new-pod-scale-up-delay="0s"
I0703 17:25:52.681453       1 flags.go:52] FLAG: --node-autoprovisioning-enabled="false"
I0703 17:25:52.681457       1 flags.go:52] FLAG: --node-deletion-delay-timeout="2m0s"
I0703 17:25:52.681466       1 flags.go:52] FLAG: --node-group-auto-discovery="[asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/K3]"
I0703 17:25:52.681479       1 flags.go:52] FLAG: --nodes="[]"
I0703 17:25:52.681484       1 flags.go:52] FLAG: --ok-total-unready-count="3"
I0703 17:25:52.681489       1 flags.go:52] FLAG: --regional="false"
I0703 17:25:52.681494       1 flags.go:52] FLAG: --scale-down-candidates-pool-min-count="50"
I0703 17:25:52.681608       1 flags.go:52] FLAG: --scale-down-candidates-pool-ratio="0.1"
I0703 17:25:52.681638       1 flags.go:52] FLAG: --scale-down-delay-after-add="10m0s"
I0703 17:25:52.681645       1 flags.go:52] FLAG: --scale-down-delay-after-delete="0s"
I0703 17:25:52.681649       1 flags.go:52] FLAG: --scale-down-delay-after-failure="3m0s"
I0703 17:25:52.681653       1 flags.go:52] FLAG: --scale-down-enabled="true"
I0703 17:25:52.681657       1 flags.go:52] FLAG: --scale-down-gpu-utilization-threshold="0.5"
I0703 17:25:52.681665       1 flags.go:52] FLAG: --scale-down-non-empty-candidates-count="30"
I0703 17:25:52.681669       1 flags.go:52] FLAG: --scale-down-unneeded-time="10m0s"
I0703 17:25:52.681672       1 flags.go:52] FLAG: --scale-down-unready-time="20m0s"
I0703 17:25:52.681676       1 flags.go:52] FLAG: --scale-down-utilization-threshold="0.75"
I0703 17:25:52.681679       1 flags.go:52] FLAG: --scan-interval="10s"
I0703 17:25:52.681684       1 flags.go:52] FLAG: --skip-headers="false"
I0703 17:25:52.681694       1 flags.go:52] FLAG: --skip-log-headers="false"
I0703 17:25:52.681699       1 flags.go:52] FLAG: --skip-nodes-with-local-storage="false"
I0703 17:25:52.681704       1 flags.go:52] FLAG: --skip-nodes-with-system-pods="false"
I0703 17:25:52.681709       1 flags.go:52] FLAG: --stderrthreshold="0"
I0703 17:25:52.681713       1 flags.go:52] FLAG: --test.bench=""
I0703 17:25:52.681717       1 flags.go:52] FLAG: --test.benchmem="false"
I0703 17:25:52.681731       1 flags.go:52] FLAG: --test.benchtime="1s"
I0703 17:25:52.681735       1 flags.go:52] FLAG: --test.blockprofile=""
I0703 17:25:52.681738       1 flags.go:52] FLAG: --test.blockprofilerate="1"
I0703 17:25:52.681742       1 flags.go:52] FLAG: --test.count="1"
I0703 17:25:52.681746       1 flags.go:52] FLAG: --test.coverprofile=""
I0703 17:25:52.681752       1 flags.go:52] FLAG: --test.cpu=""
I0703 17:25:52.681758       1 flags.go:52] FLAG: --test.cpuprofile=""
I0703 17:25:52.681762       1 flags.go:52] FLAG: --test.failfast="false"
I0703 17:25:52.681765       1 flags.go:52] FLAG: --test.list=""
I0703 17:25:52.681770       1 flags.go:52] FLAG: --test.memprofile=""
I0703 17:25:52.681776       1 flags.go:52] FLAG: --test.memprofilerate="0"
I0703 17:25:52.681786       1 flags.go:52] FLAG: --test.mutexprofile=""
I0703 17:25:52.681790       1 flags.go:52] FLAG: --test.mutexprofilefraction="1"
I0703 17:25:52.681796       1 flags.go:52] FLAG: --test.outputdir=""
I0703 17:25:52.681809       1 flags.go:52] FLAG: --test.parallel="8"
I0703 17:25:52.681812       1 flags.go:52] FLAG: --test.run=""
I0703 17:25:52.681816       1 flags.go:52] FLAG: --test.short="false"
I0703 17:25:52.681822       1 flags.go:52] FLAG: --test.testlogfile=""
I0703 17:25:52.681825       1 flags.go:52] FLAG: --test.timeout="0s"
I0703 17:25:52.681829       1 flags.go:52] FLAG: --test.trace=""
I0703 17:25:52.681832       1 flags.go:52] FLAG: --test.v="false"
I0703 17:25:52.681835       1 flags.go:52] FLAG: --unremovable-node-recheck-timeout="5m0s"
I0703 17:25:52.681839       1 flags.go:52] FLAG: --v="4"
I0703 17:25:52.681845       1 flags.go:52] FLAG: --vmodule=""
I0703 17:25:52.681849       1 flags.go:52] FLAG: --write-status-configmap="true"
I0703 17:25:52.681869       1 main.go:354] Cluster Autoscaler 1.15.0

Here is the CloudFormation for one of my Auto Scaling groups:

  mixedspotinstm52xlargeasgsubnet02af43b02922e710f:
    DependsOn:
      - K3Cluster
      - mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate
    Properties:
      MaxSize: 7
      MinSize: 0
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: lowest-price
          SpotInstancePools: 2
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref 'mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate'
            Version: '1'
          Overrides:
            - InstanceType: t3.2xlarge
            - InstanceType: m5.2xlarge
      Tags:
        - Key: Name
          PropagateAtLaunch: true
          Value: K3 Cluster Node
        - Key: kubernetes.io/cluster/K3
          PropagateAtLaunch: true
          Value: owned
        - Key: k8s.io/cluster-autoscaler/enabled
          PropagateAtLaunch: true
          Value: you know it
        - Key: k8s.io/cluster-autoscaler/K3
          PropagateAtLaunch: true
          Value: you know it
      VPCZoneIdentifier:
        - subnet-02af43b02922e710f
    Type: AWS::AutoScaling::AutoScalingGroup
    UpdatePolicy: {}
  mixedspotinstm52xlargesubnet02af43b02922e710fLaunchTemplate:
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
          - DeviceName: /dev/sda1
            Ebs:
              VolumeSize: 200
              VolumeType: gp2
        EbsOptimized: 'true'
        IamInstanceProfile:
          Arn: !GetAtt 'k3instanceprofile.Arn'
        ImageId: ami-02d5e7ca7bc498ef9
        InstanceType: t3.2xlarge
        SecurityGroupIds:
          - !GetAtt 'k3NodeSecurityGroup.GroupId'
        UserData: !Base64
          Fn::Sub: "#!/bin/bash\n      set -o xtrace\n      echo 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCo/nsu1N9kZ9oYC7bkuj4wz0CnmLhif+zv8Tuf7kPq/hzN4QBAgWT7XUMqG2sJlx+6SHS2zz1kQsklm9SAX+bNC25A2Iqe9J5TS8uiOdaZd2dKXQSeQsAt9TB1bqg2ZbjcJHol/qvfb1KkceMk1Kvhi7jimztbEwZaWyHRQRBUbl0AWnYUjwBo1RPaXu9mejkYSP8OOoYIjOhHAeL3pmJ+58dSLCN3kgXQuBfdb9Ap4R9YjxvVpXDmh6E1KmyeFXLq8Vm7GUePjHYPsie98oHArrvic7wsm3xAHg6IT0l0CRY53yb3gnVJZmUicEkmUj01xxpE2uI0H8kaezP89t/5'\
            \ >> /home/ec2-user/.ssh/authorized_keys\n      /etc/eks/bootstrap.sh\
            \ K3 --kubelet-extra-args '--node-labels=asg=mixedspotinstm52xlarge'\n\
            \      /opt/aws/bin/cfn-signal --exit-code $?                --stack ${AWS::StackName}\
            \                --resource mixedspotinstm52xlargeasgsubnet02af43b02922e710f\
            \                 --region ${AWS::Region}\n      "
      LaunchTemplateName: mixedspotinstm52xlargesubnet02af43b02922e710fkubenode
    Type: AWS::EC2::LaunchTemplate

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

This was indeed fixed, I’m looking at a long uptime!

cluster-autoscaler-aws-cluster-autoscaler-5f75d89dcf-mhp7v   1/1     Running   0          20h

Thanks so much, this is awesome!