cluster-api-provider-aws: CrashLoopBackOff for capa-controller-manager on v0.6.8

/kind bug

What steps did you take and what happened: capa-controller-manager enters CrashLoopBackOff because of a SIGSEGV while reconciling the nodegroup (mettwo-0) of the management cluster (EKS):

I0820 12:16:50.190041       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="awsmachinepool" "worker count"=1
I0820 12:16:50.190253       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="awsfargateprofile" 
I0820 12:16:50.190289       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="awsfargateprofile" "worker count"=5
I0820 12:16:50.190356       1 generic_predicates.go:168] controllers/AWSFargateProfile "msg"="Resource is not paused, will attempt to map resource" "awsmanagedcontrolplane"="mettwo" "namespace"="default" "predicate"="createEvent" 
I0820 12:16:50.281004       1 awsmanagedmachinepool_controller.go:181] controllers/AWSManagedMachinePool "msg"="Reconciling AWSManagedMachinePool" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" 
I0820 12:16:50.281549       1 eks.go:86] controllers/AWSManagedMachinePool "msg"="Reconciling EKS nodegroup" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" 
I0820 12:16:50.281583       1 roles.go:155] controllers/AWSManagedMachinePool "msg"="Reconciling EKS Nodegroup IAM Role" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" 
I0820 12:16:50.434750       1 iam.go:210] controllers/AWSManagedMachinePool "msg"="Ensuring tags and AssumeRolePolicyDocument are set on role" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" 
I0820 12:16:50.435036       1 iam.go:123] controllers/AWSManagedMachinePool "msg"="Ensuring Polices are attached to role" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" 
I0820 12:16:50.479280       1 nodegroup.go:43] controllers/AWSManagedMachinePool "msg"="describing eks node group" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" "cluster"="mettwo" "nodegroup"="default_mettwo-0"
I0820 12:16:50.572883       1 nodegroup.go:451] controllers/AWSManagedMachinePool "msg"="Found owned EKS nodegroup in AWS" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" "cluster-name"="mettwo" "nodegroup-name"="default_mettwo-0"
I0820 12:16:50.665236       1 nodegroup.go:340] controllers/AWSManagedMachinePool "msg"="Creating taints update for node group" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" "name"="default_mettwo-0" "num_current"=0 "num_required"=0
I0820 12:16:50.665284       1 nodegroup.go:371] controllers/AWSManagedMachinePool "msg"="No updates required for node group taints" "AWSManagedControlPlane"="mettwo" "AWSManagedMachinePool"="mettwo-0" "Cluster"="mettwo" "MachinePool"="mettwo-0" "namespace"="default" "name"="default_mettwo-0"
E0820 12:16:50.665870       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 450 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x216f600, 0x3e8e210)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/runtime/runtime.go:48 +0x82
panic(0x216f600, 0x3e8e210)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/cluster-api-provider-aws/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroupConfig(0xc000811aa0, 0xc0004ae0f0, 0x0, 0x0)
	/workspace/pkg/cloud/services/eks/nodegroup.go:409 +0x400
sigs.k8s.io/cluster-api-provider-aws/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroup(0xc000811aa0, 0xc0008f9440, 0xc0006a36e8)
	/workspace/pkg/cloud/services/eks/nodegroup.go:473 +0x3d1
sigs.k8s.io/cluster-api-provider-aws/pkg/cloud/services/eks.(*NodegroupService).ReconcilePool(0xc000811aa0, 0xc000811aa0, 0xc000056070)
	/workspace/pkg/cloud/services/eks/eks.go:100 +0x186
sigs.k8s.io/cluster-api-provider-aws/exp/controllers.(*AWSManagedMachinePoolReconciler).reconcileNormal(0xc000758230, 0x2a4e3a0, 0xc000056068, 0xc0007cc380, 0xc000400680, 0xc00045ac00, 0xc0008f9440, 0xc0001ed200)
	/workspace/exp/controllers/awsmanagedmachinepool_controller.go:190 +0x1a2
sigs.k8s.io/cluster-api-provider-aws/exp/controllers.(*AWSManagedMachinePoolReconciler).Reconcile(0xc000758230, 0xc0007e67c8, 0x7, 0xc0007e67c0, 0x8, 0xc000a14e00, 0x0, 0x0, 0x0)
	/workspace/exp/controllers/awsmanagedmachinepool_controller.go:174 +0x933
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000b4420, 0x223efe0, 0xc0009edcc0, 0xc0004de500)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:255 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000b4420, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:231 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0000b4420)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:210 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000730230)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000730230, 0x3b9aca00, 0x0, 0x1, 0xc000088240)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000730230, 0x3b9aca00, 0xc000088240)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:192 +0x468
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1ddcbf0]

goroutine 450 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/runtime/runtime.go:55 +0x105
panic(0x216f600, 0x3e8e210)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/cluster-api-provider-aws/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroupConfig(0xc000811aa0, 0xc0004ae0f0, 0x0, 0x0)
	/workspace/pkg/cloud/services/eks/nodegroup.go:409 +0x400
sigs.k8s.io/cluster-api-provider-aws/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroup(0xc000811aa0, 0xc0008f9440, 0xc0006a36e8)
	/workspace/pkg/cloud/services/eks/nodegroup.go:473 +0x3d1
sigs.k8s.io/cluster-api-provider-aws/pkg/cloud/services/eks.(*NodegroupService).ReconcilePool(0xc000811aa0, 0xc000811aa0, 0xc000056070)
	/workspace/pkg/cloud/services/eks/eks.go:100 +0x186
sigs.k8s.io/cluster-api-provider-aws/exp/controllers.(*AWSManagedMachinePoolReconciler).reconcileNormal(0xc000758230, 0x2a4e3a0, 0xc000056068, 0xc0007cc380, 0xc000400680, 0xc00045ac00, 0xc0008f9440, 0xc0001ed200)
	/workspace/exp/controllers/awsmanagedmachinepool_controller.go:190 +0x1a2
sigs.k8s.io/cluster-api-provider-aws/exp/controllers.(*AWSManagedMachinePoolReconciler).Reconcile(0xc000758230, 0xc0007e67c8, 0x7, 0xc0007e67c0, 0x8, 0xc000a14e00, 0x0, 0x0, 0x0)
	/workspace/exp/controllers/awsmanagedmachinepool_controller.go:174 +0x933
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000b4420, 0x223efe0, 0xc0009edcc0, 0xc0004de500)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:255 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000b4420, 0x0)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:231 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc0000b4420)
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:210 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000730230)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000730230, 0x3b9aca00, 0x0, 0x1, 0xc000088240)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000730230, 0x3b9aca00, 0xc000088240)
	/go/pkg/mod/k8s.io/apimachinery@v0.17.17/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.14/pkg/internal/controller/controller.go:192 +0x468
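
Both traces terminate in reconcileNodegroupConfig (pkg/cloud/services/eks/nodegroup.go:409), so the crash looks like a nil pointer being dereferenced while the nodegroup update is built. Below is a minimal sketch of that failure pattern in Go; the type and field names are hypothetical and are not the provider's actual code:

package main

import "fmt"

// LaunchTemplateSpec and Nodegroup mimic the shape of an AWS SDK response:
// optional fields are pointers and may legitimately be nil when the API
// omits them.
type LaunchTemplateSpec struct {
	Version *string
}

type Nodegroup struct {
	LaunchTemplate *LaunchTemplateSpec
}

func main() {
	ng := &Nodegroup{} // LaunchTemplate left nil, as the API may return it

	// An unguarded dereference such as *ng.LaunchTemplate.Version would
	// panic with exactly the SIGSEGV shown in the trace above.

	// Guarded access avoids the panic.
	if ng.LaunchTemplate != nil && ng.LaunchTemplate.Version != nil {
		fmt.Println("launch template version:", *ng.LaunchTemplate.Version)
	} else {
		fmt.Println("nodegroup has no launch template version set")
	}
}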

What did you expect to happen: No SIGSEGV; the nodegroup should reconcile without crashing the controller.

Anything else you would like to add: The issue appeared after upgrading from v0.6.5 to v0.6.8, so I tried building a new management cluster with the latest v1alpha3 components, with the same result. The EXP_EKS feature flag must be set to true; I set the following flags:

export EXP_EKS=true
export EXP_EKS_IAM=true
export EXP_EKS_ADD_ROLES=true
export EXP_MACHINE_POOL=true
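
For reference, in the standard clusterctl workflow these variables are exported in the shell before the provider is initialized; a typical invocation (illustrative, not necessarily the exact steps used here):

# with the EXP_* variables above exported in the same shell
clusterctl init --infrastructure aws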

Environment:

  • Cluster-api-provider-aws version: v0.6.8
  • Kubernetes version: (use kubectl version): Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.13-eks-8df270", GitCommit:"8df2700a72a2598fa3a67c05126fa158fd839620", GitTreeState:"clean", BuildDate:"2021-07-31T01:36:57Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 23 (13 by maintainers)

Most upvoted comments

There was a change (#2407), added in 0.6.6, in this area of code… actually by me 🙄

I will have a PR for a fix shortly.

Fantastic, thanks @thof

Ok, let me share more info with you within a couple of minutes.

@richardcase I edited my previous comment with env vars.

Thanks @thof - I’m using them both now to see if I can re-create the issue. I’ll let you know how I get on…

Thanks, will use that to investigate.

@richardcase yes, I ran clusterawsadm bootstrap iam create-cloudformation-stack --config cluster-api/bootstrap-config.yaml:

apiVersion: bootstrap.aws.infrastructure.cluster.x-k8s.io/v1alpha1
kind: AWSIAMConfiguration
spec:
  eks:
    enable: true
    iamRoleCreation: true # Set to true if you plan to use the EKSEnableIAM feature flag to enable automatic creation of IAM roles
    defaultControlPlaneRole:
      disable: false # Set to false to enable creation of the default control plane role
    managedMachinePool:
      disable: false # Set to false to enable creation of the default node role for managed machine pools

EDIT: I also set the following environment variables:

export EXP_EKS=true
export EXP_EKS_IAM=true
export EXP_EKS_ADD_ROLES=true
export EXP_MACHINE_POOL=true

To be honest, I followed our internal guide to build the management cluster, so I’m not really sure if it’s valid…

Thanks. I will try this out.