karpenter-provider-aws: Newly added nodes stuck in NotReady state with error: "cni plugin not initialized"
Version
Karpenter: v0.6.3
Kubernetes: v1.20.11
Expected Behavior
New nodes reach the Ready state, and pods are successfully scheduled on them.
Actual Behavior
All nodes created by Karpenter are stuck in the NotReady state with this error in the node description: “KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized”. Also, the /etc/cni/net.d/ directory on those nodes is empty.
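For anyone debugging the same symptom, a minimal way to inspect the node condition and the VPC CNI (aws-node) pods is sketched below; the node and pod names are placeholders, not values from this report.

# Show the node conditions, including the NetworkPluginNotReady message
kubectl describe node <karpenter-node-name>

# aws-node (the AWS VPC CNI DaemonSet) is what writes the config into /etc/cni/net.d/ on each node
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl logs -n kube-system <aws-node-pod-on-the-affected-node>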
Steps to Reproduce the Problem
I’m trying to use Karpenter on an existing cluster. I followed the Getting Started guide as a template for what I needed.
Resource Specs and Logs
karpenter-sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: MY_ROLE_ARN
    meta.helm.sh/release-name: karpenter
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/version: 0.6.3
    helm.sh/chart: karpenter-0.6.3
  name: karpenter
  namespace: karpenter
secrets:
  - name: karpenter-token-lhr6j
provisioner.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: dev-spot
spec:
  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-east-1a"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5d.large", "m5a.large", "m5ad.large", "m5n.large",
               "m4.large", "r5.large", "r5d.large", "r5a.large", "r5ad.large",
               "r5dn.large", "c5.large", "c4.large", "r4.large", "c5d.large",
               "c5n.large", "m6g.xlarge", "m5a.xlarge", "m6a.xlarge", "m5.xlarge",
               "m6i.xlarge", "a1.2xlarge", "m6g.2xlarge", "m5zn.xlarge",
               "m5a.2xlarge", "m6a.2xlarge", "m6i.2xlarge", "m5.2xlarge"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
  provider:
    securityGroupSelector:
      Name: MY_WORKER_SG
    subnetSelector:
      kubernetes.io/cluster/dev: '*'
    tags:
      Environment: dev
      Terraformed: "false"
      Name: dev-spot-eks_karpenter
      vpc-name: dev
      kubernetes.io/cluster/dev: owned
    instanceProfile: MY_INSTANCE_PROFILE
  ttlSecondsAfterEmpty: 30
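As a sanity check (not part of the original report), the selectors above can be resolved by hand with the AWS CLI; the tag keys and values below are just the placeholders from this spec.

# Subnets matched by subnetSelector (any value of the kubernetes.io/cluster/dev tag)
aws ec2 describe-subnets --filters Name=tag-key,Values=kubernetes.io/cluster/dev

# Security groups matched by securityGroupSelector (Name tag)
aws ec2 describe-security-groups --filters Name=tag:Name,Values=MY_WORKER_SG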
I have the AmazonEKS_CNI_Policy, AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, and AmazonSSMManagedInstanceCore managed policies attached to my role. I also have the following custom policy attached to the role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:CreateLaunchTemplate",
        "ec2:CreateFleet",
        "ec2:RunInstances",
        "ec2:CreateTags",
        "iam:PassRole",
        "ec2:TerminateInstances",
        "ec2:DeleteLaunchTemplate",
        "ec2:DescribeLaunchTemplates",
        "ec2:DescribeInstances",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeSubnets",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstanceTypeOfferings",
        "ec2:DescribeAvailabilityZones",
        "ssm:GetParameter"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
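For completeness, the managed policies actually attached to a role can be listed with the AWS CLI; the role name below is a placeholder for whichever role the policies above live on.

aws iam list-attached-role-policies --role-name <role-name>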
I’ve added the following trusted entities to my role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EKSWorkerAssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::MY_ACC:MY_OIDC"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "MY_OIDC:aud": "sts.amazonaws.com",
          "MY_OIDC:sub": "system:serviceaccount:karpenter:karpenter"
        }
      }
    }
  ]
}
aws-auth-cm.yaml
apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: MY_ROLE_ARN
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
Karpenter’s nodes get stuck in the NotReady state, but nodes created by cluster-autoscaler (when I use it instead) become Ready and schedule pods without issues.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 16 (9 by maintainers)
I had a similar issue, and the cause was the VPC CNI policy missing from the IAM role attached to the EC2 nodes (the instance profile’s role).
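A minimal sketch of that fix with the AWS CLI, assuming the node role name is known (placeholder below); it attaches the managed AmazonEKS_CNI_Policy that the aws-node DaemonSet relies on when it doesn’t have its own IRSA role.

aws iam attach-role-policy \
  --role-name <node-instance-role-name> \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy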