karpenter-provider-aws: Newly added nodes stuck in NotReady state with error: "cni plugin not initialized"

Version

Karpenter: v0.6.3

Kubernetes: v1.20.11

Expected Behavior

New nodes become Ready and pods are successfully scheduled on them.

Actual Behavior

All nodes created by Karpenter are stuck in the NotReady state with the following error in the node description: “KubeletNotReady runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized”. Also, there are no files in the nodes’ /etc/cni/net.d/ directory.
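
Since the files in /etc/cni/net.d/ are normally written by the aws-node (VPC CNI) pod once it starts on a node, a rough way to narrow this down is to check whether that pod is running on the affected node at all (a sketch, assuming the default AWS VPC CNI; <node-name> is a placeholder):

# Node conditions reported by kubelet
kubectl describe node <node-name>

# Is the aws-node (VPC CNI) DaemonSet pod scheduled and running on that node?
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# On the node itself (e.g. via SSM Session Manager), the CNI config dir should not be empty
ls -la /etc/cni/net.d/

If the aws-node pod is missing or crash-looping on the node, the CNI config never gets written and kubelet keeps reporting “cni plugin not initialized”.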

Steps to Reproduce the Problem

I’m trying to use Karpenter on an existing cluster. I used the “Getting Started” guide as a template for what I needed.

Resource Specs and Logs

karpenter-sa.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: MY_ROLE_ARN
    meta.helm.sh/release-name: karpenter
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/instance: karpenter
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: karpenter
    app.kubernetes.io/version: 0.6.3
    helm.sh/chart: karpenter-0.6.3
  name: karpenter
  namespace: karpenter
secrets:
- name: karpenter-token-lhr6j

provisioner.yaml

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: dev-spot
spec:
  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-east-1a"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large",
          "m5d.large",
          "m5a.large",
          "m5ad.large",
          "m5n.large",
          "m4.large",
          "r5.large",
          "r5d.large",
          "r5a.large",
          "r5ad.large",
          "r5dn.large",
          "c5.large",
          "c4.large",
          "r4.large",
          "c5d.large",
          "c5n.large",
          "m6g.xlarge",
          "m5a.xlarge",
          "m6a.xlarge",
          "m5.xlarge",
          "m6i.xlarge",
          "a1.2xlarge",
          "m6g.2xlarge",
          "m5zn.xlarge",
          "m5a.2xlarge",
          "m6a.2xlarge",
          "m6i.2xlarge",
          "m5.2xlarge"
          ]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
  provider:
    securityGroupSelector:
      Name: MY_WORKER_SG
    subnetSelector:
      kubernetes.io/cluster/dev: '*'
    tags:
      Environment: dev
      Terraformed: "false"
      Name: dev-spot-eks_karpenter
      vpc-name: dev
      kubernetes.io/cluster/dev: owned
    instanceProfile: MY_INSTANCE_PROFILE
  ttlSecondsAfterEmpty: 30

I have AmazonEKS_CNI_Policy, AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, and AmazonSSMManagedInstanceCore attached to my role. I also have a custom policy attached to the role:

{
  "Version": "2012-10-17",
  "Statement": [
      {
          "Action": [
              "ec2:CreateLaunchTemplate",
              "ec2:CreateFleet",
              "ec2:RunInstances",
              "ec2:CreateTags",
              "iam:PassRole",
              "ec2:TerminateInstances",
              "ec2:DeleteLaunchTemplate",
              "ec2:DescribeLaunchTemplates",
              "ec2:DescribeInstances",
              "ec2:DescribeSecurityGroups",
              "ec2:DescribeSubnets",
              "ec2:DescribeInstanceTypes",
              "ec2:DescribeInstanceTypeOfferings",
              "ec2:DescribeAvailabilityZones",
              "ssm:GetParameter"
          ],
          "Resource": "*",
          "Effect": "Allow"
      }
  ]
}
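
As a sanity check on the attachments listed above, something like the following can confirm what is actually on the role and which role the instance profile points at (a sketch; MY_NODE_ROLE_NAME is a placeholder for the role behind MY_INSTANCE_PROFILE):

# List the managed policies attached to the node role
aws iam list-attached-role-policies --role-name MY_NODE_ROLE_NAME

# Confirm which role MY_INSTANCE_PROFILE actually contains
aws iam get-instance-profile --instance-profile-name MY_INSTANCE_PROFILE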

I’ve added trusted entities to my role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EKSWorkerAssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::MY_ACC:MY_OIDC"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "MY_OIDC:aud": "sts.amazonaws.com",
                    "MY_OIDC:sub": "system:serviceaccount:karpenter:karpenter"
                }
            }
        }
    ]
}

aws-auth-cm.yaml

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: MY_ROLE_ARN
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system

Karpenter’s nodes are stuck in the NotReady state, but cluster-autoscaler’s nodes (when I use it instead) join the cluster and schedule pods fine.
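
Since cluster-autoscaler’s nodes come up fine with the same CNI, one difference worth checking is whether the Karpenter-launched instances actually end up with the same instance profile and role; a rough way to compare (the instance ID is a placeholder):

# Which instance profile did the Karpenter-launched instance get?
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query "Reservations[].Instances[].IamInstanceProfile.Arn"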

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 3
  • Comments: 16 (9 by maintainers)

Most upvoted comments

I had a similar issue, and the cause was a missing VPC CNI policy (AmazonEKS_CNI_Policy) on the EC2 node IAM role.
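
If the node role really is missing that policy, attaching it is a one-liner (a sketch; the role name is a placeholder). Note that in this report the role already had AmazonEKS_CNI_Policy attached, so this is only one of the possible causes of the symptom.

# Attach the AWS-managed VPC CNI policy to the node role
aws iam attach-role-policy \
  --role-name MY_NODE_ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy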