karpenter: Karpenter stuck on "Waiting on Cluster Sync" with non-existent StorageClass

Description

Observed Behavior:

Hi everyone, I’ve deployed Karpenter using the corresponding Helm chart for version v0.31.0, but I’m running into a very strange problem: the main replica keeps logging “waiting on cluster sync” and basically doesn’t spawn any nodes. I’ve checked the aws-auth ConfigMap of my cluster and the role is correctly present, as are the Karpenter service account and the basic Karpenter ConfigMap. In the past months I’ve successfully installed Karpenter v0.27.3 in more than 15 clusters at my company using Helm chart v0.27.3, and we didn’t encounter any problems at all.

(The Machine CRD was added to the workflow as described in the docs, since it was introduced in v0.28.)
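
For reference, values along these lines are typically passed to the v0.31 chart (a minimal sketch; the role ARN, instance profile, and queue name are placeholders tied to the CloudFormation template further below):

serviceAccount:
  annotations:
    # placeholder controller role ARN created by the CloudFormation stack
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<RoleNamePrefix>-KarpenterCtrlRole
settings:
  aws:
    clusterName: mycluster                                              # matches the karpenter.sh/discovery tag used below
    defaultInstanceProfile: <RoleNamePrefix>-KarpenterInstanceProfile   # placeholder instance profile name
    interruptionQueueName: <QueueName>                                  # placeholder SQS queue name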

These are the logs

2023-10-27T17:01:47.700Z        DEBUG   controller.provisioner  waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:48.689Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:49.690Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:50.691Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:51.692Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:52.693Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:53.693Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:54.694Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:55.695Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:56.695Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:57.696Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:57.701Z        DEBUG   controller.provisioner  waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:58.697Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:59.697Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:00.698Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:01.699Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:02.699Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:03.700Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:04.700Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:05.701Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:06.701Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:07.702Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:07.702Z        DEBUG   controller.provisioner  waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:08.702Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:09.703Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:10.704Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:11.705Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:12.705Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:13.706Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:14.707Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:15.708Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:16.708Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:17.703Z        DEBUG   controller.provisioner  waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:17.709Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:18.710Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:19.711Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:20.712Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:21.712Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:22.713Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:23.714Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:24.715Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:25.715Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:26.716Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:27.704Z        DEBUG   controller.provisioner  waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:27.717Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:28.717Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:29.718Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:30.719Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:31.719Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:32.720Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:33.720Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:34.721Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:35.722Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:36.722Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:37.704Z        DEBUG   controller.provisioner  waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:37.723Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:38.724Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:39.724Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:40.725Z        DEBUG   controller.deprovisioning       waiting on cluster sync {"commit": "61b3e1e-dirty"}

Another thing I noticed is that Machine resources are not getting created at all after deploying my AWSNodeTemplate and Provisioner. I fear I’m missing some AWS-side resources, maybe ones introduced after v0.27.3, but from the docs it doesn’t seem anything AWS-related was added. The nodes are not even started (I can’t see them in the EC2 console).

Expected Behavior:

I should see nodes getting started in the AWS EC2 console first, and then I should see them getting attached to the cluster.

Reproduction Steps (Please include YAML):

apiVersion: v1
kind: Namespace
metadata:
  name: karpetenter-test-ns
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner-test
spec:
  consolidation:
    enabled: true
  labels:
    intent: apps
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
        - t
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
        - spot
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - nano
        - micro
        - small
        - medium
        - large
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  ttlSecondsUntilExpired: 2592000
  providerRef:
    name: node-template-test

---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: node-template-test
spec:
  securityGroupSelector:
    karpenter.sh/discovery: mycluster
  subnetSelector:
    karpenter.sh/discovery: mycluster
  tags:
    KarpenerProvisionerName: provisioner-test
    NodeType: karpenter-workshop
    IntentLabel: apps
    Gias_ID: test
    Env: test
    Name: test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: karpetenter-test-ns
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        intent: apps
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
              memory: 1.5Gi

This is the CloudFormation template used to create the Karpenter resources on AWS:

#Template
AWSTemplateFormatVersion: '2010-09-09'

Description: EKS infrastructure for Karpenter

Parameters:
  EksClusterName:
    Type: String
    Description: Name of existing EKS Cluster
  QueueName:
    Type: String
    Description: Name of SQS Queue to create where interruption events will be notified
  RoleNamePrefix:
    Type: String
  EventBridgeRulePrefix:
    Type: String
  OidcEndpoint:
    Type: String
    Description: EKS OIDC Endpoint without https://

Resources:
  NodeRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join [ "-", [!Ref RoleNamePrefix, "KarpenterNodeRole"] ]
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      Policies:
        - PolicyName: CWLogsPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:DescribeLogGroups
                  - logs:DescribeLogStreams
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - logs:PutRetentionPolicy
                Resource:
                    - "*"
        - PolicyName: ECRPullPushPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ecr:BatchCheckLayerAvailability
                  - ecr:GetDownloadUrlForLayer
                  - ecr:GetRepositoryPolicy
                  - ecr:DescribeRepositories
                  - ecr:ListImages
                  - ecr:DescribeImages
                  - ecr:BatchGetImage
                  - ecr:GetAuthorizationToken
                Resource:
                  - "*"
        - PolicyName: LambdaInVpcPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeAvailabilityZones
                  - ec2:DescribeNetworkInterfaceAttribute
                  - ec2:DescribeNetworkInterfaces
                  - ec2:DescribeSecurityGroups
                  - ec2:DescribeSubnets
                  - ec2:DescribeVpcAttribute
                  - elasticfilesystem:Describe*
                  - kms:ListAliases
                Resource:
                  - "*"
        - PolicyName: ClusterAutoscalerPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - autoscaling:DescribeAutoScalingGroups
                  - autoscaling:DescribeAutoScalingInstances
                  - autoscaling:DescribeLaunchConfigurations
                  - autoscaling:DescribeTags
                  - autoscaling:SetDesiredCapacity
                  - autoscaling:TerminateInstanceInAutoScalingGroup
                Resource:
                  - "*"                  
  NodeInstanceProfile:
    Type: 'AWS::IAM::InstanceProfile'
    Properties:
      Path: /
      InstanceProfileName:  !Join [ "-", [!Ref RoleNamePrefix, "KarpenterInstanceProfile"] ]
      Roles:
        - !Ref NodeRole
  ControllerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join [ "-", [!Ref RoleNamePrefix, "KarpenterCtrlRole"] ]
      AssumeRolePolicyDocument: !Sub
        - |
          {
              "Version": "2012-10-17",
              "Statement": [
                  {
                      "Effect": "Allow",
                      "Principal": {
                          "Federated": "arn:${AWS::Partition}:iam::${AWS::AccountId}:oidc-provider/${OidcEndpoint}"
                      },
                      "Action": "sts:AssumeRoleWithWebIdentity",
                      "Condition": {
                          "StringEquals": {
                              "${OidcEndpoint}:aud": "sts.amazonaws.com",
                              "${OidcEndpoint}:sub": "system:serviceaccount:karpenter:karpenter"
                          }
                      }
                  }
              ]
          }
        - OidcEndpoint: !Ref OidcEndpoint
      Path: /
      Policies:
        - PolicyName: karpenter-controller
          PolicyDocument: !Sub
            - |
              {
                  "Statement": [
                      {
                          "Action": [
                              "ssm:GetParameter",
                              "ec2:DescribeImages",
                              "ec2:RunInstances",
                              "ec2:DescribeSubnets",
                              "ec2:DescribeSecurityGroups",
                              "ec2:DescribeLaunchTemplates",
                              "ec2:DescribeInstances",
                              "ec2:DescribeInstanceTypes",
                              "ec2:DescribeInstanceTypeOfferings",
                              "ec2:DescribeAvailabilityZones",
                              "ec2:DeleteLaunchTemplate",
                              "ec2:CreateTags",
                              "ec2:CreateLaunchTemplate",
                              "ec2:CreateFleet",
                              "ec2:DescribeSpotPriceHistory",
                              "pricing:GetProducts"
                          ],
                          "Effect": "Allow",
                          "Resource": "*",
                          "Sid": "Karpenter"
                      },
                      {
                          "Action": "ec2:TerminateInstances",
                          "Condition": {
                              "StringLike": {
                                  "ec2:ResourceTag/karpenter.sh/provisioner-name": "*"
                              }
                          },
                          "Effect": "Allow",
                          "Resource": "*",
                          "Sid": "ConditionalEC2Termination"
                      },
                      {
                          "Effect": "Allow",
                          "Action": "iam:PassRole",
                          "Resource": "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${NodeRole}",
                          "Sid": "PassNodeIAMRole"
                      },
                      {
                          "Effect": "Allow",
                          "Action": "eks:DescribeCluster",
                          "Resource": "arn:${AWS::Partition}:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}",
                          "Sid": "EKSClusterEndpointLookup"
                      }
                  ],
                  "Version": "2012-10-17"
              }
            - ClusterName: !Ref EksClusterName
        - PolicyName: karpenter-interruption
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sqs:DeleteMessage"
                  - "sqs:GetQueueAttributes"
                  - "sqs:GetQueueUrl"
                  - "sqs:ReceiveMessage"
                Resource: !Sub
                  - "arn:${AWS::Partition}:sqs:${AWS::Region}:${AWS::AccountId}:${SqsQueueName}"
                  - SqsQueueName: !GetAtt [ SqsQueue, "QueueName" ]
  SqsQueue: 
    Type: AWS::SQS::Queue
    Properties: 
      QueueName: !Ref QueueName
      MessageRetentionPeriod: 300
      SqsManagedSseEnabled: true
  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Statement:
          - Action: 
              - "SQS:SendMessage" 
            Effect: "Allow"
            Resource: !Sub
              - "arn:${AWS::Partition}:sqs:${AWS::Region}:${AWS::AccountId}:${SqsQueueName}"
              - SqsQueueName: !GetAtt [ SqsQueue, "QueueName" ]
            Principal:  
              Service:
                - "events.amazonaws.com"
                - "sqs.amazonaws.com"

  HealthEventRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Health Event to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "HE" ] ]
      State: ENABLED
      EventPattern:
        detail:
          service:
          - EC2
        detail-type:
        - AWS Health Event
        source:
        - aws.health
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'

  SpotInterruptionRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Spot Interruption to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "SI" ] ]
      State: ENABLED
      EventPattern:
        detail-type:
        - EC2 Spot Instance Interruption Warning
        source:
        - aws.ec2
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'

  InstanceRebalanceRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Instance Rebalance to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "IR" ] ]
      State: ENABLED
      EventPattern:
        detail-type:
        - EC2 Instance Rebalance Recommendation
        source:
        - aws.ec2
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'

  InstanceStateChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Instance State Change to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "ISC" ] ]
      State: ENABLED
      EventPattern:
        detail:
          state:
          - stopping
          - terminated
          - shutting-down
          - stopped
        detail-type:
        - EC2 Instance State-change Notification
        source:
        - aws.ec2
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'

Outputs:
  NodeRoleARN:
    Value: !GetAtt [ NodeRole, "Arn" ]
  InstanceRoleProfileName:
    Value: !Ref NodeInstanceProfile
  ControllerRoleARN:
    Value: !GetAtt [ ControllerRole, "Arn" ]
  QueueName:
    Value: !GetAtt [ SqsQueue, "QueueName" ]
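
For reference, the NodeRole created above is the one the description says is present in the aws-auth ConfigMap; a typical mapRoles entry looks roughly like this (account ID and role name prefix are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # placeholder ARN for the Karpenter node role created by the stack
    - rolearn: arn:aws:iam::<ACCOUNT_ID>:role/<RoleNamePrefix>-KarpenterNodeRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes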

Versions:

  • Karpenter: v0.31.0
  • Chart Version: 0.16.3
  • Kubernetes Version (kubectl version): 1.25

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 18 (7 by maintainers)

Most upvoted comments

I want to make sure I understand the scenario here. If the storage class doesn’t exist the PVCs shouldn’t be able to provision PVs and the pods should fail to schedule. Or are the PVs being manually provisioned rather than via a storage class (I assume this might be the case given the name manual)? I’ve been able to recreate the error with manually provisioned PVs but haven’t seen Karpenter stuck “waiting on cluster sync”. If you’re still able to provide YAMLs for your resources and Karpenter logs that would be helpful in determining the exact cause.
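
(For illustration, a minimal sketch of the kind of manifest behind this scenario: a StatefulSet whose volumeClaimTemplate points at a StorageClass that does not exist in the cluster. All names here are hypothetical.)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro-sts
spec:
  serviceName: repro-sts
  replicas: 1
  selector:
    matchLabels:
      app: repro-sts
  template:
    metadata:
      labels:
        app: repro-sts
    spec:
      containers:
        - name: pause
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: manual   # this class does not exist in the cluster, triggering the error
        resources:
          requests:
            storage: 1Gi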

Hi @jonathan-innis, I will investigate the log history in depth for my clusters where Karpenter v0.27.3 is installed. I’ll let you know if the error log I encountered in the past is the same or different; I recall seeing similar StorageClass-related logs before v0.31.0.

In my case the issue was solved by creating the storage class “manual”.
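
(A minimal sketch of such a StorageClass, assuming the PVs are provisioned statically rather than through a dynamic provisioner:)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: manual
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning; PVs are created by hand
volumeBindingMode: WaitForFirstConsumer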

The question now is: is this behavior intended? I remember the same type of “Error” logs in previous versions, and we didn’t have problems then. @jonathan-innis

@samuel-esp

I’ve resolved the issue today. There was a configuration error on a StatefulSet that referenced the wrong StorageClass name, which caused the following error:

controller	Reconciler error	{"commit": "61b3e1e-dirty", "controller": "node_state", "controllerGroup": "", "controllerKind": "Node", "Node": {"name":"ip-10-15-17-111.ap-northeast-2.compute.internal"}, "namespace": "", "name": "ip-10-15-17-111.ap-northeast-2.compute.internal", "reconcileID": "44d1886d-d994-46dd-999c-9350370361bc", "error": "tracking volume usage, StorageClass.storage.k8s.io \"pypi-packages\" not found"}

After fixing the StorageClass error, Karpenter finally started working normally.

You should check whether there’s a configuration problem that makes Karpenter stop working.