karpenter: Karpenter stuck on "Waiting on Cluster Sync" with non-existent StorageClass
Description
Observed Behavior:
Hi everyone, I’ve deployed Karpenter using the corresponding Helm chart for version v0.31.0, but I’ve run into a very strange problem: the main replica keeps logging “waiting on cluster sync” and never provisions any nodes. I’ve checked the aws-auth ConfigMap of my cluster and the role is correctly present, as are the Karpenter service account and the basic Karpenter ConfigMap. In the past months I’ve successfully installed Karpenter v0.27.3 in more than 15 clusters at my company using Helm chart v0.27.3, and we didn’t encounter any problems at all.
(The Machine CRD was applied as described in the docs, since it was introduced in v0.28.)
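A quick way to confirm that CRD is actually installed (assuming the default CRD name for the v1alpha5 Machine resource):

kubectl get crd machines.karpenter.sh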
These are the logs:
2023-10-27T17:01:47.700Z DEBUG controller.provisioner waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:48.689Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:49.690Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:50.691Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:51.692Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:52.693Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:53.693Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:54.694Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:55.695Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:56.695Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:57.696Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:57.701Z DEBUG controller.provisioner waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:58.697Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:01:59.697Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:00.698Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:01.699Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:02.699Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:03.700Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:04.700Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:05.701Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:06.701Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:07.702Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:07.702Z DEBUG controller.provisioner waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:08.702Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:09.703Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:10.704Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:11.705Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:12.705Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:13.706Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:14.707Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:15.708Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:16.708Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:17.703Z DEBUG controller.provisioner waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:17.709Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:18.710Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:19.711Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:20.712Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:21.712Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:22.713Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:23.714Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:24.715Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:25.715Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:26.716Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:27.704Z DEBUG controller.provisioner waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:27.717Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:28.717Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:29.718Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:30.719Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:31.719Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:32.720Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:33.720Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:34.721Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:35.722Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:36.722Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:37.704Z DEBUG controller.provisioner waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:37.723Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:38.724Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:39.724Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
2023-10-27T17:02:40.725Z DEBUG controller.deprovisioning waiting on cluster sync {"commit": "61b3e1e-dirty"}
Another thing I noticed is that Machine resources are not getting created at all after deploying my AWSNodeTemplate and Provisioner. I fear I’m missing some AWS-related resources that were perhaps introduced after v0.27.3, but from the docs it doesn’t seem anything AWS-related was added. The nodes are not even started (I can’t see them in the EC2 console).
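For what it’s worth, these are the checks that should surface whatever the provisioner is stuck on (illustrative commands, assuming the v1alpha5 resource names and the default karpenter namespace):

kubectl get machines.karpenter.sh
kubectl describe provisioner provisioner-test
kubectl logs -n karpenter deploy/karpenter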
Expected Behavior:
I should see nodes being started in the AWS EC2 console first, and then I should see them join the cluster.
Reproduction Steps (Please include YAML):
apiVersion: v1
kind: Namespace
metadata:
  name: karpetenter-test-ns
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: provisioner-test
spec:
  consolidation:
    enabled: true
  labels:
    intent: apps
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
        - t
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
        - spot
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values:
        - nano
        - micro
        - small
        - medium
        - large
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
  ttlSecondsUntilExpired: 2592000
  providerRef:
    name: node-template-test
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: node-template-test
spec:
  securityGroupSelector:
    karpenter.sh/discovery: mycluster
  subnetSelector:
    karpenter.sh/discovery: mycluster
  tags:
    KarpenterProvisionerName: provisioner-test
    NodeType: karpenter-workshop
    IntentLabel: apps
    Gias_ID: test
    Env: test
    Name: test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
  namespace: karpetenter-test-ns
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      nodeSelector:
        intent: apps
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.2
          resources:
            requests:
              cpu: 1
              memory: 1.5Gi
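(After applying these manifests with kubectl apply -f, the inflate pods stay Pending, and watching with something like kubectl get machines.karpenter.sh -w never shows a Machine being created.)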
This is the CloudFormation template used to create the Karpenter resources on AWS:
# Template
AWSTemplateFormatVersion: '2010-09-09'
Description: EKS infrastructure for Karpenter
Parameters:
  EksClusterName:
    Type: String
    Description: Name of existing EKS Cluster
  QueueName:
    Type: String
    Description: Name of SQS Queue to create where interruption events will be notified
  RoleNamePrefix:
    Type: String
  EventBridgeRulePrefix:
    Type: String
  OidcEndpoint:
    Type: String
    Description: EKS OIDC Endpoint without https://
Resources:
  NodeRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join [ "-", [ !Ref RoleNamePrefix, "KarpenterNodeRole" ] ]
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      Policies:
        - PolicyName: CWLogsPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:DescribeLogGroups
                  - logs:DescribeLogStreams
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - logs:PutRetentionPolicy
                Resource:
                  - "*"
        - PolicyName: ECRPullPushPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ecr:BatchCheckLayerAvailability
                  - ecr:GetDownloadUrlForLayer
                  - ecr:GetRepositoryPolicy
                  - ecr:DescribeRepositories
                  - ecr:ListImages
                  - ecr:DescribeImages
                  - ecr:BatchGetImage
                  - ecr:GetAuthorizationToken
                Resource:
                  - "*"
        - PolicyName: LambdaInVpcPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeAvailabilityZones
                  - ec2:DescribeNetworkInterfaceAttribute
                  - ec2:DescribeNetworkInterfaces
                  - ec2:DescribeSecurityGroups
                  - ec2:DescribeSubnets
                  - ec2:DescribeVpcAttribute
                  - elasticfilesystem:Describe*
                  - kms:ListAliases
                Resource:
                  - "*"
        - PolicyName: ClusterAutoscalerPolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - autoscaling:DescribeAutoScalingGroups
                  - autoscaling:DescribeAutoScalingInstances
                  - autoscaling:DescribeLaunchConfigurations
                  - autoscaling:DescribeTags
                  - autoscaling:SetDesiredCapacity
                  - autoscaling:TerminateInstanceInAutoScalingGroup
                Resource:
                  - "*"
  NodeInstanceProfile:
    Type: 'AWS::IAM::InstanceProfile'
    Properties:
      Path: /
      InstanceProfileName: !Join [ "-", [ !Ref RoleNamePrefix, "KarpenterInstanceProfile" ] ]
      Roles:
        - !Ref NodeRole
  ControllerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join [ "-", [ !Ref RoleNamePrefix, "KarpenterCtrlRole" ] ]
      AssumeRolePolicyDocument: !Sub
        - |
          {
            "Version": "2012-10-17",
            "Statement": [
              {
                "Effect": "Allow",
                "Principal": {
                  "Federated": "arn:${AWS::Partition}:iam::${AWS::AccountId}:oidc-provider/${OidcEndpoint}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                  "StringEquals": {
                    "${OidcEndpoint}:aud": "sts.amazonaws.com",
                    "${OidcEndpoint}:sub": "system:serviceaccount:karpenter:karpenter"
                  }
                }
              }
            ]
          }
        - OidcEndpoint: !Ref OidcEndpoint
      Path: /
      Policies:
        - PolicyName: karpenter-controller
          PolicyDocument: !Sub
            - |
              {
                "Statement": [
                  {
                    "Action": [
                      "ssm:GetParameter",
                      "ec2:DescribeImages",
                      "ec2:RunInstances",
                      "ec2:DescribeSubnets",
                      "ec2:DescribeSecurityGroups",
                      "ec2:DescribeLaunchTemplates",
                      "ec2:DescribeInstances",
                      "ec2:DescribeInstanceTypes",
                      "ec2:DescribeInstanceTypeOfferings",
                      "ec2:DescribeAvailabilityZones",
                      "ec2:DeleteLaunchTemplate",
                      "ec2:CreateTags",
                      "ec2:CreateLaunchTemplate",
                      "ec2:CreateFleet",
                      "ec2:DescribeSpotPriceHistory",
                      "pricing:GetProducts"
                    ],
                    "Effect": "Allow",
                    "Resource": "*",
                    "Sid": "Karpenter"
                  },
                  {
                    "Action": "ec2:TerminateInstances",
                    "Condition": {
                      "StringLike": {
                        "ec2:ResourceTag/karpenter.sh/provisioner-name": "*"
                      }
                    },
                    "Effect": "Allow",
                    "Resource": "*",
                    "Sid": "ConditionalEC2Termination"
                  },
                  {
                    "Effect": "Allow",
                    "Action": "iam:PassRole",
                    "Resource": "arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${NodeRole}",
                    "Sid": "PassNodeIAMRole"
                  },
                  {
                    "Effect": "Allow",
                    "Action": "eks:DescribeCluster",
                    "Resource": "arn:${AWS::Partition}:eks:${AWS::Region}:${AWS::AccountId}:cluster/${ClusterName}",
                    "Sid": "EKSClusterEndpointLookup"
                  }
                ],
                "Version": "2012-10-17"
              }
            - ClusterName: !Ref EksClusterName
        - PolicyName: karpenter-interruption
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sqs:DeleteMessage"
                  - "sqs:GetQueueAttributes"
                  - "sqs:GetQueueUrl"
                  - "sqs:ReceiveMessage"
                Resource: !Sub
                  - "arn:${AWS::Partition}:sqs:${AWS::Region}:${AWS::AccountId}:${SqsQueueName}"
                  - SqsQueueName: !GetAtt [ SqsQueue, "QueueName" ]
  SqsQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Ref QueueName
      MessageRetentionPeriod: 300
      SqsManagedSseEnabled: true
  SqsQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref SqsQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:SendMessage"
            Effect: "Allow"
            Resource: !Sub
              - "arn:${AWS::Partition}:sqs:${AWS::Region}:${AWS::AccountId}:${SqsQueueName}"
              - SqsQueueName: !GetAtt [ SqsQueue, "QueueName" ]
            Principal:
              Service:
                - "events.amazonaws.com"
                - "sqs.amazonaws.com"
  HealthEventRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Health Event to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "HE" ] ]
      State: ENABLED
      EventPattern:
        detail:
          service:
            - EC2
        detail-type:
          - AWS Health Event
        source:
          - aws.health
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'
  SpotInterruptionRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Spot Interruption to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "SI" ] ]
      State: ENABLED
      EventPattern:
        detail-type:
          - EC2 Spot Instance Interruption Warning
        source:
          - aws.ec2
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'
  InstanceRebalanceRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Instance Rebalance to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "IR" ] ]
      State: ENABLED
      EventPattern:
        detail-type:
          - EC2 Instance Rebalance Recommendation
        source:
          - aws.ec2
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'
  InstanceStateChangeRule:
    Type: 'AWS::Events::Rule'
    Properties:
      Description: AWS Instance State Change to SQS Queue
      Name: !Join [ "-", [ !Ref EventBridgeRulePrefix, "ISC" ] ]
      State: ENABLED
      EventPattern:
        detail:
          state:
            - stopping
            - terminated
            - shutting-down
            - stopped
        detail-type:
          - EC2 Instance State-change Notification
        source:
          - aws.ec2
      Targets:
        - Arn: !GetAtt [ SqsQueue, "Arn" ]
          Id: 'KarpenterInterruptionQueueTarget'
Outputs:
  NodeRoleARN:
    Value: !GetAtt [ NodeRole, "Arn" ]
  InstanceRoleProfileName:
    Value: !Ref NodeInstanceProfile
  ControllerRoleARN:
    Value: !GetAtt [ ControllerRole, "Arn" ]
  QueueName:
    Value: !GetAtt [ SqsQueue, "QueueName" ]
Versions:
- Karpenter: v0.31.0
- Chart Version: 0.16.3
- Kubernetes Version (kubectl version): 1.25
I want to make sure I understand the scenario here. If the storage class doesn’t exist, the PVCs shouldn’t be able to provision PVs and the pods should fail to schedule. Or are the PVs being manually provisioned rather than via a storage class (I assume this might be the case, given the name “manual”)? I’ve been able to recreate the error with manually provisioned PVs but haven’t seen Karpenter stuck “waiting on cluster sync”. If you’re still able to provide YAMLs for your resources and Karpenter logs, that would be helpful in determining the exact cause.

Hi @jonathan-innis, I will investigate the log history in depth for my clusters where Karpenter v0.27.3 is installed. I’ll let you know if the error log I encountered in the past is the same or different; I recall encountering similar logs with StorageClasses before v0.31.0.
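For context, the static-provisioning pattern being discussed looks roughly like this: the PV and PVC are matched through the literal storageClassName string “manual”, even though no StorageClass object with that name exists. A minimal sketch (all names and the hostPath are hypothetical):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv
spec:
  storageClassName: manual        # no StorageClass object with this name exists
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /mnt/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: manual-pvc
spec:
  storageClassName: manual        # binds to the PV above by name matching only
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi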
The issue was solved by creating the storage class “manual”, in my case.

The question now is: is this behavior intended? I remember the same type of “error” logs in previous versions and we didn’t have problems. @jonathan-innis
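For reference, a minimal sketch of the missing StorageClass; the no-provisioner setting is an assumption that matches statically provisioned volumes (a real CSI driver such as ebs.csi.aws.com would be used instead for dynamic provisioning):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: manual
provisioner: kubernetes.io/no-provisioner   # assumption: static provisioning, no dynamic PV creation
volumeBindingMode: WaitForFirstConsumer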
@samuel-esp I resolved the issue today: there was a configuration error on a StatefulSet that pointed at a wrong StorageClass name, which caused the same error. After fixing the StorageClass reference, Karpenter finally went back to working normally. You should check whether there is a configuration problem that is making Karpenter stop working.
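For anyone hitting this, a quick cross-check along these lines should surface PVCs that reference a StorageClass that doesn’t exist (illustrative commands, adjust as needed):

kubectl get pvc --all-namespaces -o jsonpath='{range .items[*]}{.spec.storageClassName}{"\n"}{end}' | sort -u
kubectl get storageclass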