karpenter-provider-aws: Getting "check failed, expected resource XX didn't register on the node" after updating from v0.27.1 to v0.27.3
Version
Karpenter Version: v0.27.3 [Latest]
Kubernetes Version: v1.25.6-eks-48e63af
OS: BottleRocket
Expected Behavior
No errors after the update.
Actual Behavior
- Karpenter started logging errors about one of the nodes it manages, ip-10-26-6-187.us-west-2.compute.internal (shown last in the results below). These logs started appearing around 2 minutes after the update took effect.
- I use ArgoCD to manage Karpenter.
- The only state I caught this node in was the following:
ip-10-26-6-187.us-west-2.compute.internal Ready,SchedulingDisabled <none> 6d6h v1.25.8-eks-c05fe32
- The node is 6d6h old; NodeTTL is set to 7 days.
- Examples of the errors I started getting:
"check failed, expected resource \"memory\" didn't register on the node"
"check failed, expected resource \"cpu\" didn't register on the node"
"check failed, expected resource \"pods\" didn't register on the node"
"check failed, expected resource \"ephemeral-storage\" didn't register on the node"
"check failed, can't drain node, PDB [PDB-name]is blocking evictions"
➜ k top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
ip-x-x-0-183.us-west-2.compute.internal 1349m 69% 2977Mi 42%
ip-x-x-14-134.us-west-2.compute.internal 1577m 9% 15706Mi 55%
ip-x-x-14-68.us-west-2.compute.internal 252m 13% 2283Mi 32%
ip-x-x-15-185.us-west-2.compute.internal 1950m 101% 3484Mi 49%
ip-x-x-15-249.us-west-2.compute.internal 489m 3% 4320Mi 15%
ip-x-x-2-23.us-west-2.compute.internal 2352m 7% 15198Mi 25%
ip-x-x-2-241.us-west-2.compute.internal 316m 16% 2135Mi 30%
ip-x-x-20-110.us-west-2.compute.internal 266m 13% 1760Mi 24%
ip-x-x-20-28.us-west-2.compute.internal 227m 11% 2267Mi 32%
ip-x-x-22-80.us-west-2.compute.internal 1550m 80% 4668Mi 66%
ip-x-x-30-154.us-west-2.compute.internal 169m 8% 2037Mi 28%
ip-x-x-31-45.us-west-2.compute.internal 686m 4% 4868Mi 17%
ip-x-x-32-228.us-west-2.compute.internal 930m 5% 7894Mi 27%
ip-x-x-34-248.us-west-2.compute.internal 195m 10% 1938Mi 27%
ip-x-x-4-203.us-west-2.compute.internal 234m 12% 1978Mi 28%
ip-x-x-41-78.us-west-2.compute.internal 232m 12% 2059Mi 29%
ip-x-x-42-48.us-west-2.compute.internal 599m 3% 4132Mi 14%
ip-x-x-43-161.us-west-2.compute.internal 984m 6% 15269Mi 53%
ip-x-x-44-67.us-west-2.compute.internal 1333m 69% 2552Mi 36%
ip-x-x-46-170.us-west-2.compute.internal 175m 9% 2096Mi 29%
ip-x-x-6-187.us-west-2.compute.internal <unknown> <unknown> <unknown> <unknown>
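For anyone debugging the same symptom, a minimal sketch of checks for the registration and PDB errors above, assuming the affected node name from earlier (the PDB name comes from the Karpenter log line and is not shown here):

# Inspect what the kubelet has actually reported for the node; the
# "expected resource ... didn't register" checks relate to these fields.
kubectl get node ip-10-26-6-187.us-west-2.compute.internal \
  -o jsonpath='{.status.allocatable}{"\n"}{.status.capacity}{"\n"}'

# Check node conditions (Ready, MemoryPressure, etc.) and recent events.
kubectl describe node ip-10-26-6-187.us-west-2.compute.internal

# List PodDisruptionBudgets to find the one reported as blocking eviction.
kubectl get pdb --all-namespaces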
Steps to Reproduce the Problem
Update Karpenter from v0.27.1 to v0.27.3.
Resource Specs and Logs
In the logs below, an app reports issues and is unable to schedule, as seen in the ArgoCD UI; the Reason at the beginning of the line is Nominated. The logs also show that Karpenter took action at a certain point by consolidating and terminating the node. It is possible this happened because one of my colleagues attempted to stop and restart the node from the AWS console, but I was unable to investigate thoroughly because the node was already being terminated by the time I got involved. We confirmed that the timestamps for restarting the node and for the issue being resolved match within a small difference, which suggests that stopping the node may have helped resolve the issue.
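A rough sketch of how that termination timeline could be cross-checked against the node's own events, assuming the events have not yet expired and using the node name from above:

# Events recorded against the Node object (cordon, drain, termination, etc.),
# sorted so they can be compared with the Karpenter log timestamps.
kubectl get events --all-namespaces \
  --field-selector involvedObject.kind=Node,involvedObject.name=ip-10-26-6-187.us-west-2.compute.internal \
  --sort-by=.lastTimestamp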
Provisioner
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
  namespace: karpenter
spec:
  providerRef:
    name: bottlerocket-template
  labels:
    provisioner: default
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        - us-west-2a
        - us-west-2b
        - us-west-2c
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: "karpenter.sh/capacity-type"
      operator: In
      values:
        - on-demand
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values:
        - '5'
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values:
        - '8'
        - '16'
        - '24'
        - '32'
        - '36'
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values:
        - '4xlarge'
        - '6xlarge'
        - '8xlarge'
  consolidation:
    enabled: true
  weight: 50
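The 7-day NodeTTL mentioned above does not appear in this spec; if it is set on the Provisioner itself (an assumption, since it may be configured elsewhere), it would be expressed on the v1alpha5 API roughly like this:

spec:
  # 7 days in seconds; nodes are expired and replaced once they reach this age.
  ttlSecondsUntilExpired: 604800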
Logs are attached: KarpenterLogs.txt
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 8
- Comments: 19 (7 by maintainers)
I am also facing the same issue.
The node is also “NotReady”. In that case, Karpenter is still trying to schedule on the same node instead of spinning up a new node.
We don’t monitor closed issues, so I nearly missed this. It sounds like the nodes aren’t joining the cluster; you can troubleshoot this with the instructions at https://karpenter.sh/docs/troubleshooting/#node-launchreadiness
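A minimal sketch of first checks along those lines, assuming Karpenter runs in the karpenter namespace with the default Helm chart labels (both assumptions about this install):

# Did the launched instance ever register with the cluster?
kubectl get nodes -o wide

# If it registered but never became Ready, inspect its conditions and events.
kubectl describe node ip-10-26-6-187.us-west-2.compute.internal

# Karpenter controller logs around launch/registration; the label assumes chart defaults.
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=500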