karpenter-provider-aws: Getting "check failed, expected resource XX didn't register on the node" after updating from v0.27.1 to v0.27.3

Version

Karpenter Version: v0.27.3 [Latest]

Kubernetes Version: v1.25.6-eks-48e63af

OS: BottleRocket

Expected Behavior

No errors appear after the update.

Actual Behavior

  • I started getting Karpenter logs about issues on one of the Karpenter-managed nodes, ip-10-26-6-187.us-west-2.compute.internal (shown last in the results below). These logs started appearing around 2 minutes after the update took effect.

  • I use ArgoCD to manage Karpenter.

  • The only state I caught this node in was the following:

ip-10-26-6-187.us-west-2.compute.internal    Ready,SchedulingDisabled   <none>   6d6h    v1.25.8-eks-c05fe32
  • It also shows that this node is 6d6h old; the node TTL is set to 7 days (see the sketch after the Provisioner spec below).

  • Examples of the errors I started getting are below:

"check failed, expected resource \"memory\" didn't register on the node"
"check failed, expected resource \"cpu\" didn't register on the node"
"check failed, expected resource \"pods\" didn't register on the node"
"check failed, expected resource \"ephemeral-storage\" didn't register on the node"
"check failed, can't drain node, PDB [PDB-name]is blocking evictions"
➜ k top node
NAME                                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
ip-x-x-0-183.us-west-2.compute.internal    1349m        69%         2977Mi          42%
ip-x-x-14-134.us-west-2.compute.internal   1577m        9%          15706Mi         55%
ip-x-x-14-68.us-west-2.compute.internal    252m         13%         2283Mi          32%
ip-x-x-15-185.us-west-2.compute.internal   1950m        101%        3484Mi          49%
ip-x-x-15-249.us-west-2.compute.internal   489m         3%          4320Mi          15%
ip-x-x-2-23.us-west-2.compute.internal     2352m        7%          15198Mi         25%
ip-x-x-2-241.us-west-2.compute.internal    316m         16%         2135Mi          30%
ip-x-x-20-110.us-west-2.compute.internal   266m         13%         1760Mi          24%
ip-x-x-20-28.us-west-2.compute.internal    227m         11%         2267Mi          32%
ip-x-x-22-80.us-west-2.compute.internal    1550m        80%         4668Mi          66%
ip-x-x-30-154.us-west-2.compute.internal   169m         8%          2037Mi          28%
ip-x-x-31-45.us-west-2.compute.internal    686m         4%          4868Mi          17%
ip-x-x-32-228.us-west-2.compute.internal   930m         5%          7894Mi          27%
ip-x-x-34-248.us-west-2.compute.internal   195m         10%         1938Mi          27%
ip-x-x-4-203.us-west-2.compute.internal    234m         12%         1978Mi          28%
ip-x-x-41-78.us-west-2.compute.internal    232m         12%         2059Mi          29%
ip-x-x-42-48.us-west-2.compute.internal    599m         3%          4132Mi          14%
ip-x-x-43-161.us-west-2.compute.internal   984m         6%          15269Mi         53%
ip-x-x-44-67.us-west-2.compute.internal    1333m        69%         2552Mi          36%
ip-x-x-46-170.us-west-2.compute.internal   175m         9%          2096Mi          29%
ip-x-x-6-187.us-west-2.compute.internal    <unknown>    <unknown>   <unknown>       <unknown>
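The <unknown> row above lines up with the consistency errors: the affected node apparently never registered its resources. On a healthy node they appear under status.allocatable, roughly like the fragment below (values are illustrative, not taken from any specific instance type):

status:
  allocatable:             # illustrative values only
    cpu: 7910m
    ephemeral-storage: "18Gi"
    memory: 14850Mi
    pods: "58"

Karpenter's consistency controller compares the resources expected from the instance type against what the kubelet registered, which appears to be the source of the "didn't register on the node" messages.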

Steps to Reproduce the Problem

Updating from v0.27.1 to v0.27.3

Resource Specs and Logs

In the logs below, an app reports issues and is unable to schedule, as shown in the ArgoCD UI; the Reason at the beginning of the line is Nominated. The logs also show that Karpenter took action at a certain time by consolidating and terminating the node. This may have occurred because one of my colleagues attempted to stop and restart the node from the AWS console, but I was unable to investigate thoroughly because the node was already being terminated by the time I became involved. We confirmed that the timestamps for restarting the node and for the issue resolving matched within a small difference, suggesting that stopping the node may have helped solve the issue.

[Screenshot: ArgoCD UI showing the scheduling errors described above]

Provisioner

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
  namespace: karpenter
spec:
  providerRef:
    name: bottlerocket-template
  labels:
    provisioner: default

  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values:
        - us-west-2a
        - us-west-2b
        - us-west-2c
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: "karpenter.sh/capacity-type" 
      operator: In
      values:
        - on-demand
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - c
        - m
        - r
    - key: karpenter.k8s.aws/instance-generation
      operator: Gt
      values:
        - '5'
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: In
      values:
        - '8'
        - '16'
        - '24'
        - '32'
        - '36'
    - key: "karpenter.k8s.aws/instance-size"
      operator: In
      values:
        - '4xlarge'
        - '6xlarge'
        - '8xlarge'

  consolidation:
    enabled: true
  weight: 50
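
Note that the 7-day node TTL mentioned above does not appear in this spec. On the v1alpha5 API it would be expressed as below (a sketch only, assuming expiration is configured through ttlSecondsUntilExpired):

spec:
  ttlSecondsUntilExpired: 604800   # 7 days * 86400 s; not present in the spec pasted above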

Logs are attached: KarpenterLogs.txt


About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 8
  • Comments: 19 (7 by maintainers)

Most upvoted comments

I am also facing the same issue:

controller.consistency  check failed, expected resource "memory" didn't register on the node    {"commit": "d7e22b1-dirty", "node": "ip-xxxxxxxx-.ap-southeast-1.compute.internal"}
controller.consistency  check failed, expected resource "cpu" didn't register on the node       {"commit": "d7e22b1-dirty", "node": "ip-xxxxxxxx-.ap-southeast-1.compute.internal"}
controller.consistency  check failed, expected resource "cpu" didn't register on the node       {"commit": "d7e22b1-dirty", "node": "ip-xxxxxxxx-.ap-southeast-1.compute.internal"}
controller.consistency  check failed, expected resource "ephemeral-storage" didn't register on the node {"commit": "d7e22b1-dirty", "node": "ip-xxxxxxxx-.ap-southeast-1.compute.internal"}

The node is also “NotReady”. In that case, Karpenter is still trying to schedule onto the same node instead of spinning up a new one.
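
For reference, a NotReady node is normally tainted by the node lifecycle controller, which should keep new pods off it. An illustrative fragment of the Node object in that state (there is also a NoExecute variant used for evictions):

spec:
  taints:
    - key: node.kubernetes.io/not-ready
      effect: NoSchedule           # added automatically while Ready=False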

We don’t monitor closed issues, so I nearly missed this. It sounds like the nodes aren’t joining the cluster; you can troubleshoot this with the instructions at https://karpenter.sh/docs/troubleshooting/#node-launchreadiness