karpenter-provider-aws: Message "slow event handlers blocking the queue"

Version

Karpenter Version: v0.27.2

Kubernetes Version: v1.24.10

Expected Behavior

N/A

Actual Behavior

We see a lot of messages like this:

"DeltaFIFO Pop Process" ID:header-service/header-service-7c6995b78b-5zmnp,Depth:14,Reason:slow event handlers blocking the queue (18-Apr-2023 10:15:25.363)

What should we do to fix this problem? Is this a bug?

Steps to Reproduce the Problem

Install karpenter in a cluster

Resource Specs and Logs

2023-04-18T10:19:52.170Z    INFO    controller.deprovisioning    deprovisioning via consolidation delete, terminating 1 machines ip-100-65-111-115.eu-central-1.compute.internal/c5d.xlarge/spot    {"commit": "d01ea11-dirty"}
2023-04-18T10:19:52.365Z    INFO    controller.termination    cordoned node    {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
2023-04-18T10:19:55.415Z    INFO    controller.termination    deleted node    {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
I0418 10:20:12.465949       1 trace.go:219] Trace[231923973]: "DeltaFIFO Pop Process" ID:infra-ingress-auth-service/ingress-auth-service-59d4d96dbb-lsxxc,Depth:16,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.364) (total time: 101ms):
Trace[231923973]: [101.12991ms] [101.12991ms] END
I0418 10:20:12.764013       1 trace.go:219] Trace[922268623]: "DeltaFIFO Pop Process" ID:xxx-cms-dev/discover-frontend-public-7886477494-n6sxs,Depth:18,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.578) (total time: 185ms):
Trace[922268623]: [185.782794ms] [185.782794ms] END
I0418 10:20:13.165725       1 trace.go:219] Trace[474204851]: "DeltaFIFO Pop Process" ID:gitlab-ci-template-fix-tags/gitlab-ci-template-5655f4df7-2b5dm,Depth:17,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.764) (total time: 401ms):
Trace[474204851]: [401.631907ms] [401.631907ms] END
2023-04-18T10:20:21.669Z    INFO    controller.deprovisioning    deprovisioning via consolidation delete, terminating 1 machines ip-100-65-214-196.eu-central-1.compute.internal/c5d.xlarge/spot    {"commit": "d01ea11-dirty"}
2023-04-18T10:20:21.773Z    INFO    controller.termination    cordoned node    {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}
2023-04-18T10:20:24.064Z    INFO    controller.termination    deleted node    {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

About this issue

  • State: open
  • Created a year ago
  • Reactions: 4
  • Comments: 24 (13 by maintainers)

Most upvoted comments

@dsouzajaison this could be a resource issue on the Karpenter deployment (mainly CPU); try adding more CPU to your deployment's resource requests.
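For anyone following along, a minimal sketch of what that could look like when installing via Helm. The chart location, the karpenter release/namespace names, the controller.resources key, and the numbers are assumptions based on the reported v0.27.2 setup, not official guidance; check your chart version's values.yaml for the exact key.

# Assumption: Karpenter installed via the Helm chart into the "karpenter" namespace.
# Bump the controller's CPU request/limit; the values below are illustrative only.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --namespace karpenter \
  --version v0.27.2 \
  --reuse-values \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=2 \
  --set controller.resources.limits.memory=1Gi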

Interesting data. I wouldn’t be surprised if this was a noisy-neighbor issue, given the node was under contention. It’s hard to know more without reproducing it myself. We’ll leave this open to track future performance work on our side.

@ellistarn No, not yet. I was considering adding CPU after checking CPU utilization, if that is being collected by our monitoring platform. I understand that you want to see whether CPU is reaching its threshold, to confirm the assumption that CPU is the bottleneck, am I right?
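One quick way to spot-check this without a full monitoring platform (assuming metrics-server is installed, the chart's default labels, and the karpenter namespace; both are assumptions about this cluster):

# Point-in-time CPU/memory usage of the Karpenter controller pods
kubectl top pod -n karpenter -l app.kubernetes.io/name=karpenter

# Compare against what the Deployment currently requests/limits
kubectl get deploy karpenter -n karpenter \
  -o jsonpath='{.spec.template.spec.containers[*].resources}'

If usage is sitting at or near the CPU request under load, that would support the CPU-bottleneck theory above.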

If you look into the values.yaml, those resource requests/limits are now commented out.
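To confirm what the chart ships by default, something like this should work (chart location and version are assumptions based on the reported v0.27.2):

# Print the chart's default values and look at the resources section
helm show values oci://public.ecr.aws/karpenter/karpenter --version v0.27.2 | grep -A 6 "resources:"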

You’re right @FernandoMiguel. We’ve yet to post general guidance on pod count/node count and how that relates to CPU/memory usage. This is on our bucket list of things to do, but up to this point the CPU/memory requests/limits that users set are very user-specific and anecdotal.