karpenter-provider-aws: Message "slow event handlers blocking the queue"
Version
Karpenter Version: v0.27.2
Kubernetes Version: v1.24.10
Expected Behavior
na
Actual Behavior
We see a lot of messages like this: "DeltaFIFO Pop Process" ID:header-service/header-service-7c6995b78b-5zmnp,Depth:14,Reason:slow event handlers blocking the queue (18-Apr-2023 10:15:25.363). What should we do to fix this? Is this a bug?
Steps to Reproduce the Problem
Install karpenter in a cluster
Resource Specs and Logs
2023-04-18T10:19:52.170Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-100-65-111-115.eu-central-1.compute.internal/c5d.xlarge/spot {"commit": "d01ea11-dirty"}
2023-04-18T10:19:52.365Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
2023-04-18T10:19:55.415Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
I0418 10:20:12.465949 1 trace.go:219] Trace[231923973]: "DeltaFIFO Pop Process" ID:infra-ingress-auth-service/ingress-auth-service-59d4d96dbb-lsxxc,Depth:16,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.364) (total time: 101ms):
Trace[231923973]: [101.12991ms] [101.12991ms] END
I0418 10:20:12.764013 1 trace.go:219] Trace[922268623]: "DeltaFIFO Pop Process" ID:xxx-cms-dev/discover-frontend-public-7886477494-n6sxs,Depth:18,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.578) (total time: 185ms):
Trace[922268623]: [185.782794ms] [185.782794ms] END
I0418 10:20:13.165725 1 trace.go:219] Trace[474204851]: "DeltaFIFO Pop Process" ID:gitlab-ci-template-fix-tags/gitlab-ci-template-5655f4df7-2b5dm,Depth:17,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.764) (total time: 401ms):
Trace[474204851]: [401.631907ms] [401.631907ms] END
2023-04-18T10:20:21.669Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-100-65-214-196.eu-central-1.compute.internal/c5d.xlarge/spot {"commit": "d01ea11-dirty"}
2023-04-18T10:20:21.773Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}
2023-04-18T10:20:24.064Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
@dsouzajaison this could be a resource issue on the Karpenter deployment (mainly CPU). Try adding more CPU to your deployment's resource requests.
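For reference, a minimal sketch of what raising the controller's CPU request could look like via Helm values. The `controller.resources` key is assumed to match the karpenter Helm chart's layout for this version, and the numbers are purely illustrative, not a sizing recommendation:

```yaml
# Illustrative Helm values sketch: give the Karpenter controller more CPU
# so its event handlers are not starved under load.
# Assumes the chart exposes controller.resources; verify against the
# values.yaml of the chart version you are running.
controller:
  resources:
    requests:
      cpu: "2"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 2Gi
```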
Interesting data. I wouldn’t be surprised if this was a noisy-neighbor issue, given the node was under contention. It’s hard to know more without reproducing it myself. Will leave this open for some future performance work on our side.
@ellistarn No, not yet. I was considering adding CPU after checking CPU utilization, if that is being collected by our monitoring platform. I understand you want to see whether CPU is reaching its threshold to confirm the assumption that CPU is the bottleneck, am I right?
You’re right @FernandoMiguel. We’ve yet to post general guidance on pod count/node count and how that relates to cpu/memory usage. This is on our bucket list of things to do, but up to this point the cpu/memory requests/limits that users set have been very user-specific and anecdotal.