karpenter-provider-aws: Message "slow event handlers blocking the queue"
Version
Karpenter Version: v0.27.2
Kubernetes Version: v1.24.10
Expected Behavior
na
Actual Behavior
We see a lot of messages like this: "DeltaFIFO Pop Process" ID:header-service/header-service-7c6995b78b-5zmnp,Depth:14,Reason:slow event handlers blocking the queue (18-Apr-2023 10:15:25.363). What should we do to fix this? Is this a bug?
Steps to Reproduce the Problem
Install karpenter in a cluster
Resource Specs and Logs
2023-04-18T10:19:52.170Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-100-65-111-115.eu-central-1.compute.internal/c5d.xlarge/spot {"commit": "d01ea11-dirty"}
2023-04-18T10:19:52.365Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
2023-04-18T10:19:55.415Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-100-65-111-115.eu-central-1.compute.internal"}
I0418 10:20:12.465949 1 trace.go:219] Trace[231923973]: "DeltaFIFO Pop Process" ID:infra-ingress-auth-service/ingress-auth-service-59d4d96dbb-lsxxc,Depth:16,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.364) (total time: 101ms):
Trace[231923973]: [101.12991ms] [101.12991ms] END
I0418 10:20:12.764013 1 trace.go:219] Trace[922268623]: "DeltaFIFO Pop Process" ID:xxx-cms-dev/discover-frontend-public-7886477494-n6sxs,Depth:18,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.578) (total time: 185ms):
Trace[922268623]: [185.782794ms] [185.782794ms] END
I0418 10:20:13.165725 1 trace.go:219] Trace[474204851]: "DeltaFIFO Pop Process" ID:gitlab-ci-template-fix-tags/gitlab-ci-template-5655f4df7-2b5dm,Depth:17,Reason:slow event handlers blocking the queue (18-Apr-2023 10:20:12.764) (total time: 401ms):
Trace[474204851]: [401.631907ms] [401.631907ms] END
2023-04-18T10:20:21.669Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-100-65-214-196.eu-central-1.compute.internal/c5d.xlarge/spot {"commit": "d01ea11-dirty"}
2023-04-18T10:20:21.773Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}
2023-04-18T10:20:24.064Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-100-65-214-196.eu-central-1.compute.internal"}
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
@dsouzajaison this could be a resource issue on the Karpenter deployment (mainly CPU). Try adding more CPU to your deployment's resource requests.
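For reference, a minimal sketch of what raising the controller's CPU request could look like via Helm values. The `controller.resources` key is assumed to match the karpenter Helm chart's layout for this version, and the numbers are purely illustrative, not a sizing recommendation:

```yaml
# Illustrative Helm values sketch: give the Karpenter controller more CPU
# so its event handlers are not starved under load.
# Assumes the chart exposes controller.resources; verify against the
# values.yaml of the chart version you are running.
controller:
  resources:
    requests:
      cpu: "2"
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 2Gi
```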
Interesting data. I wouldn’t be surprised if this was a noisy-neighbor issue, given the node was under contention. It’s hard to know more without reproducing it myself. Will leave this open for some future performance work on our side.
@ellistarn No, not yet. I was considering adding CPU after checking CPU utilization, if that is being collected by our monitoring platform. I understand you want to see whether CPU is reaching its threshold to confirm the assumption that CPU is the bottleneck, am I right?
You’re right @FernandoMiguel. We’ve yet to post general guidance on pod count/node count and how that relates to cpu/memory usage. This is on our bucket list of things to do, but up to this point the cpu/memory requests/limits that users set have been very user-specific and anecdotal.