karpenter-provider-aws: Karpenter provisioned nodes do not always terminate
Is an existing page relevant? https://karpenter.sh/v0.13.2/tasks/deprovisioning/
What karpenter features are relevant? Deprovisioning nodes.
We are currently using Kubernetes 1.22 and the latest version of Karpenter, which is 0.13.2. Our use case is that we use Kubernetes and Karpenter for our Jenkins agents. This allows us to scale out when Jenkins is under load and scale back in as demand falls off. For the most part the scaling works fine, but sometimes after a Jenkins job completes and its pod is removed, the node does not terminate. We have caught instances where a node was hung and kept running for months.
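For context, emptiness-based deprovisioning only happens when ttlSecondsAfterEmpty is set on the Provisioner. A minimal Go sketch of the relevant v1alpha5 fields (the value and the import path are illustrative assumptions on my part, not our actual configuration):

```go
package main

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Import path assumed for Karpenter v0.13.x; adjust if the module layout differs.
	"github.com/aws/karpenter/pkg/apis/provisioning/v1alpha5"
)

// jenkinsProvisioner sketches the field that drives emptiness-based
// deprovisioning. With TTLSecondsAfterEmpty set, Karpenter should terminate
// the node that many seconds after the last non-daemonset pod is gone.
func jenkinsProvisioner() *v1alpha5.Provisioner {
	ttl := int64(60) // illustrative value, not our real setting
	return &v1alpha5.Provisioner{
		ObjectMeta: metav1.ObjectMeta{Name: "jenkins-agents"},
		Spec: v1alpha5.ProvisionerSpec{
			TTLSecondsAfterEmpty: &ttl,
		},
	}
}
```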
We install some additional Helm charts on our cluster, such as fluent-bit, aws-efs-csi-driver, aws-load-balancer-controller, etc. These create pods that get scheduled onto the node provisioned for the Jenkins job, for various reasons: storage, networking, logs/metrics, and so on. After the Jenkins job pods are done, those supporting pods are still running on the node. So I initially thought that was the cause: since the node wasn't empty, Karpenter didn't expire it. However, Karpenter-provisioned nodes work correctly probably 80-90% of the time, and those nodes have the same pods scheduled onto them, which killed my theory that the additional running pods were preventing the node from expiring. Here is an example of the additional supporting pods that get scheduled onto the node:
Here is a real example of this. The node was created nearly a day ago. There are no Jenkins jobs running on it, so all jobs have completed. It only has supporting pods running for logs/metrics, storage, etc. But the node never terminated after the jobs completed. As mentioned above, I have seen nodes like this, with all the same supporting pods scheduled onto them, terminate just fine after the Jenkins job has completed.
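If I understand the emptiness check correctly, Karpenter should treat a node as empty once every remaining pod is either terminal or owned by a DaemonSet, so the supporting pods above shouldn't block expiry on their own. Here is a rough client-go sketch of that condition that I use as a mental model when spot-checking a stuck node (the helper names and the exact filtering are my assumptions, not Karpenter's actual code):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeLooksEmpty reports whether every pod still on the node is either
// terminal or owned by a DaemonSet -- roughly the condition under which
// I'd expect the emptiness TTL countdown to start.
func nodeLooksEmpty(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeName),
	})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			continue // completed Jenkins job pods don't count
		}
		if isOwnedByDaemonSet(&pod) {
			continue // fluent-bit, CSI driver pods, etc. scheduled by daemonsets
		}
		return false, nil // a live, non-daemonset pod keeps the node non-empty
	}
	return true, nil
}

func isOwnedByDaemonSet(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```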
How should the docs be improved? N/A
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 20 (9 by maintainers)
I’d add a log in
func (r *Emptiness) Reconcile(ctx context.Context, provisioner *v1alpha5.Provisioner, n *v1.Node) (reconcile.Result, error) {
everywhere it returns before adding the emptiness TTL.
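Roughly what I have in mind, as a sketch rather than the actual v0.13.2 source (the helper names, the kubeClient field, and the exact order of checks are placeholders; the logger is the knative one used elsewhere in the codebase):

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	"knative.dev/pkg/logging"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	"github.com/aws/karpenter/pkg/apis/provisioning/v1alpha5"
)

// Hypothetical paraphrase of the emptiness reconcile flow. nodeIsReady,
// workloadPodCount, setEmptinessTimestamp, and r.kubeClient are placeholders,
// not the actual v0.13.2 identifiers. The point is a Debugf at every early
// return so we can tell why a given node never gets the emptiness TTL.
func (r *Emptiness) Reconcile(ctx context.Context, provisioner *v1alpha5.Provisioner, n *v1.Node) (reconcile.Result, error) {
	logger := logging.FromContext(ctx).With("node", n.Name)

	if provisioner.Spec.TTLSecondsAfterEmpty == nil {
		logger.Debugf("not checking emptiness, ttlSecondsAfterEmpty is not set on provisioner %s", provisioner.Name)
		return reconcile.Result{}, nil
	}
	if !nodeIsReady(n) {
		logger.Debugf("not checking emptiness, node is not ready")
		return reconcile.Result{}, nil
	}

	// Count live pods that are not owned by a DaemonSet (placeholder helper).
	podCount, err := workloadPodCount(ctx, r.kubeClient, n)
	if err != nil {
		logger.Debugf("not checking emptiness, failed listing pods: %s", err)
		return reconcile.Result{}, err
	}
	if podCount > 0 {
		logger.Debugf("not marking node empty, %d workload pods still running", podCount)
		return reconcile.Result{}, nil
	}

	// Annotate the node so the TTL countdown can start (placeholder helper).
	setEmptinessTimestamp(n)
	logger.Debugf("marked node as empty, TTL of %ds begins", *provisioner.Spec.TTLSecondsAfterEmpty)
	return reconcile.Result{}, nil
}
```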