karpenter: Cordon node with a do-not-evict pod when ttlSecondsUntilExpired is met

Tell us about your request

Regarding the do-not-evict annotation, it currently prevents some nodes from being deprovisioned on our clusters.

I know that’s the goal and I’m fine with it, but would it be possible to at least cordon the node when ttlSecondsUntilExpired is met?

Tell us about the problem you’re trying to solve. What are you trying to do, and why is it hard?

We provision expensive nodes as a fallback for some workloads with a TTL of 3 hours, and we saw some still running with 6 hours of uptime. The annotation was preventing the drain/deletion because cronjobs kept being scheduled on them, but a manual cordon fixed it within minutes.
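For reference, the two settings interacting here look roughly like this (the resource and pod names are illustrative, not taken from the actual cluster):

```yaml
# Illustrative only: a Provisioner that should expire nodes after 3 hours,
# and a pod whose do-not-evict annotation blocks the drain.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: expensive-fallback          # hypothetical name
spec:
  ttlSecondsUntilExpired: 10800     # 3 hours
---
apiVersion: v1
kind: Pod
metadata:
  name: nightly-cronjob-pod         # hypothetical name
  annotations:
    karpenter.sh/do-not-evict: "true"   # blocks Karpenter from draining the node
spec:
  containers:
    - name: job
      image: busybox                # placeholder image
      command: ["sleep", "14400"]   # placeholder workload
```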

Are you currently working around this issue?

No

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 19
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Related to this, I think I’d like to see a counter metric for the amount of time that a node has exceeded its TTL. Whilst that’s climbing, there’s a problem. If it keeps climbing, there’s a bigger problem.

I’m not sure if I’d want a series per provisioner or per (provisioner,node name) tuple.

Overall, that lets me as an operator manually intervene to implement Forceful, so I don’t end up with a very stale node by accident.
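To make that concrete, here is a rough sketch of how such a metric might be consumed, assuming a hypothetical gauge named karpenter_node_seconds_past_expiry with a provisioner label (neither the metric nor the label exists in Karpenter today):

```yaml
# Hypothetical Prometheus alerting rule; karpenter_node_seconds_past_expiry
# is an assumed metric name, not something Karpenter currently exposes.
groups:
  - name: karpenter-expiry
    rules:
      - alert: KarpenterNodePastTTL
        # Aggregating by provisioner keeps cardinality low; adding a node
        # label would pinpoint the offending node at a higher cardinality cost.
        expr: max by (provisioner) (karpenter_node_seconds_past_expiry) > 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Nodes in provisioner {{ $labels.provisioner }} are more than an hour past their TTL"
```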

This is tricky, since there are a number of termination cases, and the desired behavior isn’t always clear. One of the ideas we’ve come to is thinking in terms of voluntary and involuntary disruptions.

Voluntary Disruptions:

  • Drift (AMI is out of date, requirements no longer match provisioner, etc)
  • Consolidation (emptiness, replacement, etc)

Involuntary Disruption:

  • Spot Interruption
  • Node Health
  • Expiration (this one is debatably voluntary)

Potential termination behaviors:

  • [Opportunistic] Don’t cordon the node until we think we can successfully drain all of its pods
  • [Eventual] Cordon the node, but don’t drain it until we’re sure we can drain all of them
  • [Graceful] Cordon the node, drain it, respect PDBs/do-not-evict
  • [Forceful] Cordon the node, drain it, don’t respect PDBs/do-not-evict

We want to “do the right thing” wherever possible, and have this be something that users simply don’t think about.

  • Is it possible for us to pick one behavior for each type of disruption?
  • Do users need to be able to configure different behaviors for different types of disruption?
  • If configurable, should it be per provisioner?

If I were to guess at the right set of defaults (a hypothetical per-provisioner encoding is sketched after this list), it might be something like:

  • Drift -> Opportunistic
  • Consolidation -> Opportunistic
  • Spot Interruption -> Forceful
  • Node Health -> Forceful
  • Expiration -> ???
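One purely hypothetical way to encode those defaults per provisioner; none of these behavior fields exist in the Provisioner API today:

```yaml
# Hypothetical sketch only: terminationBehavior is not a real Provisioner field.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example
spec:
  ttlSecondsUntilExpired: 10800
  terminationBehavior:              # hypothetical field
    drift: Opportunistic
    consolidation: Opportunistic
    spotInterruption: Forceful
    nodeHealth: Forceful
    expiration: Eventual            # one possible answer to the "???" above
```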

@jukie sorry for the unfortunate timing, as I just came off vacation. Just noticed you asked here, but we just had a WG an hour ago. You can check when and where these meetings are happening here: https://github.com/kubernetes-sigs/karpenter#community-discussion-contribution-and-support. You can message me or @jonathan-innis in the kubernetes slack as well if you want to discuss more.

Have there been any more discussions on this?

I think we should consider discussing this in the next working group. I’d like to see a design on this one, considering the impact this could have on the cluster, given that, if we’re not careful, there could be cases where all the nodes in the cluster get tainted at once.

As an example, if we have a set of nodes that all expire at once and we cordon them all at once, is that a reasonable stance to take, knowing that those nodes will eventually get disrupted? Probably, assuming that our disruption logic can handle the disruption quickly enough that having underutilization on those nodes isn’t a problem, but what if a node has a fully-blocking PDB? What if a node has a do-not-evict pod and that pod never goes away? Do we need some forceful termination that happens after a node has been cordoned with a blocking pod for some amount of time (sketched below)? There might be some interactions here that we should consider with #743
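If we did go that way, it might look like a hypothetical per-provisioner timeout (not an existing field) after which an expired, cordoned node is drained even past PDBs and do-not-evict pods:

```yaml
# Hypothetical sketch only: ttlSecondsAfterCordonForceDrain is not a real field.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: example
spec:
  ttlSecondsUntilExpired: 10800           # expire (and cordon) after 3 hours
  ttlSecondsAfterCordonForceDrain: 3600   # hypothetical: stop honoring PDBs/do-not-evict 1 hour later
```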