descheduler: deschedule pods that fail to start or restart too often

It is not uncommon for pods to get scheduled on nodes that are unable to start them. For example, a node may have network issues and be unable to mount a networked persistent volume, may fail to pull a container image, or may have a Docker configuration problem that only surfaces on container startup.

Another common issue is a container being restarted by its liveness probe because of some local node issue (e.g. a wrong routing table, slow storage, network latency, or packet drops). In that case, the pod is unhealthy most of the time and hangs in a restart loop forever without a chance of being migrated to another node.

As of now, there is no way to reschedule pods with faulty containers. It may be helpful to introduce two new strategies (a sketch of a matching policy follows the list):

  • container-restart-rate: reschedule a pod if it has been unhealthy for $notReadyPeriod seconds and one of its containers has been restarted $maxRestartCount times.
  • pod-startup-failure: reschedule a pod that was scheduled on a node but has been unable to start all of its containers within $maxStartupTime seconds.
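
To make the proposal concrete, a policy wiring up these strategies could look roughly like the sketch below; the strategy and parameter names simply mirror the proposal above and are purely illustrative, as none of this exists in the descheduler yet.

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "container-restart-rate":        # hypothetical strategy from this proposal
    enabled: true
    params:
      notReadyPeriodSeconds: 600   # pod has been unhealthy at least this long
      maxRestartCount: 5           # and one of its containers restarted this often
  "pod-startup-failure":           # hypothetical strategy from this proposal
    enabled: true
    params:
      maxStartupTimeSeconds: 300   # all containers must have started within this window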

A similar issue is filed against Kubernetes: https://github.com/kubernetes/kubernetes/issues/13385

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 44 (27 by maintainers)

Most upvoted comments

@seanmalloy @ingvagabund @kabakaev Does anyone plan to work on this? If not, I’d love to help with the feature.

@lixiang233 this feature enhancement is all yours. Thanks!

@ingvagabund I think the use case is to deschedule pods that are Pending for a short period of time.

Yeah, with such a short period of time, it makes sense to limit the phase. Though maybe not to every phase. Pending is the first phase once a pod is accepted. I can’t find any field in a pod’s status saying when the pod transitioned into a given phase. Also, other phases (Failed, Succeeded) are completely ignored, which leaves only Running and Unknown, and Running is the default one in most cases. A podStatusPhase field is fine, though I would limit it to just Pending and Running right now.

I’d imagine an extra descheduler policy which evicts a pod with status.phase != Running for more than a configured period since metadata.creationTimestamp.

@kabakaev thanks for the info. How about using the PodLifeTime strategy? We would need to add an additional strategy parameter to handle status.phase != Running.

Maybe something like this …

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 300
      podStatusPhase:
      - "Pending"
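
To make the matching concrete, below is a trimmed, hypothetical status of a pod stuck on an image-pull failure. A policy like the one above would consider it evictable once maxPodLifeTimeSeconds have elapsed since metadata.creationTimestamp while status.phase is still Pending (the pod name and timestamp are made up).

apiVersion: v1
kind: Pod
metadata:
  name: stuck-pod                             # hypothetical pod
  creationTimestamp: "2020-07-01T10:00:00Z"   # pod lifetime is measured from here
status:
  phase: Pending                              # matched by podStatusPhase above
  containerStatuses:
  - name: app
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: ImagePullBackOff              # node cannot pull the image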

@damemi @ingvagabund @lixiang233 please add any additional ideas you have. Thanks!

Seems like a reasonable ask. @kabakaev I am planning to defer this to the 0.6 release or later. Hope you are ok with that.