kubernetes: CrashLoopBackOff Pod cannot be evicted because of PodDisruptionBudget

What happened:

My StatefulSet created 3 Pods. They were all scheduled onto the same node and went into CrashLoopBackOff state for some reason. A PodDisruptionBudget was set for them with the default 25% maxUnavailable. I created a new Node and ran kubectl drain --force=true against the old Node where the 3 CrashLoopBackOff Pods were running. None of the Pods in CrashLoopBackOff state could be evicted because of the PodDisruptionBudget.

What you expected to happen:

Pods in CrashLoopBackOff are effectively considered dead, so I expect that they can be evicted and the scheduler creates new Pods on another Node.

How to reproduce it (as minimally and precisely as possible):

  1. Create a k8s cluster with 1 worker.
  2. Deploy a StatefulSet with replica count 3 and make it go into CrashLoopBackOff. Set a PodDisruptionBudget for it with 25% maxUnavailable.
  3. Create a new worker Node.
  4. Run kubectl drain --force=true against the old Node.
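For illustration, a minimal pair of manifests along these lines (names and image are hypothetical; the container command exits immediately, which drives the Pods into CrashLoopBackOff; policy/v1beta1 matches the 1.13 era):

```yaml
# Hypothetical reproduction manifests, not taken from the report.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: crashing
spec:
  serviceName: crashing
  replicas: 3
  selector:
    matchLabels:
      app: crashing
  template:
    metadata:
      labels:
        app: crashing
    spec:
      containers:
      - name: main
        image: busybox
        command: ["false"]   # exits non-zero -> restartPolicy Always -> CrashLoopBackOff
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: crashing
spec:
  maxUnavailable: 25%
  selector:
    matchLabels:
      app: crashing
```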

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.13.1
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 17
  • Comments: 62 (39 by maintainers)

Most upvoted comments

This is a big problem for us because our developers often test broken stuff in our development clusters and leave it there, which breaks the cluster-autoscaler and our other automation. Pod sets that are “too far gone” should not block eviction.

https://github.com/kubernetes/kubernetes/pull/105296 is proposed to always allow evicting not ready pods, which (I think) would address this issue

Yes, https://github.com/kubernetes/kubernetes/pull/113375 enables resolving this as an alpha feature, by setting spec.unhealthyPodEvictionPolicy to AlwaysAllow in the PDB
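As a sketch, a PDB opting into that policy might look like this (the name and selector are hypothetical, and while the feature is alpha the corresponding feature gate also has to be enabled):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: crashing
spec:
  maxUnavailable: 25%
  # Alpha field from the linked PR: allow evicting pods that are not Ready
  # even when the budget would otherwise block the eviction.
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app: crashing
```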

https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3017-pod-healthy-policy-for-pdb / https://github.com/kubernetes/enhancements/issues/3017 is where to track the progress of that feature to beta → GA

What happens is that the disruption controller decides allowedDisruptions based on the Ready condition of the pods. So while the PDB in the example above initially starts out allowing 1 disruption, whenever the container in one or more of the pods exits, this is reduced to 0. Note that the phase of the pod remains Running even while the containers are being restarted.
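A rough model of that arithmetic (this is not the actual controller code; it assumes the maxUnavailable percentage is rounded up, which matches the 1-disruption behavior described above):

```python
import math

def disruptions_allowed(expected: int, healthy: int, max_unavailable_pct: int) -> int:
    """Sketch of the PDB disruption controller's bookkeeping for a
    percentage maxUnavailable. desired_healthy is the minimum number of
    pods that must stay available; only the surplus may be disrupted."""
    max_unavailable = math.ceil(expected * max_unavailable_pct / 100)
    desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)

# 3 replicas, maxUnavailable 25%: with all pods Ready, 1 disruption is allowed.
print(disruptions_allowed(3, 3, 25))  # 1
# Once any pod is not Ready (e.g. CrashLoopBackOff), nothing may be evicted.
print(disruptions_allowed(3, 2, 25))  # 0
```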

The eviction controller looks at the phase of pods when deciding whether they can be evicted: pods in either the Failed or Succeeded phase are evicted without considering PDBs. But in the example above the pod phase remains Running, so eviction of these pods is always subject to the PDB. In the example, the PDB blocks eviction unless all three pods have the Ready condition set to True.
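For reference, an abridged and hypothetical status block for one of the crash-looping pods, showing why the eviction check never sees a terminal phase:

```yaml
# Abridged, illustrative `kubectl get pod -o yaml` output (field values invented).
status:
  phase: Running          # never Failed/Succeeded while kubelet keeps restarting it
  conditions:
  - type: Ready
    status: "False"       # this is what drops allowedDisruptions to 0
  containerStatuses:
  - name: main
    ready: false
    restartCount: 7
    state:
      waiting:
        reason: CrashLoopBackOff
```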

https://github.com/kubernetes/kubernetes/pull/83906 will allow deletion of pods in the Pending phase without looking at the PDB, but it will not fix this situation, since the pods remain in the Running phase. To allow eviction of pods in CrashLoopBackOff, the eviction controller would have to look at pod conditions and possibly the container statuses.

Thank you for fixing this long standing issue.

StatefulSets and Deployments control the number of pods in existence. As with Jobs, we could set the pod’s restartPolicy: Never, which should cause the pod to terminate when it crashes, and the StatefulSet would recreate it. But unlike Jobs, StatefulSets have no mechanism for controlling behavior on pod failures, so they would try to recreate the pod forever, as fast as possible. Perhaps the right path is to enhance StatefulSets/Deployments to throttle pod restarts, and to apply some of the learnings from Jobs, such as keeping a failed pod around so that it can be debugged.

restartPolicy:Never allows me to say there is no interesting state on the node when the pod is down.

We want to evict a pod that is currently unavailable (involuntarily disrupted). The PDB is trying to maintain a number of available pods, and evicting pods that are already unavailable does not defeat that purpose. But the PDB semantics are worded as “do not evict when the number of available pods is at or below the threshold”; it would make more sense if they said “do not evict when the number of available pods is at or below the threshold and the eviction would decrease the number of available pods.”

The argument that a pod in CrashLoopBackOff is running just because the kubelet says so is not a strong one.

I think the crux of the issue is this:

Pods in CrashLoopBackOff are effectively considered dead,

They are not. They are considered Running, and that’s how the Kubelet will account for them.

Arguably there were other things wrong, but Pods in CrashLoopBackOff or Pending are already disrupted, so refusing to evict them from a draining node makes no sense.

The kubelet has no reliable way to know why they are disrupted. A Pending pod is waiting to start; it might have just been scheduled. A CrashLoopBackOff pod could have restarted for any number of reasons: a failing liveness probe, a disruption on the host outside of Kubernetes’ control (such as the OOM killer), an application-level failure, etc.

Perhaps this is a user-expectation mismatch and we need to update the documentation? There might be an opportunity to improve the docs or UX around PodDisruptionBudget.

/kind documentation
/help