kubernetes: DeletionTimeStamp not set for some evicted pods
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: When a node starts to evict pods under disk pressure, the DeletionTimestamp for some evicted pods is not set and remains the zero value. Pods created through a Deployment hit this issue, while pods created through a DaemonSet have their DeletionTimestamp set properly.
What you expected to happen: Pods created through a Deployment should also have their DeletionTimestamp set properly.
How to reproduce it (as minimally and precisely as possible):
1. Write an app that watches the apiserver for pod-related events (a minimal sketch follows this list).
2. Deploy a Debian toolbox pod on one node using a Deployment.
3. Put that node under disk pressure, e.g. fill more than 90% of its disk, then consume additional disk space from inside the toolbox pod (installing a package set that uses a lot of disk space, such as gnome-core on Debian, works well).
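A minimal sketch of such a watch app, assuming a recent client-go (the namespace, kubeconfig handling, and output format are illustrative only):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; adjust for in-cluster use.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pod events in the "default" namespace (namespace is illustrative).
	w, err := client.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// Print phase and deletionTimestamp for every pod event; for an evicted
	// Deployment pod the phase flips to Failed while deletionTimestamp stays nil.
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s\t%s\tphase=%s\tdeletionTimestamp=%v\n",
			event.Type, pod.Name, pod.Status.Phase, pod.DeletionTimestamp)
	}
}
```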
Anything else we need to know?: You can see events for the affected pod, but they only show the Phase updated to "Failed"; the DeletionTimestamp is never set and still has the zero value.
Environment:
- Kubernetes version (use `kubectl version`): 1.8.1
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Container Linux by CoreOS 1520.4.0
- Kernel (e.g. `uname -a`): Linux ip-10-150-64-105 4.13.3-coreos #1 SMP Wed Sep 20 22:17:11 UTC 2017 x86_64 Intel® Xeon® CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
About this issue
- State: open
- Created 7 years ago
- Reactions: 1
- Comments: 40 (29 by maintainers)
Yes, this is intentional. In order for evicted pods to be inspected after eviction, we do not remove the pod API object; otherwise it would appear that the pod simply disappeared. We do still stop and remove all containers, clean up cgroups, unmount volumes, etc., to ensure that we reclaim all resources that were in use by the pod. I don't think we set the deletion timestamp for DaemonSet pods; I suspect that the DaemonSet controller deletes evicted pods.
For something like StatefulSet, it’s actually necessary to immediately delete any Pods evicted by kubelet, so the Pod name can be reused. As @janetkuo also mentioned, DaemonSet does this as well. For such controllers, you’re thus not gaining anything from kubelet leaving the Pod record.
Even for something like ReplicaSet, it probably makes the most sense for the controller to delete Pods evicted by kubelet (though it doesn’t do that now, see #60162) to avoid carrying along Failed Pods indefinitely.
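For concreteness, a rough sketch of the kind of cleanup being suggested, assuming a recent client-go; the function name and field selectors are illustrative, and this is not code from any existing controller:

```go
package cleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFailedPods deletes Failed pods with restartPolicy: Always that are
// bound to the given node. Purely illustrative of the cleanup discussed above.
func cleanupFailedPods(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Both spec.nodeName and status.phase are supported pod field selectors.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s,status.phase=Failed", nodeName),
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Spec.RestartPolicy != corev1.RestartPolicyAlways {
			continue
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```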
So I would argue that in pretty much all cases, Pods with `restartPolicy: Always` that go to `Failed` should be expediently deleted by some controller, so users can't expect such Pods to stick around. If we can agree that some controller should delete them, the only question left is: which controller? I suggest that the Node controller makes the most sense: delete any `Failed` Pods with `restartPolicy: Always` that are scheduled to me. Otherwise, we effectively shift the responsibility to "all Pod/workload controllers that exist or ever will exist." Given the explosion of custom controllers that's coming thanks to CRD, I don't think it's prudent to put that responsibility on every controller author. With the `/eviction` subresource and Node drains, we have already set the precedent that your Pods might simply disappear (if the eviction succeeds, the Pod is deleted from the API server) at any time, without a trace.

I see the kubelet sync loop construct a pod status like what you describe if an internal module decides the pod should be evicted:
https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/kubelet_pods.go#L1293-L1305
The kubelet then syncs status back to the API server: https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/status/status_manager.go#L437-L488
But unless the pod’s deletion timestamp is already set, the kubelet won’t delete the pod: https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/status/status_manager.go#L504-L509
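Paraphrasing that last guard as a simplified sketch (not the actual kubelet code): without a deletion timestamp already on the pod, the status manager never issues the DELETE, so an evicted Deployment pod stays behind as Failed. The pod name below is hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// canDeletePodObject mirrors the gist of the guard linked above: without a
// DeletionTimestamp the kubelet never deletes the pod API object.
// Simplified illustration, not the kubelet's actual check.
func canDeletePodObject(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func main() {
	evicted := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "toolbox-7d9f6c-abcde"}, // hypothetical pod name
		Status:     corev1.PodStatus{Phase: corev1.PodFailed, Reason: "Evicted"},
	}
	// Nothing ever set the deletion timestamp, so the object is left behind.
	fmt.Println(canDeletePodObject(evicted)) // false
}
```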
@kubernetes/sig-node-bugs it doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete the pod directly in that case, or to still go through posting a pod eviction (and should the pod disruption budget be honored in cases where the kubelet is out of resources)?
If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don’t do that today.
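As a sketch of that preference (illustrative only, not how any existing controller ranks pods for scale-down):

```go
package scaledown

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// preferFailedFirst orders pods so that Failed (e.g. evicted) pods come first,
// making them the first candidates to remove when scaling down.
// Illustrative sketch of the behavior suggested above, not existing code.
func preferFailedFirst(pods []corev1.Pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].Status.Phase == corev1.PodFailed &&
			pods[j].Status.Phase != corev1.PodFailed
	})
}
```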
DaemonSet controller actively deletes failed pods (#40330), to ensure that DaemonSet can recover from transient errors (#36482). Evicted DaemonSet pods get killed just because they’re also failed pods.
/remove-lifecycle stale