kubernetes: DeletionTimeStamp not set for some evicted pods
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: When a node starts to evict pods under disk pressure, the DeletionTimestamp for some evicted pods is not set and remains the zero value. Pods created through a Deployment hit this issue, while pods created through a DaemonSet have their DeletionTimestamp set properly.
What you expected to happen: Pods created through a Deployment should also have their DeletionTimestamp set properly.
How to reproduce it (as minimally and precisely as possible):
1. Write an app that watches the apiserver for pod-related events (a minimal sketch follows this list).
2. Deploy a Debian toolbox pod on one node using a Deployment.
3. Put that node under disk pressure, e.g. fill more than 90% of its disk, then consume additional disk space from inside the toolbox pod (installing a package set that uses a lot of disk space, such as gnome-core on Debian, works well).
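A minimal sketch of such a watch app, assuming a recent client-go (the namespace, kubeconfig handling, and output format are illustrative only):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; adjust for in-cluster use.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pod events in the "default" namespace (namespace is illustrative).
	w, err := client.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	// Print phase and deletionTimestamp for every pod event; for an evicted
	// Deployment pod the phase flips to Failed while deletionTimestamp stays nil.
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s\t%s\tphase=%s\tdeletionTimestamp=%v\n",
			event.Type, pod.Name, pod.Status.Phase, pod.DeletionTimestamp)
	}
}
```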
Anything else we need to know?: You can see events for the affected pod, but they only show the Phase updated to "Failed"; the DeletionTimestamp is never set and still has the zero value.
Environment:
- Kubernetes version (use `kubectl version`): 1.8.1
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Container Linux by CoreOS 1520.4.0
- Kernel (e.g. `uname -a`): Linux ip-10-150-64-105 4.13.3-coreos #1 SMP Wed Sep 20 22:17:11 UTC 2017 x86_64 Intel® Xeon® CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
About this issue
- State: open
- Created 7 years ago
- Reactions: 1
- Comments: 40 (29 by maintainers)
Yes, this is intentional. In order for evicted pods to be inspected after eviction, we do not remove the pod API object; otherwise it would appear that the pod simply disappeared. We do still stop and remove all containers, clean up cgroups, unmount volumes, etc., to ensure that we reclaim all resources that were in use by the pod. I don't think we set the deletion timestamp for DaemonSet pods; I suspect that the DaemonSet controller deletes evicted pods.
For something like StatefulSet, it’s actually necessary to immediately delete any Pods evicted by kubelet, so the Pod name can be reused. As @janetkuo also mentioned, DaemonSet does this as well. For such controllers, you’re thus not gaining anything from kubelet leaving the Pod record.
Even for something like ReplicaSet, it probably makes the most sense for the controller to delete Pods evicted by kubelet (though it doesn’t do that now, see #60162) to avoid carrying along Failed Pods indefinitely.
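For concreteness, a rough sketch of the kind of cleanup being suggested, assuming a recent client-go; the function name and field selectors are illustrative, and this is not code from any existing controller:

```go
package cleanup

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupFailedPods deletes Failed pods with restartPolicy: Always that are
// bound to the given node. Purely illustrative of the cleanup discussed above.
func cleanupFailedPods(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Both spec.nodeName and status.phase are supported pod field selectors.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("spec.nodeName=%s,status.phase=Failed", nodeName),
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Spec.RestartPolicy != corev1.RestartPolicyAlways {
			continue
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```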
So I would argue that in pretty much all cases, Pods with `restartPolicy: Always` that go to `Failed` should be expediently deleted by some controller, so users can't expect such Pods to stick around. If we can agree that some controller should delete them, the only question left is: which controller? I suggest that the Node controller makes the most sense: delete any `Failed` Pods with `restartPolicy: Always` that are scheduled to me. Otherwise, we effectively shift the responsibility to "all Pod/workload controllers that exist or ever will exist." Given the explosion of custom controllers that's coming thanks to CRD, I don't think it's prudent to put that responsibility on every controller author. With the `/eviction` subresource and Node drains, we have already set the precedent that your Pods might simply disappear (if the eviction succeeds, the Pod is deleted from the API server) at any time, without a trace.

I see the kubelet sync loop construct a pod status like what you describe if an internal module decides the pod should be evicted:
https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/kubelet_pods.go#L1293-L1305
The kubelet then syncs status back to the API server: https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/status/status_manager.go#L437-L488
But unless the pod’s deletion timestamp is already set, the kubelet won’t delete the pod: https://github.com/kubernetes/kubernetes/blob/b00c15f1a40162d46fc4b96f4e6714f20aef9e6c/pkg/kubelet/status/status_manager.go#L504-L509
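Paraphrasing that last guard as a simplified sketch (not the actual kubelet code): without a deletion timestamp already on the pod, the status manager never issues the DELETE, so an evicted Deployment pod stays behind as Failed. The pod name below is hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// canDeletePodObject mirrors the gist of the guard linked above: without a
// DeletionTimestamp the kubelet never deletes the pod API object.
// Simplified illustration, not the kubelet's actual check.
func canDeletePodObject(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}

func main() {
	evicted := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "toolbox-7d9f6c-abcde"}, // hypothetical pod name
		Status:     corev1.PodStatus{Phase: corev1.PodFailed, Reason: "Evicted"},
	}
	// Nothing ever set the deletion timestamp, so the object is left behind.
	fmt.Println(canDeletePodObject(evicted)) // false
}
```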
@kubernetes/sig-node-bugs it doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete the pod directly in that case, or to still go through posting a pod eviction (and should the pod disruption budget be honored in cases where the kubelet is out of resources)?
If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don’t do that today.
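As a sketch of that preference (illustrative only, not how any existing controller ranks pods for scale-down):

```go
package scaledown

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// preferFailedFirst orders pods so that Failed (e.g. evicted) pods come first,
// making them the first candidates to remove when scaling down.
// Illustrative sketch of the behavior suggested above, not existing code.
func preferFailedFirst(pods []corev1.Pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].Status.Phase == corev1.PodFailed &&
			pods[j].Status.Phase != corev1.PodFailed
	})
}
```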
DaemonSet controller actively deletes failed pods (#40330), to ensure that DaemonSet can recover from transient errors (#36482). Evicted DaemonSet pods get killed just because they’re also failed pods.
/remove-lifecycle stale