kubernetes: Kubelet does not delete evicted pods

/kind feature

What happened: The kubelet evicted pods due to disk pressure. Eventually the disk pressure went away and the pods were scheduled and started again, but the evicted pods remained in the list of pods (kubectl get pod --show-all).

What you expected to happen: Wouldn't it be better if the kubelet deleted those evicted pods? The expected behaviour is therefore to no longer see the evicted pods, i.e. for them to be deleted.

How to reproduce it (as minimally and precisely as possible): Start the kubelet with --eviction-hard and --eviction-soft set to high thresholds (see the sketch below), or fill up the disk of a worker node.
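A minimal sketch of such kubelet flags, assuming the node's root filesystem is already fairly full so the thresholds trip quickly; the values are illustrative only and should be added to your usual kubelet arguments:

```sh
# Illustrative eviction thresholds only; tune the percentages to your node.
kubelet \
  --eviction-hard='nodefs.available<90%' \
  --eviction-soft='nodefs.available<95%' \
  --eviction-soft-grace-period='nodefs.available=30s'
```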

Environment:

  • Kubernetes version (use kubectl version): 1.8.2
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Container Linux 1465.7.0 (Ladybug)
  • Kernel (e.g. uname -a): 4.12.10-coreos

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 21
  • Comments: 17 (8 by maintainers)

Most upvoted comments

A quick workaround we use is to delete all evicted pods manually after an incident:

```sh
kubectl get pods --all-namespaces -ojson \
  | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' \
  | xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
```

Not as nice as an automatic delete, but it works. (Tested with 1.6.7; I heard that on 1.7 you need to add --show-all.)
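On newer clusters a similar cleanup can likely be done without jq, since kubectl delete later gained --field-selector support (not available in the 1.6/1.7 versions discussed here); note that this removes all Failed pods, not only the Evicted ones:

```sh
# Assumes a kubectl recent enough to support --field-selector on delete.
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
```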

I suppose this issue can be closed, because deletion of evicted pods can be controlled through kube-controller-manager settings.

For those k8s users who hit kube-apiserver or etcd performance issues due to too many evicted pods, I would recommend updating the kube-controller-manager config to set --terminated-pod-gc-threshold 100 or a similarly small value. The default GC threshold is 12500, which is too high for most etcd installations; reading 12500 pod records from etcd takes seconds to complete.

Also ask yourself why there are so many evicted pods in the first place. Maybe your kube-scheduler keeps scheduling pods onto a node that already reports DiskPressure or MemoryPressure? This could be the case if the kube-scheduler is configured with a custom --policy-config-file that has no CheckNodeMemoryPressure or CheckNodeDiskPressure in its list of policy predicates.
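A quick way to check whether pods are indeed landing on nodes that report pressure is to look at the node conditions; a small sketch (the grep pattern is just illustrative):

```sh
# List the DiskPressure/MemoryPressure conditions reported by each node.
kubectl describe nodes | grep -E 'MemoryPressure|DiskPressure'
```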

```
$ kube-controller-manager --help 2>&1 | grep terminated
      --terminated-pod-gc-threshold int32   Number of terminated pods that can exist before the terminated pod garbage collector starts deleting terminated pods. If <= 0, the terminated pod garbage collector is disabled. (default 12500)
```
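How the flag is applied depends on how kube-controller-manager is deployed; the snippet below is a minimal sketch, assuming either a kubeadm-style static pod or a directly managed binary (the manifest path shown is the common kubeadm default, not something taken from this issue):

```sh
# Sketch only: pass the flag to kube-controller-manager. With a kubeadm-style
# control plane this usually means editing the static pod manifest
# (commonly /etc/kubernetes/manifests/kube-controller-manager.yaml) and adding:
#   --terminated-pod-gc-threshold=100
# For a directly managed binary, append the flag to the existing command line:
kube-controller-manager --terminated-pod-gc-threshold=100  # plus your existing flags
```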

Why does Kubernetes keep evicted pods, and what is the purpose of this design?

@so0k I created a CronJob using a YAML file with this config (see https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/):



```yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: delete-failed-pods
spec:
  schedule: "*/30 * * * *"
  failedJobsHistoryLimit: 1
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl-runner
            image: wernight/kubectl
            command: ["sh", "-c", "kubectl get pods --all-namespaces --field-selector 'status.phase==Failed' -o json | kubectl delete -f -"]
          restartPolicy: OnFailure
```


Create the task with kubectl create -f "PATH_TO_cronjob.yaml"

Check the status of the task with kubectl get cronjob delete-failed-pods

Delete the task with kubectl delete cronjob delete-failed-pods

@kabakaev - wouldn't pod GC cover all terminated pods (including pods terminated for other reasons)? What if we just want evicted pods to be cleaned up periodically?