kubernetes: Kubelet does not delete evicted pods
/kind feature
What happened:
Kubelet has evicted pods due to disk pressure. Eventually, the disk pressure went away and the pods were scheduled and started again, but the evicted pods remained in the list of pods (`kubectl get pod --show-all`).
What you expected to happen: Wouldn't it be better if the kubelet deleted those evicted pods? The expected behaviour is therefore to not see the evicted pods anymore, i.e. that they get deleted.
How to reproduce it (as minimally and precisely as possible):
Start kubelet with `--eviction-hard` and `--eviction-soft` set to high thresholds, or fill up the disk of a worker node.
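For example, one way to make eviction trigger easily is to use aggressive thresholds. The flag names and the `nodefs.available` signal are real kubelet options; the threshold values below are illustrative only and not taken from the original report:

```sh
# Illustrative values only: thresholds this high mean the node reports
# DiskPressure under almost any disk usage, so evictions start quickly.
# --eviction-soft must be paired with --eviction-soft-grace-period.
kubelet \
  --eviction-hard='nodefs.available<90%' \
  --eviction-soft='nodefs.available<95%' \
  --eviction-soft-grace-period='nodefs.available=30s'
  # ...plus the node's usual kubelet flags
```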
Environment:
- Kubernetes version (use `kubectl version`): 1.8.2
- Cloud provider or hardware configuration: AWS
- OS (e.g. from /etc/os-release): Container Linux 1465.7.0 (Ladybug)
- Kernel (e.g. `uname -a`): 4.12.10-coreos
About this issue
- State: closed
- Created 7 years ago
- Reactions: 21
- Comments: 17 (8 by maintainers)
A quick workaround we use is to delete all evicted pods manually after an incident:
kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
Not as nice as an automatic delete, but it works. (Tested with 1.6.7; I heard that in 1.7 you need to add `--show-all`.)

I suppose this issue can be closed, because deletion of evicted pods can be controlled through settings in kube-controller-manager.
For those k8s users who hit kube-apiserver or etcd performance issues due to too many evicted pods, I would recommend updating the kube-controller-manager config to set `--terminated-pod-gc-threshold 100` or a similarly small value (see the sketch after this comment). The default GC threshold is 12500, which is too high for most etcd installations; reading 12500 pod records from etcd takes seconds to complete.

Also ask yourself why there are so many evicted pods. Maybe your kube-scheduler keeps scheduling pods on a node which already reports DiskPressure or MemoryPressure? This could be the case if the kube-scheduler is configured with a custom `--policy-config-file` which has no CheckNodeMemoryPressure or CheckNodeDiskPressure in the list of policy predicates.

Why does Kubernetes keep evicted pods, and what is the purpose of this design?
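A minimal sketch of the controller-manager setting suggested above. `--terminated-pod-gc-threshold` is the real kube-controller-manager flag; the rest of the invocation is illustrative and would normally live in your static pod manifest or systemd unit:

```sh
# Illustrative invocation: only --terminated-pod-gc-threshold is the point here.
# Once more than 100 terminated (Failed/Succeeded) pods exist, the pod GC
# controller starts deleting the oldest ones.
kube-controller-manager \
  --kubeconfig=/etc/kubernetes/controller-manager.kubeconfig \
  --terminated-pod-gc-threshold=100
  # ...plus your cluster's usual controller-manager flags
```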
@so0k I created a cron job using a YAML file with this config (see https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/):
    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: delete-failed-pods
    spec:
      schedule: "*/30 * * * *"
      failedJobsHistoryLimit: 1
      successfulJobsHistoryLimit: 1
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: kubectl-runner
                image: wernight/kubectl
                command: ["sh", "-c", "kubectl get pods --all-namespaces --field-selector 'status.phase==Failed' -o json | kubectl delete -f -"]
              restartPolicy: OnFailure
Create the task with `kubectl create -f "PATH_TO_cronjob.yaml"`.
Check the status of the task with `kubectl get cronjob delete-failed-pods`.
Delete the task with `kubectl delete cronjob delete-failed-pods`.
@kabakaev - wouldn't pod GC cover all terminated pods (including pods terminated for other reasons)? What if we just want evicted pods to be cleaned up periodically?
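One way to clean up only evicted pods and leave other terminated pods to the normal pod GC is to filter on `status.reason` explicitly, along the lines of the jq one-liner earlier in the thread. This is a sketch of that approach, not a built-in controller behaviour:

```sh
# Sketch: delete only pods whose status.reason is "Evicted".
# Other Failed/Succeeded pods are left for the regular pod GC to handle.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.status.reason == "Evicted") | "\(.metadata.namespace) \(.metadata.name)"' \
  | xargs -n2 sh -c 'kubectl delete pod "$1" --namespace="$0"'
```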
The StatefulSet controller will auto-delete Failed pods: https://github.com/kubernetes/kubernetes/blob/52eea971c57580c6b1b74f0a12bf9cc6083a4d6b/pkg/controller/statefulset/stateful_set_control.go#L386-L393. For now, Deployment and DaemonSet do not do this.