kubernetes: pod-eviction-timeout setting is ignored
What happened: I modified the pod-eviction-timeout setting of kube-controller-manager on the master node (in order to decrease the amount of time before k8s re-creates a pod in case of node failure). The default value is 5 minutes; I configured 30 seconds. Using the command sudo docker ps --no-trunc | grep "kube-controller-manager" I checked that the modification was applied successfully:
kubeadmin@nodetest21:~$ sudo docker ps --no-trunc | grep "kube-controller-manager"
387261c61ee9cebce50de2540e90b89e2bc710b4126a0c066ef41f0a1fb7cf38 sha256:0482f640093306a4de7073fde478cf3ca877b6fcc2c4957624dddb2d304daef5 "kube-controller-manager --address=127.0.0.1 --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf --client-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key --controllers=*,bootstrapsigner,tokencleaner --kubeconfig=/etc/kubernetes/controller-manager.conf --leader-elect=true --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --root-ca-file=/etc/kubernetes/pki/ca.crt --service-account-private-key-file=/etc/kubernetes/pki/sa.key --use-service-account-credentials=true --pod-eviction-timeout=30s"
I applied a basic deployment with two replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        imagePullPolicy: IfNotPresent
        name: busybox
      restartPolicy: Always
The first pod was created on the first worker node and the second pod on the second worker node:
NAME STATUS ROLES AGE VERSION
nodetest21 Ready master 34m v1.13.3
nodetest22 Ready <none> 31m v1.13.3
nodetest23 Ready <none> 30m v1.13.3
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default busybox-74b487c57b-5s6g7 1/1 Running 0 13s 10.44.0.2 nodetest22 <none> <none>
default busybox-74b487c57b-6zdvv 1/1 Running 0 13s 10.36.0.1 nodetest23 <none> <none>
kube-system coredns-86c58d9df4-gmcjd 1/1 Running 0 34m 10.32.0.2 nodetest21 <none> <none>
kube-system coredns-86c58d9df4-wpffr 1/1 Running 0 34m 10.32.0.3 nodetest21 <none> <none>
kube-system etcd-nodetest21 1/1 Running 0 33m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-apiserver-nodetest21 1/1 Running 0 33m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-controller-manager-nodetest21 1/1 Running 0 20m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-proxy-6mcn8 1/1 Running 1 31m 10.0.1.5 nodetest22 <none> <none>
kube-system kube-proxy-dhdqj 1/1 Running 0 30m 10.0.1.6 nodetest23 <none> <none>
kube-system kube-proxy-vqjg8 1/1 Running 0 34m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-scheduler-nodetest21 1/1 Running 1 33m 10.0.1.4 nodetest21 <none> <none>
kube-system weave-net-9qls7 2/2 Running 3 31m 10.0.1.5 nodetest22 <none> <none>
kube-system weave-net-h2cb6 2/2 Running 0 33m 10.0.1.4 nodetest21 <none> <none>
kube-system weave-net-vkb62 2/2 Running 0 30m 10.0.1.6 nodetest23 <none> <none>
To test that pod eviction works correctly, I shut down the first worker node. After ~1 min the status of the first worker node changed to "NotReady"; then I had to wait more than 5 minutes (the default pod eviction timeout) for the pod on the powered-off node to be re-created on the other node.
What you expected to happen: After the node status reports "NotReady", the pod should be re-created on the other node after 30 seconds instead of the default 5 minutes!
How to reproduce it (as minimally and precisely as possible):
Create three nodes. Initialize Kubernetes on the first node (sudo kubeadm init), apply the network plugin (kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"), then join the other two nodes (e.g. kubeadm join 10.0.1.4:6443 --token xdx9y1.z7jc0j7c8g8lpjog --discovery-token-ca-cert-hash sha256:04ae8388f607755c14eed702a23fd47802d5512e092b08add57040a2ae0736ac).
Add the pod-eviction-timeout parameter to the kube-controller-manager manifest on the master node (sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml):
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=127.0.0.1
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --pod-eviction-timeout=30s
(the YAML is truncated; only the relevant first part is shown here).
Check that the setting is applied:
sudo docker ps --no-trunc | grep "kube-controller-manager"
Apply a deployment with two replicas and check that one pod is created on the first worker node and the other on the second worker node. Shut down one of the worker nodes and measure the time between the node reporting "NotReady" and the pod being re-created.
Anything else we need to know?: I experience the same issue in a multi-master environment as well.
Environment:
- Kubernetes version (use kubectl version): v1.13.3
  Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Azure VM
- OS (e.g: cat /etc/os-release): NAME="Ubuntu" VERSION="16.04.5 LTS (Xenial Xerus)"
- Kernel (e.g. uname -a): Linux nodetest21 4.15.0-1037-azure #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others: Docker v18.06.1-ce
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (4 by maintainers)
Commits related to this issue
- Deprecate the `podEvictionTimeout` field in favor of newly introduced kube-apiserver fields. The kube-controller-manager flag `--pod-eviction-timeout` is deprecated in favor of the kube-apiserver flag... — committed to gardener/gardener by ialidzhikov 2 years ago (the same change appears in several follow-up commits to gardener/gardener and ialidzhikov/gardener)
- Support for Kubernetes v1.26 (#7275) * Allow instantiating v1.26 Kubernetes clients * Update `README.md` and `docs/usage/supported_k8s_versions.md` for the K8s 1.26 * Maintain Kubernetes feature ga... — committed to gardener/gardener by ialidzhikov a year ago
Thanks for your feedback, ChiefAlexander! That is exactly the situation you described. I checked the pods, and indeed the default toleration values are assigned to them:
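(The original output is not preserved here; for reference, the defaults that Kubernetes injects into the pod spec look roughly like this, i.e. 300 seconds for both the not-ready and unreachable taints:)
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300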
So I simply added my own values to the deployment:
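(The original snippet is likewise not preserved; below is a minimal sketch of what such tolerations could look like under spec.template.spec of the busybox deployment above, with a 2-second value assumed to match the behaviour described next:)
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 2
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 2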
After applying the deployment, in case of node failure the node status changes to "NotReady" and the pods are re-created after 2 seconds.
So we don't have to deal with pod-eviction-timeout anymore; the timeout can be set on a per-Pod basis! Cool!
Thanks again for your help!
Is it possible to make it global? I don't want to set that in every pod config, especially since I use a lot of pre-made things from Helm.
Looking into this more: with TaintBasedEvictions set to true, you can set a pod's eviction time within its spec under tolerations: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions The default values are set by an admission controller: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/defaulttolerationseconds/admission.go#L34 Those two defaults can be set via kube-apiserver flags and should achieve the same effect.
I also ran into this issue while testing a lower eviction timeout. After poking around at this for some time I figured out that the cause is the new TaintBasedEvictions feature.
Setting the feature gate to false causes pods to be evicted as expected. I have not taken the time to search through the taint-based eviction code, but I would guess that the eviction timeout flag is not used there.
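Roughly, disabling it means adding something like this to the kube-controller-manager command in its static pod manifest (a sketch only; the gate syntax is as of v1.13 and the 30s timeout is just the value from the report above):
spec:
  containers:
  - command:
    - kube-controller-manager
    - --feature-gates=TaintBasedEvictions=false
    - --pod-eviction-timeout=30s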
You may need to set this on the kube-apiserver (see the --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds flags mentioned below).
Why has this bug been marked as closed? It looks like the original issue is not solved, only worked around. It is not clear to me why the pod-eviction-timeout flag is not working.
+1 for having the possibility to configure it for the whole cluster. Tuning per pod or per deployment is rarely useful: in most cases a sane global value is way more convenient, and the current default of 5m is way too long for many cases.
Please, please reopen this issue.
I think that you can configure global pod eviction via the apiserver: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ I didn't try this, but as far as I can see there are the options --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.
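(A sketch of how those flags could be set in the kube-apiserver static pod manifest, /etc/kubernetes/manifests/kube-apiserver.yaml; the 30-second values are only an example:)
spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30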
I am facing the same problem. Is there a way to disable taint-based evictions so that pod-eviction-timeout works globally?