kubernetes: pod-eviction-timeout setting is ignored
What happened: I modified the pod-eviction-timeout setting of kube-controller-manager on the master node (in order to decrease the amount of time before k8s re-creates a pod in case of node failure). The default value is 5 minutes; I configured 30 seconds. Using the command sudo docker ps --no-trunc | grep "kube-controller-manager" I checked that the modification was applied successfully:
kubeadmin@nodetest21:~$ sudo docker ps --no-trunc | grep "kube-controller-manager"
387261c61ee9cebce50de2540e90b89e2bc710b4126a0c066ef41f0a1fb7cf38 sha256:0482f640093306a4de7073fde478cf3ca877b6fcc2c4957624dddb2d304daef5 "kube-controller-manager --address=127.0.0.1 --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf --client-ca-file=/etc/kubernetes/pki/ca.crt --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt --cluster-signing-key-file=/etc/kubernetes/pki/ca.key --controllers=*,bootstrapsigner,tokencleaner --kubeconfig=/etc/kubernetes/controller-manager.conf --leader-elect=true --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --root-ca-file=/etc/kubernetes/pki/ca.crt --service-account-private-key-file=/etc/kubernetes/pki/sa.key --use-service-account-credentials=true --pod-eviction-timeout=30s"
I applied a basic deployment with two replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        imagePullPolicy: IfNotPresent
        name: busybox
      restartPolicy: Always
The first pod was created on the first worker node and the second pod on the second worker node:
NAME STATUS ROLES AGE VERSION
nodetest21 Ready master 34m v1.13.3
nodetest22 Ready <none> 31m v1.13.3
nodetest23 Ready <none> 30m v1.13.3
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default busybox-74b487c57b-5s6g7 1/1 Running 0 13s 10.44.0.2 nodetest22 <none> <none>
default busybox-74b487c57b-6zdvv 1/1 Running 0 13s 10.36.0.1 nodetest23 <none> <none>
kube-system coredns-86c58d9df4-gmcjd 1/1 Running 0 34m 10.32.0.2 nodetest21 <none> <none>
kube-system coredns-86c58d9df4-wpffr 1/1 Running 0 34m 10.32.0.3 nodetest21 <none> <none>
kube-system etcd-nodetest21 1/1 Running 0 33m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-apiserver-nodetest21 1/1 Running 0 33m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-controller-manager-nodetest21 1/1 Running 0 20m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-proxy-6mcn8 1/1 Running 1 31m 10.0.1.5 nodetest22 <none> <none>
kube-system kube-proxy-dhdqj 1/1 Running 0 30m 10.0.1.6 nodetest23 <none> <none>
kube-system kube-proxy-vqjg8 1/1 Running 0 34m 10.0.1.4 nodetest21 <none> <none>
kube-system kube-scheduler-nodetest21 1/1 Running 1 33m 10.0.1.4 nodetest21 <none> <none>
kube-system weave-net-9qls7 2/2 Running 3 31m 10.0.1.5 nodetest22 <none> <none>
kube-system weave-net-h2cb6 2/2 Running 0 33m 10.0.1.4 nodetest21 <none> <none>
kube-system weave-net-vkb62 2/2 Running 0 30m 10.0.1.6 nodetest23 <none> <none>
To test that pod eviction works correctly, I shut down the first worker node. After ~1 min the status of the first worker node changed to "NotReady"; then I had to wait more than 5 minutes (the default pod eviction timeout) for the pod on the powered-off node to be re-created on the other node.
What you expected to happen: After the node status reports "NotReady", the pod should be re-created on the other node after 30 seconds instead of the default 5 minutes!
How to reproduce it (as minimally and precisely as possible):
Create three nodes. Initialize Kubernetes on the first node (sudo kubeadm init), apply the network plugin (kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"), then join the other two nodes (e.g. kubeadm join 10.0.1.4:6443 --token xdx9y1.z7jc0j7c8g8lpjog --discovery-token-ca-cert-hash sha256:04ae8388f607755c14eed702a23fd47802d5512e092b08add57040a2ae0736ac).
Add the pod-eviction-timeout parameter to the kube-controller-manager manifest on the master node (sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml):
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --address=127.0.0.1
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
    - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
    - --controllers=*,bootstrapsigner,tokencleaner
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --use-service-account-credentials=true
    - --pod-eviction-timeout=30s
(the YAML is truncated; only the relevant first part is shown here).
Check that the setting is applied:
sudo docker ps --no-trunc | grep "kube-controller-manager"
Apply a deployment with two replicas and check that one pod is created on the first worker node and the other on the second worker node. Shut down one of the worker nodes and measure the time between the node reporting "NotReady" and the pod being re-created.
Anything else we need to know?: I experience the same issue in a multi-master environment as well.
Environment:
- Kubernetes version (use kubectl version): v1.13.3
  Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Azure VM
- OS (e.g: cat /etc/os-release): NAME="Ubuntu" VERSION="16.04.5 LTS (Xenial Xerus)"
- Kernel (e.g. uname -a): Linux nodetest21 4.15.0-1037-azure #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Others: Docker v18.06.1-ce
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (4 by maintainers)
Commits related to this issue
- Deprecate the `podEvictionTimeout` field in favor of newly introduced kube-apiserver fields. The kube-controller-manager flag `--pod-eviction-timeout` is deprecated in favor of the kube-apiserver flag... — committed to gardener/gardener by ialidzhikov 2 years ago (the same change appears in several follow-up commits to gardener/gardener and ialidzhikov/gardener)
- Support for Kubernetes v1.26 (#7275) * Allow instantiating v1.26 Kubernetes clients * Update `README.md` and `docs/usage/supported_k8s_versions.md` for the K8s 1.26 * Maintain Kubernetes feature ga... — committed to gardener/gardener by ialidzhikov a year ago
Thanks for your feedback, ChiefAlexander! That is exactly the situation you described. I checked the pods, and indeed the default toleration values are assigned to them:
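(The original output is not preserved here; for reference, the defaults that Kubernetes injects into the pod spec look roughly like this, i.e. 300 seconds for both the not-ready and unreachable taints:)
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300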
So I simply added my own values to the deployment:
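(The original snippet is likewise not preserved; below is a minimal sketch of what such tolerations could look like under spec.template.spec of the busybox deployment above, with a 2-second value assumed to match the behaviour described next:)
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 2
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 2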
After applying the deployment, in case of node failure the node status changes to "NotReady" and the pods are re-created after 2 seconds.
So we don't have to deal with pod-eviction-timeout anymore; the timeout can be set on a per-Pod basis! Cool!
Thanks again for your help!
Is it possible to make it global? I don't want to set that in every pod config, especially since I use a lot of pre-made things from Helm.
Looking into this more: with TaintBasedEvictions set to true, you can set a pod's eviction time within its spec under tolerations: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions The default values are set by an admission controller: https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/defaulttolerationseconds/admission.go#L34 Those two defaults can be set via kube-apiserver flags and should achieve the same effect.
I also ran into this issue while testing a lower eviction timeout. After poking around at this for some time I figured out that the cause is the new TaintBasedEvictions feature.
Setting the feature gate to false causes pods to be evicted as expected. I have not taken the time to search through the taint-based eviction code, but I would guess that the eviction timeout flag is not used there.
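Roughly, disabling it means adding something like this to the kube-controller-manager command in its static pod manifest (a sketch only; the gate syntax is as of v1.13 and the 30s timeout is just the value from the report above):
spec:
  containers:
  - command:
    - kube-controller-manager
    - --feature-gates=TaintBasedEvictions=false
    - --pod-eviction-timeout=30s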
You may need to set this on the kube-apiserver (see the --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds flags mentioned below).
Why has this bug been marked as closed? It looks like the original issue is not solved, only worked around. It is not clear to me why the pod-eviction-timeout flag is not working.
+1 for having the possibility to configure it for the whole cluster. Tuning per pod or per deployment is rarely useful: in most cases a sane global value is way more convenient, and the current default of 5m is way too long for many cases.
Please, please reopen this issue.
I think that you can configure global pod eviction via the apiserver: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/ I didn't try this, but as far as I can see there are the options --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds.
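(A sketch of how those flags could be set in the kube-apiserver static pod manifest, /etc/kubernetes/manifests/kube-apiserver.yaml; the 30-second values are only an example:)
spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30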
I am facing the same problem. Is there a way to disable taint-based evictions so that pod-eviction-timeout works globally?