kubernetes: kubernetes reliability/reactivity on node failure does not work properly

When a node fails, the default behavior looks like this:

  • The kubelet updates its status to the apiserver periodically, as specified by --node-status-update-frequency. The default value is 10s.
  • The Kubernetes controller manager checks the kubelet statuses every --node-monitor-period. The default value is 5s.
  • If the status has been updated within --node-monitor-grace-period, the controller manager considers the kubelet healthy. The default value is 40s.
  • After --node-monitor-grace-period it considers the node unhealthy and evicts its pods based on --pod-eviction-timeout. The default value is 5m0s.

I have tried with --node-status-update-frequency set to 4s (10s is the default) via KUBELET_EXTRA_ARGS on the kubelet, --node-monitor-period set to 2s (5s is the default), --node-monitor-grace-period set to 20s (40s is the default), and --pod-eviction-timeout set to 30s (5m is the default). The last three settings were made on the kube-controller-manager (/etc/kubernetes/manifests/kube-controller-manager.yaml).
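For reference, a sketch of how the last three flags might look in that static pod manifest, assuming a standard kubeadm layout (only the relevant part of the command list is shown; the kubelet flag is passed separately, e.g. via KUBELET_EXTRA_ARGS="--node-status-update-frequency=4s"):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-period=2s
    - --node-monitor-grace-period=20s
    - --pod-eviction-timeout=30s
    # ... all other kubeadm-generated flags left unchanged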

What happened: Pods are evicted after 5m, as if no configuration had been provided.

What you expected to happen: In this scenario, pods should be evicted in about 50s, because the node is considered down after 20s and --pod-eviction-timeout kicks in 30s later.

How to reproduce it (as minimally and precisely as possible): Apply a configuration such as the one above, for example: --node-status-update-frequency set to 4s (kubelet), --node-monitor-period set to 2s, --node-monitor-grace-period set to 20s, --pod-eviction-timeout set to 30s.

Run a pod on a worker node, then brutally shut down that node while the pod is running. Wait: after about 5 min (not 50s) the pod is restarted on another worker.

Anything else we need to know?: With exactly the same configuration, software stack, hardware, and test procedure, k8s v1.12.4 works very well! Pods are evicted in about 50s.

With both k8s v1.12.4 and v1.13.1 the controller-manager containers were inspected (with docker) and always showed the configuration provided: --node-monitor-period=2s, --node-monitor-grace-period=20s, --pod-eviction-timeout=30s.

Environment:

  • Kubernetes version (use kubectl version): 1.13.1
  • Cloud provider or hardware configuration: On-premises installation
  • OS (e.g. from /etc/os-release): CentOS Linux 7
  • Kernel (e.g. uname -a): Linux 3.10.0-957.1.3.el7.x86_64
  • Install tools: None
  • Others: Multi-master (3 nodes) configuration with internal etcd, 3 worker nodes, and an external load balancer.

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 5
  • Comments: 26 (3 by maintainers)

Most upvoted comments

I “simply” believe that a bug was introduced in the handling of the monitor-period options in the transition from 1.12 to 1.13. It would be useful to know the roadmap for solving this important problem!

UP!

Regarding the pod-eviction-timeout issue, I was facing the same problem with k8s 1.15.0 and then read about the change mentioned below.

“In version 1.13, the TaintBasedEvictions feature is promoted to beta and enabled by default, hence the taints are automatically added by the NodeController (or kubelet) and the normal logic for evicting pods from nodes based on the Ready NodeCondition is disabled”
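In practice this means that, unless a pod overrides them, the DefaultTolerationSeconds admission plugin adds not-ready/unreachable tolerations of 300 seconds to every pod, which is why eviction still takes about 5 minutes no matter what --pod-eviction-timeout is set to. A sketch of what this looks like on a pod (check with kubectl get pod <name> -o yaml):

# Tolerations added automatically to a pod by the DefaultTolerationSeconds
# admission plugin (default values shown)
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300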

I got it working by setting the flags below (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/):

--default-not-ready-toleration-seconds=30 --default-unreachable-toleration-seconds=30
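These are kube-apiserver flags, so a sketch of where they would go on a kubeadm cluster (static pod manifest layout assumed):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30
    # ... all other kubeadm-generated flags left unchanged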

The pod-eviction-timeout took effect after setting TaintBasedEvictions to false (- --feature-gates=TaintBasedEvictions=false in the kube-controller-manager manifest), because the default value has been true since 1.13. Here we can find all the feature gates and their default values for each version.
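For clarity, a sketch of how disabling the gate looks in the controller-manager static pod manifest (kubeadm layout assumed):

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --feature-gates=TaintBasedEvictions=false
    - --pod-eviction-timeout=30s   # honored again once taint-based eviction is disabled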

So I think the new TaintBasedEvictions feature has broken the legacy pod-eviction functionality. But I still have a problem with TaintBasedEvictions. In fact, my new kubelet extra args configuration is:

  • --feature-gates="TaintBasedEvictions=true,TaintNodesByCondition=true,NodeLease=true"

I have this Deployment:

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 1
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 30
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 30

Everything seems to work well, but I realized that the state of my nodes is constantly oscillating.

NAME               STATUS     ROLES    AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION              CONTAINER-RUNTIME
master-servers-1   Ready      master   115m   v1.13.3   192.168.56.110   <none>        CentOS Linux 7 (Core)   3.10.0-957.1.3.el7.x86_64   docker://18.9.2
master-servers-2   Ready      master   113m   v1.13.3   192.168.56.111   <none>        CentOS Linux 7 (Core)   3.10.0-957.1.3.el7.x86_64   docker://18.9.2
master-servers-3   NotReady   master   113m   v1.13.3   192.168.56.112   <none>        CentOS Linux 7 (Core)   3.10.0-957.1.3.el7.x86_64   docker://18.9.2
worker-servers-1   Ready      worker   112m   v1.13.3   192.168.56.113   <none>        CentOS Linux 7 (Core)   3.10.0-957.1.3.el7.x86_64   docker://18.9.2
worker-servers-2   Ready      worker   112m   v1.13.3   192.168.56.114   <none>        CentOS Linux 7 (Core)   3.10.0-957.1.3.el7.x86_64   docker://18.9.2

In fact, on all nodes (3 masters, 2 workers, installed with kubeadm), the kubelet logs (journalctl -u kubelet -f) show this error:

mar 01 12:01:36 master-servers-2 kubelet[25763]: E0301 12:01:36.252570 25763 controller.go:115] failed to ensure node lease exists, will retry in 7s, error: leases.coordination.k8s.io "master-servers-2" is forbidden: User "system:node:master-servers-2" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": disabled by feature gate NodeLease

Permission issue?!

Also the documentation states that: “When node lease feature is enabled, each node has an associated Lease object in kube-node-lease namespace that is renewed by the node periodically, and both NodeStatus and node lease are treated as heartbeats from the node.”

But I don't have a kube-node-lease namespace. 😦
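For what it's worth, that message does not look like an RBAC problem: "disabled by feature gate NodeLease" means the kube-apiserver itself does not have the NodeLease gate enabled, so the kubelets' lease requests are rejected. A sketch of also enabling the gate on the API server (kubeadm layout assumed):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --feature-gates=NodeLease=true
    # ... all other kubeadm-generated flags left unchanged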

When a node is considered failed, its pods are still considered healthy. Why not mark the pods of a NotReady node as unhealthy when the node is set to NotReady? What happened:

NAME    STATUS     ROLES    AGE   VERSION
yld-0   Ready      master   17d   v1.13.3
yld-1   NotReady   <none>   17d   v1.13.3
kubectl get po -w
NAME                    READY   STATUS    RESTARTS   AGE
curl-649649b6c7-gcc4k   1/1     Running   0          10m

What you expected to happen:

NAME                    READY   STATUS    RESTARTS   AGE
curl-649649b6c7-gcc4k   0/1     Running   0          10m