kubernetes: kubernetes reliability/reactivity on node failure does not work properly
When a node fails, the default behavior looks like this:
- The kubelet updates its status to the apiserver periodically, as specified by --node-status-update-frequency. The default value is 10s.
- The Kubernetes controller manager checks the kubelet statuses every --node-monitor-period. The default value is 5s.
- If the status was updated within --node-monitor-grace-period, the controller manager considers the kubelet healthy. The default value is 40s.
- After --node-monitor-grace-period it considers the node unhealthy and removes its pods based on --pod-eviction-timeout. The default value is 5m0s.
I have tried with --node-status-update-frequency (passed to the kubelet daemon via KUBELET_EXTRA_ARGS) set to 4s (10s is the default), --node-monitor-period set to 2s (5s is the default), --node-monitor-grace-period set to 20s (40s is the default), and --pod-eviction-timeout set to 30s (5m is the default). The last three settings were made on the kube-controller-manager (/etc/kubernetes/manifests/kube-controller-manager.yaml).
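For reference, a minimal sketch of where the controller-manager flags end up on a kubeadm-based cluster; the surrounding manifest fields are the standard kubeadm layout and only the three added arguments matter here:

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (sketch, other flags omitted)
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.13.1
    command:
    - kube-controller-manager
    - --node-monitor-period=2s
    - --node-monitor-grace-period=20s
    - --pod-eviction-timeout=30s
    # ...existing kubeadm-generated flags unchanged...
```

The kubelet side (--node-status-update-frequency=4s) was passed through KUBELET_EXTRA_ARGS, which on a CentOS/kubeadm install is typically set in /etc/sysconfig/kubelet.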
What happened: Pods are evicted after 5m, as if no configuration had been provided.
What you expected to happen: In this scenario pods should be evicted in about 50s, because the node is considered down after 20s and --pod-eviction-timeout kicks in 30s later.
How to reproduce it (as minimally and precisely as possible): Apply a configuration like the one above, for example: --node-status-update-frequency set to 4s (kubelet), --node-monitor-period to 2s, --node-monitor-grace-period to 20s, --pod-eviction-timeout to 30s.
Run a pod on a worker node, then brutally shut that node down while the pod is running. Wait: after about 5 min (not 50s) the pod is restarted on another worker.
Anything else we need to know?: With exactly the same configuration, software stack, hardware and test procedure, k8s v1.12.4 works very well: pods are evicted in about 50s.
With both k8s v1.12.4 and v1.13.1 the controller-manager containers were inspected (with the docker command) and always showed the configuration provided: --node-monitor-period=2s, --node-monitor-grace-period=20s, --pod-eviction-timeout=30s.
Environment:
- Kubernetes version (use kubectl version): 1.13.1
- Cloud provider or hardware configuration: on-premises installation
- OS (e.g. from /etc/os-release): CentOS Linux 7
- Kernel (e.g. uname -a): Linux 3.10.0-957.1.3.el7.x86_64
- Install tools: none
- Others: multi-master (3 node) configuration with internal etcd, 3 worker nodes and an external load balancer.
/kind bug
I "simply" believe that in the transition between 1.12 and 1.13 a bug was introduced in the handling of the monitor-period options. It would be useful to know a roadmap for solving this important problem!
UP!
Regarding the pod-eviction-timeout issue, I was also facing the same issue with k8s 1.15.0 and then read about the change as mentioned below.
“In version 1.13, the TaintBasedEvictions feature is promoted to beta and enabled by default, hence the taints are automatically added by the NodeController (or kubelet) and the normal logic for evicting pods from nodes based on the Ready NodeCondition is disabled”
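In other words, with taint-based evictions the timing is no longer driven by --pod-eviction-timeout but by the NoExecute tolerations that the DefaultTolerationSeconds admission plugin adds to every pod at creation time. A pod that does not declare them explicitly ends up with roughly the following; the 300s default matches the ~5m eviction observed above:

```yaml
# Tolerations automatically added to a pod spec (sketch; inspect your own pods with `kubectl get pod <name> -o yaml`)
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```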
I got it working by setting the flags as below (https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/):
--default-not-ready-toleration-seconds=30 --default-unreachable-toleration-seconds=30
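A sketch of where those two flags would go on a kubeadm cluster (the path and surrounding fields are assumed from a standard kubeadm layout; only the two added arguments matter):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (sketch, other flags omitted)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30
    # ...existing kubeadm-generated flags unchanged...
```

Note that only pods created after the change pick up the new tolerationSeconds, since the admission plugin stamps the tolerations onto the pod spec at creation time.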
The pod-eviction-timeout only took effect after setting TaintBasedEvictions to false (- --feature-gates=TaintBasedEvictions=false), because the default value has been true since 1.13. The feature gates reference lists all features and their default values for each release.
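Where exactly the gate is toggled is an assumption on my part, but since the node lifecycle controller that applies the taints runs in the kube-controller-manager, a sketch of switching it off there could look like:

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (sketch, other flags omitted)
    command:
    - kube-controller-manager
    - --feature-gates=TaintBasedEvictions=false
    - --pod-eviction-timeout=30s
    # ...other flags unchanged...
```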
So I think the new TaintBasedEvictions feature has broken the legacy pod-eviction functionality. But I still have a problem with TaintBasedEvictions. In fact, my new kubelet extra args configuration is:
I have this Deployment:
Everything seems to work well, but I realized that the state of my nodes is constantly oscillating.
In fact, on every node (3 masters, 2 workers, installed with kubeadm) the kubelet logs (journalctl -u kubelet -f) show this error:
Mar 01 12:01:36 master-servers-2 kubelet[25763]: E0301 12:01:36.252570 25763 controller.go:115] failed to ensure node lease exists, will retry in 7s, error: leases.coordination.k8s.io "master-servers-2" is forbidden: User "system:node:master-servers-2" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": disabled by feature gate NodeLease
Permission issue?!
Also the documentation states that: “When node lease feature is enabled, each node has an associated Lease object in kube-node-lease namespace that is renewed by the node periodically, and both NodeStatus and node lease are treated as heartbeats from the node.”
But I do not have a kube-node-lease namespace. 😦
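A minimal sketch, assuming the root cause is a feature-gate mismatch (the kubelet trying to use node leases while the API server has NodeLease disabled): enabling the gate on both sides, or disabling it on the kubelet, should make the error and the missing kube-node-lease namespace consistent.

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (sketch, other flags omitted)
    command:
    - kube-apiserver
    - --feature-gates=NodeLease=true
    # ...other flags unchanged...
```

```yaml
# Kubelet side (sketch; e.g. /var/lib/kubelet/config.yaml on a kubeadm install)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  NodeLease: true
```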
When a node is considered failed, its pods are still considered healthy. Why not mark the pods of a NotReady node as unhealthy when the node is set to NotReady?