kubernetes: All Pods on an unreachable node are marked NotReady as soon as the node turns Unknown
What happened: We hit a corner case in our Kubernetes cluster:
- A network partition between the kubelets and the apiserver: ALL our nodes were unable to report to the apiserver (a misconfiguration of the loadbalancer between the kubelets and the apiserver).
- As expected, after 40 seconds, every node's status turned to Unknown.
- However, at the same time, all the pods on all the nodes were marked NotReady. The number of available endpoints dropped to zero for every service running in the cluster: 100% of traffic suddenly failed at our ingress, and 100% of DNS queries failed inside the cluster.
- As expected, after 5 minutes, some of the pods started being evicted. However, eviction then stopped because this was a full cluster outage. (The defaults behind the 40-second and 5-minute marks are sketched below.)
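Where those two timings come from, assuming the kube-controller-manager defaults (the flags are real; verify the values against your own cluster's configuration):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// kube-controller-manager defaults:
	//   --node-monitor-grace-period=40s  (node Ready condition becomes Unknown)
	//   --pod-eviction-timeout=5m0s      (grace period before evicting pods on failed nodes)
	nodeMonitorGracePeriod := 40 * time.Second
	podEvictionTimeout := 5 * time.Minute

	fmt.Printf("nodes turn Unknown after %v without status updates\n", nodeMonitorGracePeriod)
	fmt.Printf("pod evictions begin roughly %v after that\n", podEvictionTimeout)
}
```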
What you expected to happen:
One of the following two options:
- Pod readiness should NOT be set to NotReady when the node condition turns Unknown; pods should be marked unready only when they are actually being evicted from the cluster.
  The documentation mentions the following corner case (the check it describes is sketched after this list):
  > The corner case is when all zones are completely unhealthy (i.e. there are no healthy nodes in the cluster). In such case, the node controller assumes that there's some problem with master connectivity and stops all evictions until some connectivity is restored. (https://kubernetes.io/docs/concepts/architecture/nodes/)
- OR, the documentation should be updated to explain this behaviour.
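For reference, a minimal plain-Go sketch of the documented corner case (stand-in names and types, not the controller's actual code): when every zone is fully disrupted, evictions are suspended.

```go
package main

import "fmt"

type zoneState string

const (
	stateNormal         zoneState = "Normal"
	stateFullDisruption zoneState = "FullDisruption"
)

// allZonesFullyDisrupted mirrors the documented corner case: if no zone has
// a single healthy node, assume a master connectivity problem and suspend
// all evictions until connectivity returns.
func allZonesFullyDisrupted(zones map[string]zoneState) bool {
	if len(zones) == 0 {
		return false
	}
	for _, state := range zones {
		if state != stateFullDisruption {
			return false
		}
	}
	return true
}

func main() {
	zones := map[string]zoneState{"zone-a": stateFullDisruption, "zone-b": stateFullDisruption}
	if allZonesFullyDisrupted(zones) {
		fmt.Println("assuming master connectivity problem: stopping all evictions")
		fmt.Println("note: pods were already marked NotReady before this check")
	}
}
```

The asymmetry this issue is about: eviction has this safety valve for full-cluster disruption, while the marking of pods as NotReady does not.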
How to reproduce it (as minimally and precisely as possible):
- Stop kubelet on one of the nodes
- Wait 40 seconds
- Watch all the pods running on that node being marked as NotReady (and removed from the service endpoints); a client-go sketch for observing this is below
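To make step 3 concrete, a minimal sketch using a recent client-go (assumptions: a kubeconfig at the default path and a node named `node-1`; both are placeholders to adjust) that watches the pods on the stopped node and prints each Ready-condition change:

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch every pod scheduled on the partitioned node, across all namespaces.
	w, err := clientset.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=node-1", // placeholder node name
	})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		// Print the pod's Ready condition; it flips to False with reason
		// NodeNotReady about 40s after kubelet stops reporting.
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady {
				fmt.Printf("%s/%s Ready=%s reason=%s\n",
					pod.Namespace, pod.Name, cond.Status, cond.Reason)
			}
		}
	}
}
```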
Anything else we need to know?:
- Line of code (its effect is paraphrased below): https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L764-L769
- End2End test: https://github.com/kubernetes/kubernetes/blob/ee0038adaa9a316a26e435353f629ea4af4b46f1/test/e2e/apps/network_partition.go#L137
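Since the linked lines will drift as master moves, here is a plain-Go paraphrase of their effect (stand-in names and simplified types, not the controller's actual code):

```go
package main

import "fmt"

type conditionStatus string

const (
	conditionTrue    conditionStatus = "True"
	conditionUnknown conditionStatus = "Unknown"
)

// markPodsIfNodeLeftReady paraphrases the linked lines: the moment a node's
// Ready condition is observed leaving True -- whether it went to False or
// merely to Unknown, as in a network partition -- every pod on that node has
// its Ready condition set to False (reason NodeNotReady).
func markPodsIfNodeLeftReady(observed, current conditionStatus, pods []string) {
	if current != conditionTrue && observed == conditionTrue {
		for _, pod := range pods {
			fmt.Printf("marking pod %s NotReady (reason: NodeNotReady)\n", pod)
		}
	}
}

func main() {
	// Ready=True -> Ready=Unknown: pods are marked NotReady immediately,
	// long before the 5-minute eviction timeout is consulted.
	markPodsIfNodeLeftReady(conditionTrue, conditionUnknown,
		[]string{"default/web-1", "kube-system/coredns-abc12"})
}
```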
Environment:
- Kubernetes version (use `kubectl version`): 1.11, 1.12, 1.13, 1.14
- Cloud provider or hardware configuration: NA
- OS (e.g: `cat /etc/os-release`): NA
- Kernel (e.g. `uname -a`): NA
- Install tools: NA
- Network plugin and version (if this is a network-related bug): NA
- Others:
About this issue
- State: closed
- Created 5 years ago
- Comments: 18 (7 by maintainers)
@chaudyg thanks for looping me in, please submit the full postmortem on https://github.com/hjacobs/kubernetes-failure-stories when ready 👏
@chaudyg thanks for the additional info!
I agree with you that we should look at exposing some form of parameter giving the cluster admin control over how quickly the pods are marked NotReady. We could set a sensible default (perhaps even what exists now). Curious to hear from @derekwaynecarr when he gets a second 😃
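To make that proposal concrete, one purely hypothetical shape for such a knob (neither the flag nor the type below exists in Kubernetes):

```go
package options

import "time"

// NodeLifecycleOptions is a hypothetical sketch, not real Kubernetes code.
type NodeLifecycleOptions struct {
	// PodNotReadyGracePeriod would delay marking a node's pods NotReady
	// after the node turns Unknown. A zero value would preserve today's
	// immediate behavior, e.g. --pod-not-ready-grace-period=0s (invented flag).
	PodNotReadyGracePeriod time.Duration
}
```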
Thanks @mattjmcnaughton for your reply.
It seems like intentional behaviour, but I am failing to grasp the full rationale behind it.
What I am convinced of is that having the apiserver unreachable for 40+ seconds should NOT result in a full cluster outage.
There are 2 scenarios in this part of the code: