kubernetes: All Pods on an unreachable node are marked NotReady once the node turns Unknown

What happened: We hit a corner case in our Kubernetes cluster:

  1. A network partition occurred between the kubelets and the apiserver: ALL our nodes were unable to report to the apiserver (a misconfiguration of the load balancer sitting between the kubelets and the apiserver).
  2. As expected, after 40 seconds (the kube-controller-manager --node-monitor-grace-period default), the status of every node turned to Unknown.
  3. However, at the same time, every pod on every node was marked NotReady. The number of available endpoints dropped to zero for every Service in the cluster, 100% of traffic suddenly failed at our ingress, and 100% of DNS queries failed inside the cluster.
  4. As expected, after 5 minutes (the --pod-eviction-timeout default), some of the pods started being evicted. Eviction then stopped, however, because this was a full-cluster outage.

What you expected to happen:

One of the following two options:

  1. Pods should NOT be marked NotReady when the Node condition turns Unknown. Pods should be marked unready only when they are actually being evicted from the cluster.

The documentation mentions the following corner case:

The corner case is when all zones are completely unhealthy (i.e. there are no healthy nodes in the cluster). In such case, the node controller assumes that there’s some problem with master connectivity and stops all evictions until some connectivity is restored. https://kubernetes.io/docs/concepts/architecture/nodes/

  2. OR, the documentation should be updated to explain this behaviour.

How to reproduce it (as minimally and precisely as possible):

  • Stop kubelet on one of the nodes
  • Wait 40 seconds
  • Watch all the pods running on that node be marked as NotReady (and removed from the Service endpoints); a small watcher sketch follows this list
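
To make the last step easy to observe during the repro, here is a small client-go watcher that prints every change of the pods' Ready condition. It is only a sketch: the kubeconfig path and namespace are assumptions, and it targets a recent client-go API rather than the exact cluster versions listed below.

```go
// watch-ready: print Ready-condition changes for pods in a namespace.
// Illustrative sketch only; adjust the kubeconfig path and namespace as needed.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig location for this sketch.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch pods in the "default" namespace (assumption) and print the Ready
	// condition every time a pod object changes.
	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady {
				fmt.Printf("%s %s Ready=%s reason=%q\n",
					event.Type, pod.Name, cond.Status, cond.Reason)
			}
		}
	}
}
```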

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.11, 1.12, 1.13, 1.14
  • Cloud provider or hardware configuration: NA
  • OS (e.g: cat /etc/os-release): NA
  • Kernel (e.g. uname -a): NA
  • Install tools: NA
  • Network plugin and version (if this is a network-related bug): NA
  • Others:


Most upvoted comments

@chaudyg thanks for looping me in, please submit the full postmortem on https://github.com/hjacobs/kubernetes-failure-stories when ready 👏

@chaudyg thanks for the additional info!

I agree with you that we should look at exposing some form of parameter giving the cluster admin control over how quickly the pods are marked NotReady. We could set a sensible default (perhaps even what exists now).
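
Purely as a strawman for what such a parameter could look like (neither the flag name nor the behaviour exists today), it could be a duration knob on the controller manager:

```go
// Strawman only: the flag name and behaviour below are hypothetical.
// The idea is a controller-manager duration knob controlling how long a node
// may stay Unknown before its pods are marked NotReady (0s = today's
// behaviour, i.e. immediately).
package main

import (
	"flag"
	"fmt"
)

func main() {
	gracePeriod := flag.Duration("pod-not-ready-grace-period", 0,
		"how long a node may be Unknown before its pods are marked NotReady (hypothetical)")
	flag.Parse()
	fmt.Printf("pods on an Unknown node would be marked NotReady after %s\n", *gracePeriod)
}
```

With a default of 0s (today's behaviour) nothing would change for existing clusters, while admins who prefer to keep endpoints during short apiserver outages could raise it.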

Curious to hear from @derekwaynecarr when he gets a second 😃

Thx @mattjmcnaughton for your reply.

It seems like intentional behaviour, but I am failing to grasp the full rationale behind it.

What I am convinced of is that having the apiserver unreachable for 40+ seconds should NOT result in a full cluster outage.

There are 2 scenarios in this part of the code:

  • The node turns NotReady. In this case I can understand why marking pods NotReady quickly makes sense: the controller knows the underlying node is unhealthy and assumes the pods on it won’t perform as expected. I would recommend we document this behaviour.
  • The node turns Unknown. In this case the controller cannot make an informed decision. The overall rationale in the controller in an “unknown” scenario seems to be to act slowly and carefully: it first keeps the status quo by granting the node a 5m grace period, then starts recreating some of the pods slowly (the eviction-rate flags), but only if some nodes are still healthy. If none of the nodes are ready, it won’t make any decision. That feels like a sensible approach; removing all the endpoints immediately feels like a rushed decision by comparison. I would recommend we don’t mark the pods NotReady in this case (a rough sketch of the asymmetry follows below).
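
To make the asymmetry concrete, here is a rough sketch in Go of the behaviour described above. Every name is made up for illustration; this is not the actual node lifecycle controller code.

```go
// Illustrative-only sketch of the behaviour described above; the types and
// functions here are invented and do not correspond to the real controller.
package main

import "time"

type Condition int

const (
	ConditionTrue Condition = iota
	ConditionFalse
	ConditionUnknown
)

type Node struct {
	Ready        Condition
	UnknownSince time.Time
}

const podEvictionTimeout = 5 * time.Minute // --pod-eviction-timeout default

func markPodsNotReady(n *Node) { /* pods drop out of Service endpoints */ }
func evictPodsSlowly(n *Node)  { /* rate-limited by --node-eviction-rate */ }

// handleNodeReadyCondition sketches the two branches discussed above.
func handleNodeReadyCondition(n *Node, anyHealthyNodes bool) {
	switch n.Ready {
	case ConditionFalse:
		// Kubelet reported an unhealthy node: removing endpoints quickly is
		// understandable here.
		markPodsNotReady(n)
	case ConditionUnknown:
		// Kubelet simply stopped reporting: today pods are still marked
		// NotReady immediately, which is what caused the outage above.
		markPodsNotReady(n)
		// Eviction, by contrast, is deliberate: wait out the grace period and
		// do nothing if no node in the cluster is healthy.
		if time.Since(n.UnknownSince) > podEvictionTimeout && anyHealthyNodes {
			evictPodsSlowly(n)
		}
	}
}

func main() {}
```

The point of the sketch is that both branches mark pods NotReady immediately, while only eviction is gated on the grace period and on overall cluster health.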