kubernetes: NodeLifecycleController: all pods are marked as not ready, making workload services unavailable during a network partition between master & worker nodes
What happened?
We’ve encountered a failure scenario in production: if the master nodes hosting etcd, apiserver, controller-manager etc. become network partitioned from the worker nodes (say, due to an apiserver load balancer outage or a widespread network infra outage), all the pods hosted in that k8s cluster get marked as not ready, which makes the endpoint controller remove the pods' IP addresses from all services matching them. This causes a widespread outage for all the workload services hosted on that kubernetes cluster, because pods are marked as not ready even though all the containers comprising them are ready.
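For reference, this is roughly how we observe the symptom from a client that can still reach the apiserver (a hedged sketch; my-service is a placeholder name, and the jsonpath query is just one way to inspect the pod Ready condition):
# containers on the workers still show as Running/Ready from the kubelet's last report
$ kubectl get pods -o wide
# but the pod Ready condition flips to False shortly after the partition
$ kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# and the pod IPs drop out of the matching Endpoints object
$ kubectl get endpoints my-service -o yaml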
What did you expect to happen?
We expected only the pods that are actually marked for eviction under the above circumstances to be marked as not ready. That would help us delay, or possibly avoid altogether, a widespread loss of workload service availability.
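For what it's worth, the only knobs we are aware of delay the transition rather than prevent it; a hedged sketch of the relevant kube-controller-manager flags (the values shown are the upstream defaults as we understand them):
# kube-controller-manager flags; raising the grace period only delays the point at
# which nodes, and therefore their pods, are marked as not ready
--node-monitor-period=5s          # how often the node lifecycle controller checks node status
--node-monitor-grace-period=40s   # how long a node may be unresponsive before it is marked NotReady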
How can we reproduce it (as minimally and precisely as possible)?
- Host a few LoadBalancer/ClusterIP services in the k8s cluster.
- Create a network partition between the master & worker nodes (using iptables rules or something similar; see the sketch after this list). The master nodes can still reach each other and are deployed in a stacked topology hosting etcd, apiserver, controller-manager, cloud-controller, scheduler, cni etc.
- Observe that the load balancer external IP address can no longer route traffic to the pods, because their addresses are removed from the Endpoints resource once the node lifecycle controller marks the pods as not ready.
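A minimal sketch of how we set this up, assuming MASTER_CIDR is the subnet of the control-plane nodes and echo is a throwaway workload name (both are placeholders):
# create a workload and a LoadBalancer service to watch, before the partition
$ kubectl create deployment echo --image=nginx --replicas=3
$ kubectl expose deployment echo --port=80 --type=LoadBalancer
# on each worker node, drop all traffic to and from the control-plane subnet
$ MASTER_CIDR=10.0.0.0/24
$ iptables -I INPUT -s "$MASTER_CIDR" -j DROP
$ iptables -I OUTPUT -d "$MASTER_CIDR" -j DROP
# after the node monitor grace period expires, the nodes go NotReady, the pod IPs
# disappear from the Endpoints object, and the external IP stops routing traffic
$ kubectl get endpoints echo -o yaml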
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-13T19:50:45Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ uname -a
Linux xxx-master-1 5.10.0-0.bpo.3-cloud-amd64 #1 SMP Debian 5.10.13-1~bpo10+1 (2021-02-11) x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 30 (21 by maintainers)
I’m not sure I understand why you’d expect only pods marked for eviction to be marked “Not Ready”.
If you have a network partition that completely severs the control plane from worker nodes, it makes sense to me that the node lifecycle controller would label those pods as not ready, since the control plane cannot contact them. Readiness is a control plane concept that indicates whether or not a pod can serve load balancer traffic. If the control plane can’t contact a node, then it can’t load balance requests to that node.
This to me sounds like it’s working as expected. Your workloads will still be running; they just can’t serve network traffic (which makes sense because your network is fully partitioned).
/remove-sig api-machinery
/sig network
/close