kubernetes: Node does not become NotReady with read only filesystem

What happened:

We noticed on our bare-metal clusters that nodes do not become NotReady after the kernel remounts a filesystem read-only, e.g. due to filesystem corruption. As a result, pods are still scheduled onto the node but fail to start because the kubelet cannot create directories for the pod.

What you expected to happen:

Kubelet should notice that it cannot write to the filesystem and prevent further pods from being scheduled on the node.
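
As an illustration of the kind of check meant here, the following is a minimal sketch of a write probe in Go. It is not kubelet code: `probeWritable` is a hypothetical helper, and the probed directory is only a stand-in for the kubelet root directory. It tries to create and remove a temporary file and treats EROFS ("read-only file system", the error shown in the events below) as the read-only signal, separate from other failures.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// probeWritable is a hypothetical helper (not part of the kubelet).
// It creates and removes a temporary file in dir.
//   (true, nil)   the directory is writable
//   (false, nil)  the filesystem is mounted read-only (EROFS)
//   (false, err)  any other failure, e.g. the directory does not exist
func probeWritable(dir string) (bool, error) {
	f, err := os.CreateTemp(dir, ".rw-probe-*")
	if err != nil {
		if errors.Is(err, syscall.EROFS) {
			return false, nil
		}
		return false, err
	}
	name := f.Name()
	f.Close()
	// Clean up the probe file; a failure here is reported to the caller.
	return true, os.Remove(name)
}

func main() {
	// Stand-in directory; on a node this would be the kubelet root
	// directory (for example /var/lib/kubelet), which is an assumption here.
	dir := os.TempDir()
	ok, err := probeWritable(dir)
	fmt.Printf("dir=%s writable=%v err=%v\n", dir, ok, err)
}
```

Keeping EROFS separate from other errors matters because a missing directory or a permission problem would call for a different reaction than a kernel remount.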

How to reproduce it (as minimally and precisely as possible):

install minikube and virtualbox
minikube start --driver=virtualbox -n 2

$ kubectl get nodes
NAME           STATUS   ROLES                  AGE    VERSION
minikube       Ready    control-plane,master   3m6s   v1.20.2
minikube-m02   Ready    <none>                 119s   v1.20.2

start a pod on a machine

kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running   0          16s

ssh into the worker node and trigger an emergency read-only remount (simulating a filesystem failure), then wait a few minutes

minikube ssh --node minikube-m02
echo u | sudo tee /proc/sysrq-trigger
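
To confirm that the remount took effect, the mount options can be inspected. The sketch below parses text in `/proc/mounts` format and reports whether a given mount point carries the `ro` option. `mountIsReadOnly` is a hypothetical helper, the sample data is made up for illustration, and it assumes the path of interest is itself a mount point (otherwise the enclosing mount has to be checked).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mountIsReadOnly scans text in /proc/mounts format (device, mount point,
// fstype, options, ...) and reports whether the entry for mountPoint has
// the "ro" option. found is false when no entry matches. If several
// entries match, the last one wins, since later mounts shadow earlier ones.
func mountIsReadOnly(mounts, mountPoint string) (readOnly, found bool) {
	sc := bufio.NewScanner(strings.NewReader(mounts))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 || fields[1] != mountPoint {
			continue
		}
		found = true
		readOnly = false
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				readOnly = true
			}
		}
	}
	return readOnly, found
}

func main() {
	// Made-up sample, not taken from the affected node. On a real machine
	// the input would come from os.ReadFile("/proc/mounts").
	const sample = "/dev/sda1 / ext4 rw,relatime 0 0\n" +
		"/dev/sda2 /var/lib/kubelet ext4 ro,relatime 0 0\n"
	ro, found := mountIsReadOnly(sample, "/var/lib/kubelet")
	fmt.Printf("found=%v readOnly=%v\n", found, ro)
}
```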

the node stays Ready and new pods are still scheduled onto it

kubectl get node
NAME           STATUS   ROLES                  AGE     VERSION
minikube       Ready    control-plane,master   10m     v1.20.2
minikube-m02   Ready    <none>                 9m12s   v1.20.2

kubectl get pod
NAME                       READY   STATUS              RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running             0          8m41s
hello-2-7ddff58f66-6mgbm   0/1     ContainerCreating   0          16s

kubectl describe pod hello-2-7ddff58f66-6mgbm
...
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Normal   Scheduled    33s               default-scheduler  Successfully assigned default/hello-2-7ddff58f66-6mgbm to minikube-m02
  Warning  Failed       9s (x3 over 33s)  kubelet            error making pod data directories: mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system
  Warning  FailedMount  1s (x7 over 33s)  kubelet            MountVolume.SetUp failed for volume "default-token-5fjs5" : mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version): v1.20.2 (also reproducible in latest 1.18.x and 1.19.x)
Cloud provider or hardware configuration: bare metal
OS (e.g: cat /etc/os-release): Flatcar Container Linux by Kinvolk 2605.12.0 (Oklo)
Kernel (e.g. uname -a): 5.4.92-flatcar

About this issue

State: open
Created 3 years ago
Comments: 19 (6 by maintainers)

Most upvoted comments

We could act on the FailedToMakePodDataDirectories and FailedMountVolume events (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1696-L1710) and update the node ReadyCondition so that pods are not scheduled onto the node. But this should only last for a finite amount of time; otherwise the node would never be allowed to recover.

Not sure if my assessment is on the right track, but if any work needs to be done here, I’d be happy to take it up 😃

lyzs90 on May 25, 2021
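
The "finite amount of time" idea from the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how the kubelet computes node conditions: `fsFailureGate`, `holdFor`, and the five-minute window are all hypothetical. The gate reports not-ready for a bounded window after the most recent write failure and then lets the node be considered again, so a recovered node is not blocked forever.

```go
package main

import (
	"fmt"
	"time"
)

// fsFailureGate is a hypothetical helper, not a kubelet type. It remembers
// the most recent filesystem write failure and reports not-ready only while
// that failure is younger than holdFor.
type fsFailureGate struct {
	holdFor     time.Duration
	lastFailure time.Time
}

// RecordFailure notes that a write failure (for example an EROFS error
// while creating pod data directories) was observed at time now.
func (g *fsFailureGate) RecordFailure(now time.Time) {
	g.lastFailure = now
}

// Ready reports whether the node may be considered ready at time now.
func (g *fsFailureGate) Ready(now time.Time) bool {
	if g.lastFailure.IsZero() {
		return true // no failure seen so far
	}
	return now.Sub(g.lastFailure) >= g.holdFor
}

// demo walks through three points in time: before any failure, one minute
// after a failure, and six minutes after it (window is five minutes).
func demo() [3]bool {
	start := time.Date(2021, 5, 25, 12, 0, 0, 0, time.UTC)
	g := &fsFailureGate{holdFor: 5 * time.Minute}
	before := g.Ready(start)
	g.RecordFailure(start)
	during := g.Ready(start.Add(1 * time.Minute))
	after := g.Ready(start.Add(6 * time.Minute))
	return [3]bool{before, during, after}
}

func main() {
	fmt.Println(demo()) // prints [true false true]
}
```

Because every new failure resets the window, a node whose filesystem stays read-only would remain not-ready as long as pod starts keep failing, while a node that recovers becomes schedulable again once the window has passed.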