kubernetes: Node does not become NotReady with read-only filesystem

What happened:

On our bare-metal clusters, we noticed that nodes do not become NotReady after the kernel remounts a filesystem read-only, e.g. due to filesystem corruption. As a result, pods are still scheduled onto the node but fail to start, because the kubelet cannot create the pod's directories.

What you expected to happen:

Kubelet should notice that it cannot write to the filesystem and prevent further pods from being scheduled on the node.

How to reproduce it (as minimally and precisely as possible):

  1. Install minikube and VirtualBox.
  2. Start a two-node cluster and check that both nodes are Ready:
$ minikube start --driver=virtualbox -n 2
$ kubectl get nodes
NAME           STATUS   ROLES                  AGE    VERSION
minikube       Ready    control-plane,master   3m6s   v1.20.2
minikube-m02   Ready    <none>                 119s   v1.20.2
  3. Start a pod on a machine:
$ kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running   0          16s
  4. SSH into the worker node and trigger an emergency read-only remount (to simulate a filesystem failure), then wait a few minutes:
$ minikube ssh --node minikube-m02
$ echo u | sudo tee /proc/sysrq-trigger
  5. The node stays Ready and attracts new pods, which then fail to start:
$ kubectl get node
NAME           STATUS   ROLES                  AGE     VERSION
minikube       Ready    control-plane,master   10m     v1.20.2
minikube-m02   Ready    <none>                 9m12s   v1.20.2

$ kubectl get pod
NAME                       READY   STATUS              RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running             0          8m41s
hello-2-7ddff58f66-6mgbm   0/1     ContainerCreating   0          16s

$ kubectl describe pod hello-2-7ddff58f66-6mgbm
...
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Normal   Scheduled    33s               default-scheduler  Successfully assigned default/hello-2-7ddff58f66-6mgbm to minikube-m02
  Warning  Failed       9s (x3 over 33s)  kubelet            error making pod data directories: mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system
  Warning  FailedMount  1s (x7 over 33s)  kubelet            MountVolume.SetUp failed for volume "default-token-5fjs5" : mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.20.2 (also reproducible in latest 1.18.x and 1.19.x)
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g: cat /etc/os-release): Flatcar Container Linux by Kinvolk 2605.12.0 (Oklo)
  • Kernel (e.g. uname -a): 5.4.92-flatcar

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 19 (6 by maintainers)

Most upvoted comments

We could act on the FailedToMakePodDataDirectories and FailedMountVolume events (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1696-L1710) and update the node ReadyCondition so that no further pods are scheduled on the node. This should only last for a finite amount of time, though; otherwise the node would never be allowed to recover.

Not sure if my assessment is on the right track, but if any work needs to be done here, I’d be happy to take it up 😃