kubernetes: Node does not become NotReady with read-only filesystem
What happened:
On our bare-metal clusters, nodes do not become NotReady after the kernel remounts a filesystem read-only, e.g. due to filesystem corruption. As a result, pods are still scheduled on the node but fail to start, because the kubelet cannot create the directories for the pod.
What you expected to happen:
Kubelet should notice that it cannot write to the filesystem and prevent further pods from being scheduled on the node.
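As an illustration of the kind of check meant here, the following is a minimal sketch of a write probe. It is not how the kubelet works today; `is_writable` is a hypothetical helper, and probing with a temporary file is an assumption about what a reasonable check could look like.

```python
import errno
import tempfile


def is_writable(directory: str) -> bool:
    """Return False if creating a file in `directory` fails with EROFS.

    Hypothetical helper for illustration only.
    """
    try:
        # Creating (and automatically removing) a temporary file exercises
        # the same kind of write that a mkdir in this directory would need.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as exc:
        if exc.errno == errno.EROFS:
            # "Read-only file system": the condition described in this issue.
            return False
        # Any other error (missing directory, permissions, ...) is not
        # what this probe is about, so it is passed on to the caller.
        raise
    return True
```

On a node, such a probe would be pointed at the kubelet data directory, for example `is_writable("/var/lib/kubelet")`.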
How to reproduce it (as minimally and precisely as possible):
- install minikube and virtualbox
- minikube start --driver=virtualbox -n 2
$ kubectl get nodes
NAME           STATUS   ROLES                  AGE    VERSION
minikube       Ready    control-plane,master   3m6s   v1.20.2
minikube-m02   Ready    <none>                 119s   v1.20.2
- start a pod on a machine
kubectl get pod
NAME                       READY   STATUS    RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running   0          16s
- ssh into the worker node and trigger an emergency read-only remount (to simulate a filesystem failure), then wait a few minutes
minikube ssh --node minikube-m02
echo u | sudo tee /proc/sysrq-trigger
- the node stays Ready and still gets new pods scheduled on it
kubectl get node
NAME           STATUS   ROLES                  AGE     VERSION
minikube       Ready    control-plane,master   10m     v1.20.2
minikube-m02   Ready    <none>                 9m12s   v1.20.2
kubectl get pod
NAME                       READY   STATUS              RESTARTS   AGE
hello-1-657cb9b9f5-brbf4   1/1     Running             0          8m41s
hello-2-7ddff58f66-6mgbm   0/1     ContainerCreating   0          16s
kubectl describe pod hello-2-7ddff58f66-6mgbm
...
Events:
Type     Reason       Age               From               Message
----     ------       ----              ----               -------
Normal   Scheduled    33s               default-scheduler  Successfully assigned default/hello-2-7ddff58f66-6mgbm to minikube-m02
Warning  Failed       9s (x3 over 33s)  kubelet            error making pod data directories: mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system
Warning  FailedMount  1s (x7 over 33s)  kubelet            MountVolume.SetUp failed for volume "default-token-5fjs5" : mkdir /var/lib/kubelet/pods/b7d540b3-c949-4fad-becc-76743a654467: read-only file system
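To confirm that the emergency remount in the steps above actually took effect on the worker node, one option is to look at the mount options in /proc/mounts. The sketch below parses text in that format and reports whether the mount holding a given path carries the `ro` option. `parse_mounts`, `is_read_only` and the `SAMPLE` text are made up for illustration; the real mount layout of a minikube node may differ, and escaped characters in mount points (such as `\040` for a space) are not handled.

```python
# Hypothetical /proc/mounts content; not taken from a real node.
SAMPLE = """\
/dev/sda1 / ext4 ro,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid 0 0
"""


def parse_mounts(text: str) -> list:
    """Parse /proc/mounts-style text into (mount_point, options) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # Field 2 is the mount point, field 4 the comma-separated options.
        mounts.append((fields[1], set(fields[3].split(","))))
    return mounts


def is_read_only(path: str, mounts: list) -> bool:
    """Return True if the longest mount point containing `path` has `ro`."""
    best = None
    for mount_point, options in mounts:
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the most specific mount; on ties the later entry wins,
            # since it was mounted on top of the earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, options)
    return best is not None and "ro" in best[1]


mounts = parse_mounts(SAMPLE)
print(is_read_only("/var/lib/kubelet/pods", mounts))
print(is_read_only("/tmp/scratch", mounts))
```

On the node itself, the sample text would be replaced by the contents of /proc/mounts, read with `open("/proc/mounts").read()`.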
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): v1.20.2 (also reproducible in latest 1.18.x and 1.19.x)
- Cloud provider or hardware configuration: bare metal
- OS (e.g: cat /etc/os-release): Flatcar Container Linux by Kinvolk 2605.12.0 (Oklo)
- Kernel (e.g. uname -a): 5.4.92-flatcar
About this issue
- State: open
- Created 3 years ago
- Comments: 19 (6 by maintainers)
We could act on the FailedToMakePodDataDirectories and FailedMountVolume events (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1696-L1710) and update the node ReadyCondition so that pods will not be scheduled on the node. But this should only apply for a finite amount of time; otherwise the node would never be allowed to recover.
Not sure if my assessment is on the right track, but if any work needs to be done here, I'd be happy to take it up.
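The time-bounded idea in the comment above can be sketched roughly as follows. This is a toy model and not kubelet code: `ReadOnlyGuard`, its `hold_seconds` window, and the use of the two identifiers as plain strings are all assumptions made for illustration. Mapping them to the reason strings the kubelet actually emits is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Identifiers named in the comment above, used here as plain strings
# (an assumption made for this sketch).
FS_FAILURE_REASONS = {"FailedToMakePodDataDirectories", "FailedMountVolume"}


@dataclass
class ReadOnlyGuard:
    """Hypothetical tracker: report not-ready only while failures are recent."""

    hold_seconds: float = 300.0  # assumed window, not a Kubernetes default
    last_failure: Optional[float] = None

    def observe(self, reason: str, now: float) -> None:
        """Record the time of a filesystem-related failure event."""
        if reason in FS_FAILURE_REASONS:
            self.last_failure = now

    def ready(self, now: float) -> bool:
        """Return False only within `hold_seconds` of the last failure."""
        if self.last_failure is None:
            return True
        return now - self.last_failure >= self.hold_seconds


guard = ReadOnlyGuard(hold_seconds=60)
guard.observe("FailedMountVolume", now=10)
print(guard.ready(now=30))  # still inside the hold window
print(guard.ready(now=70))  # window expired, the node may try again
```

Once the window expires the node accepts pods again. If the filesystem is still read-only, the next failed pod start restarts the window, so the node is never blocked permanently.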