kubernetes: Kubelet hangs due to a broken NFS share
What happened:
- We had an NFS server.
- Some pods were using a share from it.
- The NFS server died.
- The pods were removed forcefully (with --force --grace-period 0).
- Kubelet then got stuck on operations for all other pods. For example, newly created pods are stuck in Pending; if you run systemctl restart kubelet, they change state to ContainerCreating but remain stuck. On the Docker side, however, I can see that these pods are up and running. The next kubelet restart turns them to Running. The same happens with removal: if you run kubectl delete pod, the pod is stuck in Terminating but keeps running on the Docker side; after a kubelet restart, the removal continues.
I presume that kubelet gets stuck on the broken NFS shares, which are still mounted under /var/lib/kubelet/pods/, e.g.:
/var/lib/kubelet/pods/ce591a55-9ad6-4bf2-bcec-ac29fc77fcad/volume-subpaths/opennebula-control-volume/logstash/2
dmesg shows errors:
[11859608.510084] nfs: server 10.36.3.5 not responding, still trying
[11860426.154179] nfs: server 10.36.3.5 not responding, still trying
[11860432.298016] nfs: server 10.36.3.5 not responding, still trying
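To check which NFS mounts under /var/lib/kubelet/pods are no longer answering, something like the following might help. It is only a sketch: it reads /proc/mounts and probes each NFS mount from a child process with a timeout, so the script itself should not hang on a dead share. The function names and the 5 second timeout are my own choices, not anything kubelet provides, and mount points with escaped characters (e.g. \040 for spaces) are not decoded.

```python
import subprocess
import sys

KUBELET_PODS_DIR = "/var/lib/kubelet/pods"  # directory named in the report


def parse_nfs_mounts(mounts_text, prefix=KUBELET_PODS_DIR):
    """Return (source, mount_point) pairs for NFS mounts below prefix."""
    wanted = prefix.rstrip("/") + "/"
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: source, mount point, fs type, options, dump, pass
        if len(fields) < 3:
            continue
        source, mount_point, fstype = fields[:3]
        if fstype in ("nfs", "nfs4") and mount_point.startswith(wanted):
            found.append((source, mount_point))
    return found


def probe_mount(mount_point, timeout_s=5.0):
    """Call statvfs in a child process; return 'ok', 'error' or 'timeout'."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", mount_point],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Do not wait after kill(): a child blocked on NFS may not exit promptly.
        child.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            mounts = parse_nfs_mounts(f.read())
    except OSError:
        mounts = []  # not Linux, or /proc is unavailable
    for source, mount_point in mounts:
        print(probe_mount(mount_point), source, mount_point)
```

A mount reported as "timeout" is a likely candidate for the shares that dmesg complains about; "error" means the probe failed quickly (for example a stale file handle or a missing path).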
What you expected to happen:
Broken NFS-shares should not affect the operation of kubelet for the other pods.
How to reproduce it (as minimally and precisely as possible):
- Create an NFS server and export a directory from it.
- Run a pod that uses this NFS share.
- Stop the NFS server (preferably shut down the whole machine).
- Remove the pod forcefully (with --force --grace-period 0).
- Restart kubelet.
- Try to create new pods.
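For the "run a pod using this NFS share" step, a manifest along these lines could be used. This is a sketch, not the manifest from the affected cluster: the pod name, the busybox image, the mount path and the /export path are placeholders, and only the server address 10.36.3.5 is taken from the dmesg output above. The path in the report sits under volume-subpaths, so the original pods probably also used subPath, which this sketch leaves out.

```python
import json


def nfs_pod_manifest(server, export_path, name="nfs-repro"):
    """Build a minimal Pod manifest that mounts an NFS export at /data."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},  # hypothetical pod name
        "spec": {
            "containers": [
                {
                    "name": "app",
                    "image": "busybox",  # placeholder image
                    "command": ["sleep", "3600"],
                    "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                }
            ],
            "volumes": [
                {"name": "data", "nfs": {"server": server, "path": export_path}}
            ],
        },
    }


if __name__ == "__main__":
    # "/export" is an assumed export path; replace it with the real one.
    print(json.dumps(nfs_pod_manifest("10.36.3.5", "/export"), indent=2))
```

Since kubectl accepts JSON, the output can be piped into kubectl apply -f -, and after stopping the NFS server the pod can be removed with kubectl delete pod nfs-repro --force --grace-period 0.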
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): v1.18.2 and v1.17.5
- Cloud provider or hardware configuration: bare metal
- OS (e.g: cat /etc/os-release): Debian GNU/Linux 9 (stretch)
- Kernel (e.g. uname -a): 4.15.18-9-pve
- Install tools: kubeadm
- Network plugin and version (if this is a network-related bug): kube-router
- Others:
/sig node
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 22 (10 by maintainers)
@kvaps When you say the NFS server died, do you mean that it went offline and never recovered? If so, can you run lsof and see if there are stale open files on that mount point?
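If lsof itself blocks on the dead mount (its -b option is meant to avoid blocking kernel calls), a rough alternative is to read the file descriptor symlinks under /proc directly. The sketch below assumes Linux and root privileges to see other processes' descriptors; the function name and the proc_root parameter are made up for illustration. It only reads the symlink targets and never opens or stats the files, so it should not need to contact the NFS server.

```python
import os


def open_files_under(mount_point, proc_root="/proc"):
    """Return (pid, path) pairs for open files at or below mount_point."""
    base = mount_point.rstrip("/")
    prefix = base + "/"
    hits = []
    try:
        pids = [d for d in os.listdir(proc_root) if d.isdigit()]
    except OSError:
        return hits  # /proc is unavailable
    for pid in pids:
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or permission denied
        for fd in fds:
            try:
                # readlink returns the recorded path without touching the file
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target == base or target.startswith(prefix):
                hits.append((int(pid), target))
    return hits


if __name__ == "__main__":
    for pid, path in open_files_under("/var/lib/kubelet/pods"):
        print(pid, path)
```

Any process listed here still holds a file on the share and may be what keeps the mount busy.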