kubernetes: Kubelet hangs due to a broken NFS share

What happened:

  • We had an NFS server.
  • Some pods were using a share from it.
  • The NFS server died.
  • The pods were removed forcefully (with --force --grace-period 0).
  • Kubelet then got stuck on operations for all other pods. For example, newly created pods stay in Pending. After systemctl restart kubelet they move to ContainerCreating but remain stuck there, although on the Docker side I can see that these pods are up and running; the next kubelet restart turns them to Running. Deletion behaves the same way: kubectl delete pod hangs in Terminating while the pod keeps running on the Docker side, and the deletion only proceeds after another kubelet restart.

I presume that kubelet gets stuck on the broken NFS shares, which are still mounted under /var/lib/kubelet/pods/, e.g.:

/var/lib/kubelet/pods/ce591a55-9ad6-4bf2-bcec-ac29fc77fcad/volume-subpaths/opennebula-control-volume/logstash/2

dmesg shows errors:

[11859608.510084] nfs: server 10.36.3.5 not responding, still trying
[11860426.154179] nfs: server 10.36.3.5 not responding, still trying
[11860432.298016] nfs: server 10.36.3.5 not responding, still trying
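
To confirm which mounts are affected, something like the following could be run on the node. It is only a minimal sketch: the function names are made up for illustration, it assumes the NFS mounts appear in /proc/mounts with type nfs or nfs4, and it assumes a timeout command that accepts -s KILL (as in GNU coreutils). Each mount is probed with a time-limited stat so that the check itself is less likely to hang.

```shell
#!/bin/sh
# Hypothetical helpers, not part of kubelet or kubectl.

# Print NFS mount points located under PREFIX.
# $1 = path prefix, $2 = mounts table (default /proc/mounts).
list_nfs_mounts() {
    awk -v p="$1" \
        '($3 == "nfs" || $3 == "nfs4") && index($2, p) == 1 { print $2 }' \
        "${2:-/proc/mounts}"
}

# Succeeds only if stat on the mount point finishes within the time limit.
# $1 = mount point, $2 = timeout in seconds (default 5).
probe_mount() {
    timeout -s KILL "${2:-5}" stat "$1" >/dev/null 2>&1
}

# Report every NFS mount under the kubelet pods directory.
# BAD means stat timed out or failed for any other reason.
check_kubelet_nfs_mounts() {
    list_nfs_mounts "${1:-/var/lib/kubelet/pods}" "${2:-/proc/mounts}" |
    while IFS= read -r mp; do
        if probe_mount "$mp"; then
            echo "ok   $mp"
        else
            echo "BAD  $mp"
        fi
    done
}
```

Running check_kubelet_nfs_mounts as root prints one line per NFS mount; mounts reported as BAD are candidates for the ones kubelet is blocked on.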

What you expected to happen:

Broken NFS-shares should not affect the operation of kubelet for the other pods.

How to reproduce it (as minimally and precisely as possible):

  • Create an NFS server and export a directory from it.
  • Run a pod that uses this NFS share.
  • Stop the NFS server (preferably shut down the whole machine).
  • Remove the pod forcefully (with --force --grace-period 0).
  • Restart kubelet.
  • Try to create new pods.
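
The last four steps can be sketched as shell functions. The pod names, the busybox image, and the Debian service name nfs-kernel-server are assumptions for illustration, not taken from the affected cluster; setting up the NFS export and the pod that mounts it is left out. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Hypothetical placeholders; adjust to the actual setup.
POD=${POD:-nfs-test}               # pod that mounts the NFS share
NEWPOD=${NEWPOD:-after-nfs-failure} # unrelated pod created afterwards

# Echo the command, and execute it unless DRY_RUN is set.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# On the NFS server (service name assumed for Debian).
stop_nfs_server() {
    run systemctl stop nfs-kernel-server
}

# From a machine with cluster access.
force_delete_pod() {
    run kubectl delete pod "$POD" --force --grace-period=0
}

# On the node that still has the NFS share mounted.
restart_kubelet() {
    run systemctl restart kubelet
}

# Create a new pod and show its status.
create_new_pod() {
    run kubectl run "$NEWPOD" --image=busybox --restart=Never -- sleep 3600
    run kubectl get pod "$NEWPOD" -o wide
}
```

The functions are meant to be called one at a time on the appropriate host, in the order listed. If the problem reproduces, the new pod is expected to stay in Pending or ContainerCreating when it lands on the affected node.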

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.18.2 and v1.17.5
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g: cat /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): 4.15.18-9-pve
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): kube-router
  • Others:

/sig node

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 22 (10 by maintainers)

Most upvoted comments

@kvaps When you say the NFS server died, do you mean that it went offline and never recovered? If so, can you run lsof and see if there are stale open files on that mount point?
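
Since plain lsof may itself block on an unresponsive mount, one possible alternative is to read the file descriptor symlinks in /proc directly. This is only a sketch with a made-up function name; it lists open file descriptors (not working directories or memory maps) whose target lies under a given path.

```shell
#!/bin/sh
# Print "PID TARGET" for every open file descriptor whose target is under PREFIX.
# Only the /proc/<pid>/fd symlinks are read; the mount itself is not accessed.
# $1 = path prefix, $2 = proc root (default /proc; overridable for trying the
#      function against a fake directory tree).
open_files_under() {
    prefix=${1%/}
    proc_root=${2:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Skip entries that vanished or cannot be read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$prefix"/*)
                pid=${fd#"$proc_root"/}
                pid=${pid%%/*}
                echo "$pid $target"
                ;;
        esac
    done
}
```

Run as root on the affected node, for example open_files_under /var/lib/kubelet/pods, to see which processes still hold files on the pod volume paths.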