longhorn: [BUG] Volumes are not properly mounted/unmounted when kubelet restarts
Describe the bug
Pods within a StatefulSet get stuck in Terminating or Pending state after the kubelet goes from unready back to ready. The volume never gets detached for some reason.
To Reproduce
Steps to reproduce the behavior:
1. Take a multi-node cluster with a StatefulSet (in this case Prometheus from cattle-monitoring).
2. Stop the rke2-server process on the node where the pod is running.
3. Wait for the node to go NotReady.
4. `kubectl delete` the pod.
5. Start rke2-server again and wait patiently.
This by way of @oats87
Expected behavior Pods should recover from a kubelet restart/crash
Log I have logs, please ping me directly as they might contain sensitive information.
Environment:
- Longhorn version: v1.1.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: rke2
- Number of management node in the cluster: 1 node cluster with all roles
- Node config
- OS type and version: rhel 8
- CPU per node: 16
- Memory per node: 64
- Disk type (e.g. SSD/NVMe): ssd
- Network bandwidth between the nodes: NA
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Azure
Just leaving a quick update here: @Oats87 and I continued looking into the issue and we think that a change on the CSI driver would help us exit the error loop mentioned above. By unmounting the corrupt mount point during the `NodePublish` call, the following `NodePublish` call will mount it again; we can combine both of these actions into a single `NodePublish` call for existing corrupt mounts. Example commit: https://github.com/longhorn/longhorn-manager/commit/62eef5b7d7f931ef218e8183cb1056d72cee87cc
REF: https://github.com/kubernetes/kubernetes/issues/70013. With https://github.com/kubernetes/kubernetes/pull/88569, Kubernetes introduced the responsibility/ability for the CSI driver to deal with corrupt mounts.
Here is an example PR for the Azure driver that implements the cleanup routine mentioned above: https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/308
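For illustration only, such a cleanup inside `NodePublishVolume` could look roughly like the sketch below, assuming the `k8s.io/mount-utils` helpers; this is not the code from the commit/PRs linked above, and `remount` is an illustrative stand-in for the driver's normal mount logic.

```go
package csi

import (
	"os"

	mount "k8s.io/mount-utils"
)

// publishWithCleanup combines the cleanup of a corrupt mount point with the
// (re)mount in one NodePublish call: if the target path is a corrupt mount it
// is unmounted first, and then the regular mount logic runs.
func publishWithCleanup(targetPath string, mounter mount.Interface, remount func(string) error) error {
	// IsLikelyNotMountPoint stats the target path; for a corrupt mount the
	// stat fails with e.g. ENOTCONN, which IsCorruptedMnt recognizes.
	notMnt, err := mounter.IsLikelyNotMountPoint(targetPath)
	if err != nil {
		switch {
		case os.IsNotExist(err):
			// Target path not created yet; the regular mount logic handles it.
			notMnt = true
		case mount.IsCorruptedMnt(err):
			// Corrupt mount point: unmount it now instead of waiting for a
			// second NodePublish call, then fall through to remount below.
			if cleanupErr := mount.CleanupMountPoint(targetPath, mounter, true); cleanupErr != nil {
				return cleanupErr
			}
			notMnt = true
		default:
			return err
		}
	}
	if !notMnt {
		// Already mounted and apparently healthy; NodePublish is idempotent.
		return nil
	}
	return remount(targetPath)
}
```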
Other (in-tree) drivers behave similarly. This might leave some open file handles stale, since the block device's major:minor numbers can change between attachments.
But together with the instance-manager lifecycle management changes in https://github.com/longhorn/longhorn/issues/2650, this should be even less of a problem: the engine (volume) should continue being available to the workload pods, so the mount wouldn't even become corrupt in the first place (the block device never gets detached as long as there is no instance-manager failure).
I believe I've determined why the volume is never unmounted from the node.

When the `kubelet` starts up (and it continuously does this once it's running) and is building its desired state of the world / actual state of the world, it performs a `syncStates` pass where it checks the pod directory for volumes. It comes across the mount for the Longhorn volume on disk and proceeds to add it to the desired state of the world marked as "InUse": https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/volumemanager/reconciler/reconciler.go#L422 and https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/volumemanager/reconciler/reconciler.go#L441

Now, because `MountVolume.SetUp` never actually succeeds for this volume (we return a false positive "success" here: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_mounter.go#L271), the volume is never added to the actual state of the world. Thus, when the subsequent evaluation to `Unmount` the volume comes along, it doesn't get processed, because the volume isn't actually in the actual state of the world: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L286-L292

ref: https://github.com/kubernetes/kubernetes/pull/110670
@khushboo-rancher can you test this image: `joshimoo/longhorn-manager:ui-path-v1`

I broke the corrupt mount point detection. I will have another look tomorrow.

One way I was reproducing this was: using Rancher v2.5.8, register my cluster and install `cattle-monitoring v2` into it, with Prometheus using persistent storage (Longhorn storage class) of any size. Allow things to come up/stabilize. On the node where the `prometheus` pod is running, restart the `kubelet`, and observe that the pod goes into Terminating state forever.

/cc @khushboo-rancher
Investigated this issue with @joshimoo in a live debug session.
We're finding that the `kubelet` is in fact getting stuck in a `MountVolume.SetUp` loop, and is never able to finish because it is unable to set ownership on the directory. This can be seen here: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_mounter.go#L271

Interestingly, the Longhorn CSI plugin is receiving the `NodePublish` calls, but is returning essentially a "volume already mounted", as it is not checking the health of the mount, but rather just the fact that the mount point exists. This can be seen here: https://github.com/longhorn/longhorn-manager/blob/0f63e8fae7dc979b72d1529033d5669d22264d56/csi/node_server.go#L205
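For context, a health check that goes beyond "the mount point exists" could look roughly like the following sketch; the helper and state names are illustrative, not the current `node_server.go` logic.

```go
package csi

import (
	"os"

	mount "k8s.io/mount-utils"
)

// mountPointState classifies the CSI target path: not mounted, mounted and
// usable, or mounted but corrupt (e.g. the block device went away under it).
type mountPointState int

const (
	notMounted mountPointState = iota
	mountedHealthy
	mountedCorrupt
)

func checkMountPoint(targetPath string, mounter mount.Interface) (mountPointState, error) {
	notMnt, err := mounter.IsLikelyNotMountPoint(targetPath)
	switch {
	case err == nil && notMnt:
		return notMounted, nil
	case err == nil:
		// The mount point exists; verify it is actually usable before
		// reporting "volume already mounted" back to the kubelet.
		if _, readErr := os.ReadDir(targetPath); readErr != nil {
			return mountedCorrupt, nil
		}
		return mountedHealthy, nil
	case os.IsNotExist(err):
		return notMounted, nil
	case mount.IsCorruptedMnt(err):
		// stat failed with ENOTCONN/ESTALE/EIO: still a mount point, but dead.
		return mountedCorrupt, nil
	default:
		return notMounted, err
	}
}
```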
The `csi_client.go` attempts to call `NodePublish` and, based on the error result of that, returns different types of errors: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_client.go#L255-L258 If the error is a final error, it will remove the mount path in question, which will free up the node and have it attempt either to call `NodePublish` again (thus remounting the volume) or `NodeUnpublish`, which would be ideal in our case.

Current thought on our end is potentially to return an error from `node_server.go` if the mount point exists but is unhealthy, which will trigger the removal of the mount point. I'm not actually sure the `kubelet` will be able to remove the folder considering it is a mount point, but at least the client will return a final error rather than a transient error.
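As a sketch of that idea (not the fix that was eventually merged), the driver could return a gRPC code that `csi_client.go` classifies as final when the existing mount point is unhealthy:

```go
package csi

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errCorruptMount wraps the condition "target path exists but is not a healthy
// mount" in a gRPC code that the kubelet treats as final.
func errCorruptMount(targetPath string, cause error) error {
	// In the v1.20 csi_client.go linked above, only Canceled, DeadlineExceeded,
	// Unavailable, ResourceExhausted and Aborted are treated as non-final;
	// anything else, including Internal, is a final error rather than a
	// transient one, so the mount path gets cleaned up instead of retried.
	return status.Errorf(codes.Internal,
		"mount point %s exists but is not healthy: %v", targetPath, cause)
}
```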
I think the `UncertainProgress` error is how we're accidentally ending up in a `MountVolume.SetUp` loop: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_mounter.go#L277

@khushboo-rancher https://github.com/longhorn/longhorn-manager/pull/945 has been merged, so the fixes are now also available to test on master.
Working with the `joshimoo/longhorn-manager:ui-path-v1` image on an RKE1 multi-node cluster: the pods are getting restarted after termination in case of a kubelet restart.