longhorn: [BUG] Volumes are not properly mounted/unmounted when kubelet restarts

Describe the bug Pods within a StatefulSet get stuck in the Terminating or Pending state after the kubelet goes unready and then comes back ready. The volume never gets detached for some reason.

To Reproduce Steps to reproduce the behavior: we took a multi-node cluster with a StatefulSet (in this case Prometheus from cattle-monitoring), stopped the rke2-server process on the node where the pod is running, waited for the node to go NotReady, ran kubectl delete on the pod, then started rke2-server and waited patiently.

This is by way of @oats87.

Expected behavior Pods should recover from a kubelet restart/crash

Log I have logs, please ping me directly as they might contain sensitive information.

Environment:

  • Longhorn version: v1.1.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: rke2
    • Number of management node in the cluster: 1 node cluster with all roles
  • Node config
    • OS type and version: rhel 8
    • CPU per node: 16
    • Memory per node: 64
    • Disk type(e.g. SSD/NVMe): ssd
    • Network bandwidth between the nodes: NA
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Azure

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (16 by maintainers)

Most upvoted comments

Just leaving a quick update here: @Oats87 and I continued looking into the issue and we think that a change in the CSI driver would help us exit the error loop mentioned above. By unmounting the corrupt mount point during the NodePublish call, the following NodePublish call will mount it again; we can combine both of these actions into one NodePublish call for existing corrupt mounts. Example commit: https://github.com/longhorn/longhorn-manager/commit/62eef5b7d7f931ef218e8183cb1056d72cee87cc
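For illustration only (the linked commit is the authoritative change), here is a minimal sketch of that combined cleanup-plus-remount inside a NodePublishVolume handler. It assumes the k8s.io/mount-utils helpers; the `nodeServer` type and `ensureMountPointUsable` helper are made-up names, not the actual longhorn-manager code:

```go
package csi

import (
	"context"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	mountutils "k8s.io/mount-utils"
)

// nodeServer is a stand-in for the driver's CSI node service.
type nodeServer struct{}

// ensureMountPointUsable is a hypothetical helper: if targetPath is a corrupt
// mount (e.g. stat fails with ENOTCONN because the underlying block device
// went away), unmount it so the normal mount path below can remount it within
// the same NodePublishVolume call.
func ensureMountPointUsable(mounter mountutils.Interface, targetPath string) error {
	_, err := os.Stat(targetPath)
	if err == nil || os.IsNotExist(err) {
		return nil // healthy mount point, or nothing there yet
	}
	if mountutils.IsCorruptedMnt(err) {
		return mounter.Unmount(targetPath) // drop the stale mount so it can be remounted
	}
	return err
}

func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	targetPath := req.GetTargetPath()
	mounter := mountutils.New("")

	// Clean up a corrupt mount point first, instead of treating "the path
	// exists" as "already mounted".
	if err := ensureMountPointUsable(mounter, targetPath); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}

	// ... usual flow: create targetPath if missing, check whether it is
	// already a healthy mount point, otherwise mount the Longhorn block
	// device, then return success.
	return &csi.NodePublishVolumeResponse{}, nil
}
```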

REF: https://github.com/kubernetes/kubernetes/issues/70013. With this PR, Kubernetes introduced the responsibility/ability for the CSI driver to deal with corrupt mounts: https://github.com/kubernetes/kubernetes/pull/88569

Here is an example PR for the Azure driver that implements the cleanup routine mentioned above: https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/308

Other (in-tree) drivers behave similarly. This might leave some stale open file handles, since the block device major:minor can change between attachments.

But together with the lifecycle management changes for the instance-managers in https://github.com/longhorn/longhorn/issues/2650, this should be even less of a problem: the engine (volume) should continue being available to the workload pods, so the mount wouldn't even become corrupt in the first place (i.e. as long as there is no instance-manager failure, the block device never gets detached).

I believe I’ve determined why the volume is never Unmounted from the node.

When the kubelet starts up (and it continuously does this once it's running) and is building its desired state of the world / actual state of the world, it performs a syncState where it checks the pod directory for volumes. It will come across the mount for the Longhorn volume on disk and proceed to add it to the desired state of the world, marked as "InUse": https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/volumemanager/reconciler/reconciler.go#L422 and https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/volumemanager/reconciler/reconciler.go#L441

Now, because MountVolume.SetUp never actually succeeds for this volume (because we return a false-positive "success" here: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_mounter.go#L271), the volume is never added to the actual state of the world. Thus, when the subsequent evaluation to unmount the volume comes along, it doesn't get processed, as the volume isn't actually in the actual state of the world: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L286-L292
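To make that desired-state / actual-state asymmetry concrete, here is a heavily simplified, hypothetical model (not kubelet code; the real logic is in the reconciler/populator files linked above):

```go
package main

import "fmt"

// worldState is a toy model of the kubelet's two bookkeeping structures.
type worldState struct {
	desired map[string]bool // volumes reconstructed from the pod directory, marked "InUse"
	actual  map[string]bool // volumes whose MountVolume.SetUp genuinely completed
}

// syncStateFromDisk mirrors the startup sync: any mount found under the pod
// directory is added to the desired state of the world.
func (w *worldState) syncStateFromDisk(volumesOnDisk []string) {
	for _, v := range volumesOnDisk {
		w.desired[v] = true
	}
}

// unmountCandidates only ever considers the ACTUAL state of the world, so a
// volume whose SetUp never truly succeeded is never offered for unmount.
func (w *worldState) unmountCandidates(stillNeeded map[string]bool) []string {
	var out []string
	for v := range w.actual {
		if !stillNeeded[v] {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	w := &worldState{desired: map[string]bool{}, actual: map[string]bool{}}
	w.syncStateFromDisk([]string{"pvc-prometheus"}) // found on disk after the kubelet restart
	// SetUp keeps failing, so the volume never reaches w.actual ...
	fmt.Println(w.unmountCandidates(map[string]bool{})) // ... prints [] — nothing is ever unmounted
}
```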

@khushboo-rancher can you test this image: joshimoo/longhorn-manager:ui-path-v1? I broke the corrupt mount point detection. I will have another look tomorrow.

One way I was reproducing this was:

Using Rancher v2.5.8, register my cluster and install cattle-monitoring v2 into it, with Prometheus using persistent storage (the longhorn storage class) of any size.

Allow things to come up and stabilize. On the node where the Prometheus pod is running, restart the kubelet and observe that the pod goes into the Terminating state forever.

/cc @khushboo-rancher

Investigated this issue with @joshimoo in a live debug session.

We’re finding that the kubelet is in fact getting stuck in a MountVolume.SetUp loop, and is never able to finish this because it is unable to set ownership on the directory. This can be seen here: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_mounter.go#L271

Interestingly, the Longhorn CSI plugin is receiving the NodePublish calls but is returning essentially a "volume already mounted" response, as it is not checking the health of the mount but rather just the fact that the mount point exists. This can be seen here: https://github.com/longhorn/longhorn-manager/blob/0f63e8fae7dc979b72d1529033d5669d22264d56/csi/node_server.go#L205
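Roughly speaking (a hypothetical illustration, not the actual node_server.go logic), the difference is between checking only that the target path exists and probing whether the mount behind it is still healthy:

```go
package csi

import (
	"os"

	mountutils "k8s.io/mount-utils"
)

// existenceOnlyCheck treats any present path, even one whose stat fails with
// ENOTCONN, as "already mounted" — the behavior described above.
func existenceOnlyCheck(targetPath string) bool {
	_, err := os.Stat(targetPath)
	return err == nil || !os.IsNotExist(err)
}

// healthAwareCheck distinguishes a healthy path from a corrupt mount so the
// driver can react (unmount and remount, or return an error) instead of
// reporting success.
func healthAwareCheck(targetPath string) (usable bool, corrupt bool, err error) {
	if _, statErr := os.Stat(targetPath); statErr != nil {
		if os.IsNotExist(statErr) {
			return false, false, nil
		}
		if mountutils.IsCorruptedMnt(statErr) {
			return false, true, nil // mount point exists but is unusable
		}
		return false, false, statErr
	}
	return true, false, nil // path is present and stat-able; mount-point checks can follow
}
```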

The csi_client.go attempts to call NodePublish and, based on the error result of that, returns different types of errors: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_client.go#L255-L258 If the error is a final error, it will remove the mount path in question, which will free things up and have the kubelet attempt to either call NodePublish again (thus remounting the volume) or NodeUnpublish, which would be ideal in our case.
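For reference, the kubelet's isFinalError in csi_client.go roughly treats codes like Canceled, DeadlineExceeded, Unavailable, ResourceExhausted and Aborted as "operation may still be in progress" and other gRPC codes as final. A hedged sketch of what "return a final error for an unhealthy mount" could look like (isCorruptMountPoint is a made-up helper, and codes.Internal is just one example of a code the kubelet treats as final):

```go
package csi

import (
	"context"
	"os"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	mountutils "k8s.io/mount-utils"
)

type nodeServer struct{}

// isCorruptMountPoint is a hypothetical helper: a stat error that
// mount-utils classifies as a corrupted mount means the mount point exists
// but is unusable.
func isCorruptMountPoint(targetPath string) bool {
	_, err := os.Stat(targetPath)
	return err != nil && mountutils.IsCorruptedMnt(err)
}

func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	targetPath := req.GetTargetPath()
	if isCorruptMountPoint(targetPath) {
		// A code like codes.Aborted or codes.Unavailable would be read as
		// "still in progress" and keep the kubelet looping; codes.Internal is
		// treated as final, so the kubelet can clean up the target path and
		// retry NodePublish (or call NodeUnpublish).
		return nil, status.Errorf(codes.Internal,
			"volume %s: mount point %s exists but is unhealthy",
			req.GetVolumeId(), targetPath)
	}
	// ... normal publish path ...
	return &csi.NodePublishVolumeResponse{}, nil
}
```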

Our current thought is to return an error from node_server.go if the mount point exists but is unhealthy, which will trigger the removal of the mount point. I'm not actually sure the kubelet will be able to remove the folder, considering it is a mount point, but at least the client will return a final error rather than a transient error.

I think the UncertainProgress error is how we’re accidentally ending up in a MountVolume.SetUp loop: https://github.com/kubernetes/kubernetes/blob/v1.20.7/pkg/volume/csi/csi_mounter.go#L277

@khushboo-rancher https://github.com/longhorn/longhorn-manager/pull/945 has been merged, so the fixes are now also available to test on master.

Working with the joshimoo/longhorn-manager:ui-path-v1 image on an RKE1 multi-node cluster, the pods are getting restarted after termination in case of a kubelet restart.