longhorn: [BUG] Volume unable to recover when upgrading several StatefulSets

A volume is stuck with Unknown health, and several can’t be mounted because they’re already attached. This probably happened after restarting the Longhorn manager instances.

Warning FailedAttachVolume 3m22s (x36 over 60m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-ef1550da-b248-4b74-995f-4af189bfcfaa" : rpc error: code = Aborted desc = The volume pvc-ef1550da-b248-4b74-995f-4af189bfcfaa is already attached but it is not ready for workloads

and

Warning FailedAttachVolume 3m5s (x36 over 60m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-1a16cfdb-6b14-472a-a104-c8a654da0102" : rpc error: code = FailedPrecondition desc = The volume pvc-1a16cfdb-6b14-472a-a104-c8a654da0102 cannot be attached to the node connect-b since it is already attached to the node connect-a

[Screenshot attached: Screen Shot 2021-03-11 at 1 28 43 PM]

I’m surprised they aren’t automatically fixed based on https://longhorn.io/docs/1.1.0/high-availability/recover-volume/.


Version: v1.1.0

Support bundle attached: longhorn-support-bundle_ee48a964-d6ec-4141-af7a-84f62c5cea44_2021-03-11T21-31-07Z.zip

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 19 (12 by maintainers)

Most upvoted comments

I was able to recover all volumes by stopping the pods attempting to use them, detaching the volumes in the Longhorn UI, and then starting the pods again.

Ideally this would be done automatically.
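For reference, a minimal sketch of that manual recovery, assuming the workload is a StatefulSet named connect in the default namespace (both names are hypothetical) and that the detach step is done in the Longhorn UI:

# Stop the pods that are trying to use the stuck volumes
kubectl -n default scale statefulset connect --replicas=0

# Detach the affected volumes in the Longhorn UI and wait until they show as Detached

# Start the pods again so Kubernetes re-attaches the volumes
kubectl -n default scale statefulset connect --replicas=1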

For volume pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-e-b5bb8817

Workaround: Directly delete the longhorn-manager pod on node connect-a, the node the volume/engine is on. Longhorn will then automatically restart the longhorn-manager pod, mark the volume as Faulted, and recover the volume.
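A sketch of that workaround with kubectl, assuming the default longhorn-system namespace and the app=longhorn-manager pod label of a standard install (verify both in your cluster):

# Find the longhorn-manager pod running on connect-a
kubectl -n longhorn-system get pods -l app=longhorn-manager -o wide --field-selector spec.nodeName=connect-a

# Delete it; the DaemonSet recreates it, after which the volume should be marked Faulted and salvaged
kubectl -n longhorn-system delete pod -l app=longhorn-manager --field-selector spec.nodeName=connect-a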

Root cause:

  1. All replicas are in state error but the engine is still in state running. Ideally, the engine should error out and the volume should become Faulted, and then auto salvage would recover the volume. However, the field spec.failedAt of replica pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-r-73cbddfa is empty, which does not match its error state. This replica blocks the flow.
  2. This problematic replica is the only entry in the engine's currentReplicaAddressMap (pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-r-73cbddfa: 10.32.0.8:10015), but somehow there is no corresponding record in the engine's replicaModeMap. Based on the implementation, Longhorn relies on the record in replicaModeMap to set spec.failedAt for the replica. The replica is actually still in the engine process with mode ERR (a diagnostic sketch follows this list):
2021-03-11T20:39:05.585916097Z [pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-e-b5bb8817] time="2021-03-11T20:39:05Z" level=error msg="Backend tcp://10.32.0.8:10015 monitoring failed, mark as ERR: EOF"
2021-03-11T20:39:05.585938846Z time="2021-03-11T20:39:05Z" level=info msg="Set replica tcp://10.32.0.8:10015 to mode ERR"
2021-03-11T20:39:05.585946696Z time="2021-03-11T20:39:05Z" level=info msg="Monitoring stopped tcp://10.32.0.8:10015"
  3. According to the following logs, another longhorn manager somehow takes over the engine from the node that the engine is supposed to be on. The current node connect-a then stops monitoring the engine and cleans up the replicaModeMap, the endpoint, and some other fields. I am not sure whether the wrong owner ID change is caused by cluster node inconsistency, e.g. a node is disconnected from the others but the longhorn manager on that node considers the other nodes unavailable and itself still healthy.
2021-03-11T20:18:36.934454632Z time="2021-03-11T20:18:36Z" level=debug msg="Requeue volume due to error <nil> or Operation cannot be fulfilled on replicas.longhorn.io \"pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-r-b46ed366\": the object has been modified; please apply your changes to the latest version and try again" accessMode=rwo controller=longhorn-volume frontend=blockdev node=connect-a owner=connect-a state=attached volume=pvc-ef1550da-b248-4b74-995f-4af189bfcfaa
2021-03-11T20:18:37.134330941Z time="2021-03-11T20:18:37Z" level=info msg="stop monitoring because the engine is no longer running on node" controller=longhorn-engine engine=pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-e-b5bb8817 node=connect-a
  4. Node connect-a actually gets the engine back later, but the engine controller on this node did not restart the monitoring. This is a Longhorn bug: when the current engine controller stops the monitoring (due to the owner ID change), the engine monitor map is not cleaned up. So when the engine comes back later, the engine controller wrongly considers that it is still monitoring the engine.
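For anyone hitting this, a hedged way to inspect the mismatch described above, assuming the Longhorn CRs live in longhorn-system and the engine exposes the currentReplicaAddressMap and replicaModeMap fields in its status (verify with -o yaml):

# The replica's failedAt field (empty here despite the error state)
kubectl -n longhorn-system get replicas.longhorn.io pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-r-73cbddfa -o jsonpath='{.spec.failedAt}{"\n"}'

# Compare the engine's replica address map with its replica mode map
kubectl -n longhorn-system get engines.longhorn.io pvc-ef1550da-b248-4b74-995f-4af189bfcfaa-e-b5bb8817 -o yaml | grep -A 3 -E 'currentReplicaAddressMap:|replicaModeMap:'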

For volume pvc-1a16cfdb-6b14-472a-a104-c8a654da0102

Workaround: Manually detach the volume (via Longhorn UI) and clean up the related VolumeAttachment csi-569be94630dc21616bb66ba5383a0930da5a6d0537f29457642d021d9a4cf5b0:

kubectl delete volumeattachment csi-569be94630dc21616bb66ba5383a0930da5a6d0537f29457642d021d9a4cf5b0
kubectl patch volumeattachment csi-569be94630dc21616bb66ba5383a0930da5a6d0537f29457642d021d9a4cf5b0 -p '{"metadata":{"finalizers":null}}'

Then wait for Kubernetes to retry the attachment for the workload.
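Once Kubernetes retries, the new attachment can be confirmed, for example (the Volume CR field names below are per the v1.1.0 CRD; double-check them with -o yaml in your install):

# The old VolumeAttachment should be gone and a new one should appear for node connect-b
kubectl get volumeattachments

# The Longhorn volume should eventually report attached and healthy on connect-b
kubectl -n longhorn-system get volumes.longhorn.io pvc-1a16cfdb-6b14-472a-a104-c8a654da0102 -o yaml | grep -E 'currentNodeID:|state:|robustness:'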

Root cause: When the volume crashes and then becomes detached (the crash seems to be caused by the instance manager pod crashing), Kubernetes sends a CSI detach request to the CSI plugin. At that moment, even though spec.nodeID is not empty, the volume has temporarily become detached, so the plugin wrongly responds that the detachment succeeded. As a result, Kubernetes considers this volume ready to be attached to another node. This is also a Longhorn bug. I had encountered and reported this bug before, but I cannot find the issue number.

Reproduce steps:

  • create volume + pod with a single replica and data locality enabled (a setup sketch follows this list)
  • turn off node
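A minimal sketch of that setup, assuming the stock driver.longhorn.io provisioner and the Longhorn StorageClass parameters numberOfReplicas and dataLocality; all resource names, the image, and the size below are hypothetical:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  dataLocality: "best-effort"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-single-replica
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "echo test-data > /data/test && sleep 2147483647"]
    volumeMounts:
    - name: vol
      mountPath: /data
  volumes:
  - name: vol
    persistentVolumeClaim:
      claimName: test-pvc
EOF
# Then power off the node that hosts the pod / replica.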

Problem 01: Kubernetes + Longhorn out of sync

  • before the CSI changes, Kubernetes and Longhorn would go out of sync and Kubernetes could no longer attach the volume, since the volume is still desired to be attached to the prior node, i.e. v.spec.NodeID is not unset (a quick check is sketched below)
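A quick way to see that mismatch, assuming the Longhorn Volume CR still carries the old node in spec.nodeID (the volume name placeholder is hypothetical):

# Longhorn still thinks the volume should be attached to the prior (downed) node
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -o jsonpath='{.spec.nodeID}{"\n"}'

# While the Kubernetes side shows the pending/failed attach on the new node
kubectl get volumeattachments | grep <volume-name>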

Problem 02: salvage deadloop preventing volume updates

  • after the CSI changes, or say you manually unset v.spec.NodeID via a detach from the UI or via a manual salvage of the replica while it is still on the downed node (manual salvage will first unset v.spec.nodeID and then process the replicas; the latter operation will fail since the replica is down)
  • the volume will now constantly fail to update, since isNodeDownCheck(v.spec.NodeID) (https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L1033) will always return an error

Problem 03: monitoring / ownership issue, @PhanLe1010 is working on this in https://github.com/longhorn/longhorn-manager/pull/854

Test Setup: This fixes issues with the salvage of multiple replicas as well; it is just easier to test with the single-replica case.

  • This should be tested with data locality enabled/disabled.
  • This should be tested with replicaset / statefulset
  • This should be tested with RWO/RWX volumes

Test steps:

  • create RWO|RWX volume with replica count = 1 & data locality = enabled|disabled
  • create deployment|statefulset for volume
  • power down the node of the volume/replica
  • wait for pod deletion & recreation
  • volume will fail to attach since it is not ready (i.e. it remains faulted, since the single replica is on the downed node)
  • power up the node
  • verify auto salvage finishes (i.e. the pod completes starting)
  • verify the volume is attached & accessible by the pod (i.e. the test data is available; a scripted check is sketched below)
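A possible way to script the last two checks, assuming the Volume CR exposes state/robustness in its status and the pod wrote a test file before the outage; the volume name, pod name, and file path below are hypothetical:

# Volume should be attached and healthy again after the node is powered back up
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -o yaml | grep -E 'state:|robustness:'

# Pod should finish starting and still see the test data
kubectl wait pod/<pod-name> --for=condition=Ready --timeout=10m
kubectl exec <pod-name> -- cat /data/test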