longhorn: [BUG] Cannot detach volume

Describe the bug (🐛 if you encounter this issue)

No pod is using the volume, but it cannot be detached because it immediately gets reattached automatically.

From what I can see, the corresponding VolumeAttachment object has an attachment ticket that cannot be deleted (if deleted, it’s immediately recreated):

spec:
  attachmentTickets:
    volume-rebuilding-controller-pvc-150c0bbc-25d6-4fe2-b76b-6bce680d6975:
      generation: 0
      id: volume-rebuilding-controller-pvc-150c0bbc-25d6-4fe2-b76b-6bce680d6975
      nodeID: rkemetal1
      parameters:
        disableFrontend: 'true'
      type: volume-rebuilding-controller
  volume: pvc-150c0bbc-25d6-4fe2-b76b-6bce680d6975
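
For reference, the ticket shown above can be inspected directly on the Longhorn VolumeAttachment custom resource; a sketch, assuming (as is typical) that the VolumeAttachment CR is named after the volume:

# show the attachment tickets for this volume
kubectl get volumeattachments.longhorn.io pvc-150c0bbc-25d6-4fe2-b76b-6bce680d6975 -n longhorn-system -o yaml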

To Reproduce

Don’t know.

Expected behavior

The volume should stay in the “detached” state.

Support bundle for troubleshooting

Please remove the bundle once you’ve downloaded it.

Environment

Longhorn 1.5.1.

About this issue


  • State: closed
  • Created 10 months ago
  • Comments: 18 (13 by maintainers)

Most upvoted comments

In the meantime, I think we have all the data we can possibly get from the user, so we should proceed to get the user out of this stuck situation. You can try this workaround @h-e-l-o:

Workaround:

  1. Get all problematic volumes:

     kubectl get volumes.longhorn.io -o json -n longhorn-system | jq -r '.items[] | select(.status.offlineReplicaRebuildingRequired == true) | .metadata.name'

  2. Update each volume (see the combined sketch just below this list):

     kubectl patch volumes.longhorn.io <VOLUME-NAME> --type=merge --subresource status --patch 'status: {offlineReplicaRebuildingRequired: false}' -n longhorn-system
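
If many volumes are affected, the two steps can be combined into a single loop; this is only a sketch, assuming jq is available and kubectl is recent enough to support --subresource:

# patch every volume that still has offlineReplicaRebuildingRequired set to true
for v in $(kubectl get volumes.longhorn.io -n longhorn-system -o json | jq -r '.items[] | select(.status.offlineReplicaRebuildingRequired == true) | .metadata.name'); do
  kubectl patch volumes.longhorn.io "$v" -n longhorn-system --type=merge --subresource status --patch 'status: {offlineReplicaRebuildingRequired: false}'
done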

OK, I’ve found that using:

kubectl patch lhv -n longhorn-system pvc-150c0bbc-25d6-4fe2-b76b-6bce680d6975 --type=merge --subresource status --patch 'status: {offlineReplicaRebuildingRequired: false}'

allows me to set offlineReplicaRebuildingRequired to false, and I’m able to detach the volume afterwards.
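
Afterwards the result can be confirmed on the volume CR; a quick check, assuming the state is reported in status.state:

# should print "detached" once the volume has been released
kubectl get volumes.longhorn.io pvc-150c0bbc-25d6-4fe2-b76b-6bce680d6975 -n longhorn-system -o jsonpath='{.status.state}'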

We did a live code analysis with @ejweber and @james-munson and have a theory about what could have gone wrong. I am trying to reproduce the issue based on that theory.

Just in case, I’ve disabled the Offline Replica Rebuilding option, which for some reason was enabled; it seems to be relevant only to the v2 data engine.

Yeah, feel free to disable it.
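
For completeness, the option can also be checked and disabled from the CLI; a sketch, assuming the Longhorn setting is named offline-replica-rebuilding and stores its value in the top-level value field:

# check the current value, then disable the feature
kubectl get settings.longhorn.io offline-replica-rebuilding -n longhorn-system
kubectl patch settings.longhorn.io offline-replica-rebuilding -n longhorn-system --type=merge --patch '{"value": "disabled"}'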

It looks like 21 volumes in this cluster have volume.status.offlineReplicaRebuildingRequired == true, and that has caused most (all?) of them to gain similar attachmentTickets.

From a quick check of the code, we should never set volume.status.offlineReplicaRebuildingRequired = true unless a volume is using the v2 engine. There is no evidence that any volumes are using the v2 engine, so further investigation is required.
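
One way to double-check for v2 volumes from the CLI; a sketch, assuming the Longhorn v1.5 field name spec.backendStoreDriver identifies the data engine:

# list any volumes that declare the v2 data engine (expected to print nothing here)
kubectl get volumes.longhorn.io -n longhorn-system -o json | jq -r '.items[] | select(.spec.backendStoreDriver == "v2") | .metadata.name'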