longhorn: [BUG] RWX volume is stuck at detaching when the attached node is down

Describe the bug (๐Ÿ› if you encounter this issue)

The RWX volume is stuck at detaching when the attached node is down. The root cause is a side effect of the fix for https://github.com/longhorn/longhorn/issues/5507: to prevent the nfs-ganesha server from mounting the dead volume, the engine controller tries to delete the engine instance. If the attached node is down and never comes back, the deletion cannot succeed and is retried again and again, so the volume stays stuck in the detaching state.
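
The symptom can be confirmed from the Longhorn CRs. A minimal sketch, assuming the default longhorn-system namespace; the exact printer columns may differ between Longhorn versions:

    # Volume stays in the detaching state and never reaches detached
    kubectl -n longhorn-system get volumes.longhorn.io

    # Engine CR still reports an instance on the down node,
    # which the engine controller keeps trying to delete
    kubectl -n longhorn-system get engines.longhorn.io -o wide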

To Reproduce

Steps to reproduce the behavior:

  1. Go to '…'
  2. Click on '…'
  3. Perform '…'
  4. See error

Expected behavior

The RWX volume should finish detaching after the attached node goes down, so that it can be reattached to a healthy node instead of being stuck at detaching.

Log or Support bundle

If applicable, add the Longhorn managers' log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version: v1.4.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Workaround

  • Delete the down node, or
  • Delete the unknown instance manager of the down node (see the sketch after this list):
    • List the instance managers:
      kubectl -n longhorn-system get instancemanagers
      
      Then you can see the unknown instance-manager-e-xxx which is on the down node, and delete it.
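
If several instance managers exist, the unknown one can also be picked out by node and state. The field paths below (spec.nodeID and status.currentState) are assumptions about the InstanceManager CRD and may differ between Longhorn versions, so treat this as a sketch and verify the output before deleting anything:

    # Show each instance manager together with its node and state
    kubectl -n longhorn-system get instancemanagers \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID,STATE:.status.currentState

    # Delete the engine instance manager reported as unknown on the down node
    kubectl -n longhorn-system delete instancemanager instance-manager-e-xxx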

Additional context

Add any other context about the problem here.

https://cloud-native.slack.com/archives/CNVPEL9U3/p1678784583120039

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 2
  • Comments: 25 (11 by maintainers)

Most upvoted comments

kubectl -n longhorn-system get instancemanagers
kubectl -n longhorn-system delete instancemanagers <unknown-pod>

This is a temporary workaround and it works for me (on v1.4.1).
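
A scripted variant of the same idea, assuming the InstanceManager status exposes a currentState field that reads unknown for the down node (verify the names before deleting anything):

    # Print the names of instance managers currently reported as unknown
    kubectl -n longhorn-system get instancemanagers \
      -o jsonpath='{range .items[?(@.status.currentState=="unknown")]}{.metadata.name}{"\n"}{end}'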

Thank you for the continuous support @derekbit

We have test cases for RWX volumes with node down/restart/kubelet restart, etc.: https://longhorn.github.io/longhorn-tests/manual/pre-release/ha/single-replica-node-down/ , https://longhorn.github.io/longhorn-tests/manual/pre-release/node/kubelet-restart-on-a-node/

I feel most of the node down/reboot test cases are scenario-based, and we can make them better by consolidating them. Having these test cases in one place (under one category) will reduce the chance of missing them.

cc @longhorn/qa Do we have node down case for RWX volume (share manager pod)?

Hi @innobead I think I may have observed the same symptoms as the user mentioned in the Slack channel. I also asked a question about it on https://github.com/longhorn/longhorn/issues/5488#issuecomment-1467838260. I searched for the node down case for RWX volumes in our manual test cases this afternoon, but I didn't find it.

The behavior of the Longhorn storage system when the node network is disconnected is different on the master head compared to version 1.3.3-rc1. As a result, I need more time to perform additional testing to identify any potential issues related to this scenario. Therefore, I will focus on testing the 1.3.3 release first, and then I will verify this issue again.