longhorn: [BUG] Deadlock for RWX volume if an error occurs in its share-manager pod
Describe the bug (🐛 if you encounter this issue)
This is a regression introduced by https://github.com/longhorn/longhorn-manager/pull/2294 (for https://github.com/longhorn/longhorn/issues/7106).
When an error occurs in a share-manager pod, its phase transitions to `Completed`. The share-manager controller can neither restart the pod nor update the status of the ShareManager CR, because it continuously fails to contact the share-manager process when attempting a remount.
Before https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager controller did not attempt to contact the dead share manager process, so there was no deadlock.
To Reproduce
- Install Longhorn v1.6.0-dev.
- Deploy the example NGINX deployment (`examples/rwx/rwx-nginx-deployment.yaml` in the longhorn/longhorn repo).
- Identify the share-manager pod.
- Kill NFS-Ganesha inside the share-manager pod:
  ```
  kubectl exec -n longhorn-system share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211 -- pkill ganesha.nfsd
  ```
- The share-manager pod remains in the `Completed` phase and is not restarted.
  ```
  NAME                                                     READY   STATUS      RESTARTS   AGE
  share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211   0/1     Completed   0          12m
  ```
- The longhorn-manager pod repeatedly logs a failure to sync the share-manager.
  ```
  [longhorn-manager-nf4lv] W1122 17:36:53.403245 1 logging.go:59] [core] [Channel #727 SubChannel #728] grpc: addrConn.createTransport failed to connect to {Addr: "10.42.59.152:9600", ServerName: "10.42.59.152:9600", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout"
  [longhorn-manager-nf4lv] time="2023-11-22T17:36:53Z" level=error msg="Failed to sync Longhorn share manager" func=controller.handleReconcileErrorLogging file="utils.go:72" ShareManager=longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211 controller=longhorn-share-manager error="failed to sync longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211: failed to mount share manager pod share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout\"" node=eweber-v125-worker-e472db53-9kz5b
  ```
- The ShareManager CR's status is not updated. It still shows as `running`.
  ```
  NAME                                       STATE     NODE                                AGE
  pvc-9216f564-379e-4fd8-861b-e335ccbe8211   running   eweber-v125-worker-e472db53-9kz5b   16m
  ```
Expected behavior
Before https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager pod would be successfully restarted.
Support bundle for troubleshooting
There is a support bundle in the related CNCF Slack thread: https://cloud-native.slack.com/archives/CNVPEL9U3/p1700618585865019
There are other issues captured in that support bundle as well, so the reproduce steps above may be a bit easier to work with.
Additional context
After https://github.com/longhorn/longhorn-manager/pull/2294, in the share-manager controller, we always attempt a gRPC call to the share-manager pod to do a remount if status.state == running in the ShareManager CR.
However, we do not update the status.state of the ShareManager CR until AFTER this point in the reconcile loop.
So we reconcile indefinitely: we cannot do a remount because the share-manager pod is dead, and we cannot learn that the share-manager pod is dead because we error out while attempting to reconcile.
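To make the ordering concrete, here is a minimal, self-contained Go sketch of the flow described above. All of the type and function names (`ShareManager`, `remountViaGRPC`, `updateStateFromPod`, `syncShareManager`) are illustrative stand-ins, not the actual longhorn-manager code:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the real longhorn-manager types; illustrative only.
type ShareManagerState string

const (
	StateRunning ShareManagerState = "running"
	StateError   ShareManagerState = "error"
)

type ShareManager struct {
	State ShareManagerState
}

// remountViaGRPC stands in for the gRPC remount call added by
// longhorn-manager#2294. With the share-manager process dead, it always fails.
func remountViaGRPC(sm *ShareManager) error {
	return errors.New("rpc error: connection to share-manager pod timed out")
}

// updateStateFromPod stands in for the later part of the reconcile that would
// notice the pod is in phase Completed and flip the CR state to error.
func updateStateFromPod(sm *ShareManager) error {
	sm.State = StateError
	return nil
}

// syncShareManager sketches the reconcile ordering that causes the deadlock.
func syncShareManager(sm *ShareManager) error {
	// Because the CR still says "running", we attempt the remount first...
	if sm.State == StateRunning {
		if err := remountViaGRPC(sm); err != nil {
			// ...and return early on failure, so the controller just
			// requeues with the CR state unchanged.
			return err
		}
	}
	// We never reach the step that would update the state and allow the pod
	// to be recreated.
	return updateStateFromPod(sm)
}

func main() {
	sm := &ShareManager{State: StateRunning}
	// Every reconcile fails the same way; the state never leaves "running".
	for i := 0; i < 3; i++ {
		fmt.Println(syncShareManager(sm), "state:", sm.State)
	}
}
```

Under this reading, the manual patch in the workaround below unblocks the controller because, once status.state is no longer running, the remount attempt is skipped.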
Workaround
https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
A workaround is to manually update the status of the share manager resource to `error`:
```
kubectl -n longhorn-system patch lhsm pvc-b08581d7-f48c-4b27-8083-7f24b89ba4ea --type=merge --subresource status --patch 'status: {state: error}'
```
@voarsh2 Yes.