longhorn: [BUG] Deadlock for RWX volume if an error occurs in its share-manager pod
Describe the bug (🐛 if you encounter this issue)
This is a regression introduced by https://github.com/longhorn/longhorn-manager/pull/2294 (for https://github.com/longhorn/longhorn/issues/7106).
When an error occurs in a share-manager pod, its phase transitions to `Completed`. The share-manager controller can neither restart the pod nor update the status of the ShareManager CR, because it continuously fails to contact the share-manager process when attempting a remount.
Before https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager controller did not attempt to contact the dead share manager process, so there was no deadlock.
To Reproduce
- Install Longhorn v1.6.0-dev.
- Deploy the example NGINX deployment (`examples/rwx/rwx-nginx-deployment.yaml` in the longhorn/longhorn repo).
- Identify the share-manager pod.
- Kill NFS-Ganesha inside the share-manager pod:
  ```
  kubectl exec -n longhorn-system share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211 -- pkill ganesha.nfsd
  ```
- The share-manager pod remains in the `Completed` phase and is not restarted.
  ```
  NAME                                                     READY   STATUS      RESTARTS   AGE
  share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211   0/1     Completed   0          12m
  ```
- The longhorn-manager pod repeatedly logs a failure to sync the share-manager.
  ```
  [longhorn-manager-nf4lv] W1122 17:36:53.403245 1 logging.go:59] [core] [Channel #727 SubChannel #728] grpc: addrConn.createTransport failed to connect to {Addr: "10.42.59.152:9600", ServerName: "10.42.59.152:9600", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout"
  [longhorn-manager-nf4lv] time="2023-11-22T17:36:53Z" level=error msg="Failed to sync Longhorn share manager" func=controller.handleReconcileErrorLogging file="utils.go:72" ShareManager=longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211 controller=longhorn-share-manager error="failed to sync longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211: failed to mount share manager pod share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout\"" node=eweber-v125-worker-e472db53-9kz5b
  ```
- The ShareManager CR's status is not updated. It still shows as `running`.
  ```
  NAME                                       STATE     NODE                                AGE
  pvc-9216f564-379e-4fd8-861b-e335ccbe8211   running   eweber-v125-worker-e472db53-9kz5b   16m
  ```
Expected behavior
Before https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager pod would be successfully restarted.
Support bundle for troubleshooting
There is a support bundle in the related CNCF Slack thread: https://cloud-native.slack.com/archives/CNVPEL9U3/p1700618585865019
There are other issues captured in that support bundle as well, so the reproduce steps above may be a bit easier to work with.
Additional context
After https://github.com/longhorn/longhorn-manager/pull/2294, in the share-manager controller, we always attempt a gRPC call to the share-manager pod to do a remount if status.state == running in the ShareManager CR.
However, we do not update the status.state of the ShareManager CR until AFTER this point in the reconcile loop.
So we reconcile indefinitely: we cannot do a remount because the share-manager pod is dead, and we cannot learn that the share-manager pod is dead because we error out while attempting to reconcile.
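To make the ordering concrete, here is a minimal, self-contained Go sketch of the flow described above. All of the type and function names (`ShareManager`, `remountViaGRPC`, `updateStateFromPod`, `syncShareManager`) are illustrative stand-ins, not the actual longhorn-manager code:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the real longhorn-manager types; illustrative only.
type ShareManagerState string

const (
	StateRunning ShareManagerState = "running"
	StateError   ShareManagerState = "error"
)

type ShareManager struct {
	State ShareManagerState
}

// remountViaGRPC stands in for the gRPC remount call added by
// longhorn-manager#2294. With the share-manager process dead, it always fails.
func remountViaGRPC(sm *ShareManager) error {
	return errors.New("rpc error: connection to share-manager pod timed out")
}

// updateStateFromPod stands in for the later part of the reconcile that would
// notice the pod is in phase Completed and flip the CR state to error.
func updateStateFromPod(sm *ShareManager) error {
	sm.State = StateError
	return nil
}

// syncShareManager sketches the reconcile ordering that causes the deadlock.
func syncShareManager(sm *ShareManager) error {
	// Because the CR still says "running", we attempt the remount first...
	if sm.State == StateRunning {
		if err := remountViaGRPC(sm); err != nil {
			// ...and return early on failure, so the controller just
			// requeues with the CR state unchanged.
			return err
		}
	}
	// We never reach the step that would update the state and allow the pod
	// to be recreated.
	return updateStateFromPod(sm)
}

func main() {
	sm := &ShareManager{State: StateRunning}
	// Every reconcile fails the same way; the state never leaves "running".
	for i := 0; i < 3; i++ {
		fmt.Println(syncShareManager(sm), "state:", sm.State)
	}
}
```

Under this reading, the manual patch in the workaround below unblocks the controller because, once status.state is no longer running, the remount attempt is skipped.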
Workaround
https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359
A workaround is to manually update the status of the share manager resource to `error`:
```
kubectl -n longhorn-system patch lhsm pvc-b08581d7-f48c-4b27-8083-7f24b89ba4ea --type=merge --subresource status --patch 'status: {state: error}'
```
@voarsh2 Yes.