longhorn: [BUG] Deadlock for RWX volume if an error occurs in its share-manager pod

Describe the bug (🐛 if you encounter this issue)

This is a regression introduced by https://github.com/longhorn/longhorn-manager/pull/2294 (for https://github.com/longhorn/longhorn/issues/7106).

When an error occurs in a share-manager pod, the pod transitions to the Completed status. The share-manager controller is then unable to restart the pod or update the status of the ShareManager CR because it continuously fails to contact the dead share manager process while attempting a remount.

Before https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager controller did not attempt to contact the dead share manager process, so there was no deadlock.

To Reproduce

  1. Install Longhorn v1.6.0-dev.
  2. Deploy the example NGINX deployment (examples/rwx/rwx-nginx-deployment.yaml in the longhorn/longhorn repo).
  3. Identify the share manager pod.
  4. Kill NFS-Ganesha inside the share-manager pod: kubectl exec -n longhorn-system share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211 -- pkill ganesha.nfsd
  5. The share-manager pod remains in the Completed state and is not restarted.
    NAME                                                     READY   STATUS      RESTARTS   AGE
    share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211   0/1     Completed   0          12m
    
  6. The longhorn-manager pod repeatedly logs a failure to sync the share-manager.
    [longhorn-manager-nf4lv] W1122 17:36:53.403245       1 logging.go:59] [core] [Channel #727 SubChannel #728] grpc: addrConn.createTransport failed to connect to {Addr: "10.42.59.152:9600", ServerName: "10.42.59.152:9600", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout"
    [longhorn-manager-nf4lv] time="2023-11-22T17:36:53Z" level=error msg="Failed to sync Longhorn share manager" func=controller.handleReconcileErrorLogging file="utils.go:72" ShareManager=longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211 controller=longhorn-share-manager error="failed to sync longhorn-system/pvc-9216f564-379e-4fd8-861b-e335ccbe8211: failed to mount share manager pod share-manager-pvc-9216f564-379e-4fd8-861b-e335ccbe8211: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.59.152:9600: i/o timeout\"" node=eweber-v125-worker-e472db53-9kz5b
    
  7. The ShareManager CR’s status is not updated. It still shows as running.
    NAME                                       STATE     NODE                                AGE
    pvc-9216f564-379e-4fd8-861b-e335ccbe8211   running   eweber-v125-worker-e472db53-9kz5b   16m
    

Expected behavior

The share-manager pod should be restarted after NFS-Ganesha is killed, as it was before https://github.com/longhorn/longhorn-manager/pull/2294.

Support bundle for troubleshooting

There is a support bundle in the related CNCF Slack thread: https://cloud-native.slack.com/archives/CNVPEL9U3/p1700618585865019

There are other issues captured in that support bundle as well, so following the reproduce steps above may be easier to work with.

Additional context

After https://github.com/longhorn/longhorn-manager/pull/2294, the share-manager controller always attempts a gRPC call to the share-manager pod to do a remount if the ShareManager CR has status.state == running.

https://github.com/longhorn/longhorn-manager/blob/3a66afaa7ec086f7f6ec6ebb10fd6797ac30830d/controller/share_manager_controller.go#L543-L545

However, we do not update the status.state of the ShareManager CR until AFTER this point in the reconcile loop.

https://github.com/longhorn/longhorn-manager/blob/3a66afaa7ec086f7f6ec6ebb10fd6797ac30830d/controller/share_manager_controller.go#L688-L711

So we reconcile indefinitely: we cannot do a remount because the share-manager pod is dead, and we cannot learn that the share-manager pod is dead because we error out earlier in the same reconcile.
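To make the ordering concrete, here is a minimal, self-contained Go sketch of the reconcile flow described above. It is illustrative only; the names and types are made up and it is not the real share_manager_controller.go code.

    package main

    import (
    	"errors"
    	"fmt"
    )

    type smState string

    const (
    	stateRunning smState = "running"
    	stateError   smState = "error"
    )

    // shareManager is a stand-in for the ShareManager CR plus its pod.
    type shareManager struct {
    	state    smState
    	podPhase string // "Succeeded" is what kubectl displays as Completed
    }

    // remountViaGRPC stands in for the gRPC remount call to the share-manager
    // pod. With NFS-Ganesha dead, the dial times out, so it always fails here.
    func remountViaGRPC(sm *shareManager) error {
    	return errors.New("rpc error: dial tcp: i/o timeout")
    }

    // reconcile mirrors the problematic ordering: the remount attempt is gated
    // only on the CR state, and the state is synced from the pod phase only
    // afterwards.
    func reconcile(sm *shareManager) error {
    	if sm.state == stateRunning {
    		if err := remountViaGRPC(sm); err != nil {
    			// Returning the error requeues the ShareManager, and the
    			// status update below is never reached: the deadlock.
    			return fmt.Errorf("failed to mount share manager pod: %w", err)
    		}
    	}

    	// Never reached while the remount above keeps failing.
    	if sm.podPhase != "Running" {
    		sm.state = stateError // would normally trigger pod cleanup/restart
    	}
    	return nil
    }

    func main() {
    	sm := &shareManager{state: stateRunning, podPhase: "Succeeded"}
    	for i := 0; i < 3; i++ {
    		err := reconcile(sm)
    		fmt.Printf("reconcile %d: state=%s err=%v\n", i, sm.state, err)
    	}
    }

Every iteration returns the mount error with state still "running", which is the behavior seen in the longhorn-manager logs in step 6 above.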

Workaround

https://github.com/longhorn/longhorn/issues/7183#issuecomment-1823715359

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Reactions: 2
  • Comments: 15 (6 by maintainers)

Most upvoted comments

A workaround is to manually update the status of the share manager resource to error.

kubectl -n longhorn-system patch lhsm pvc-b08581d7-f48c-4b27-8083-7f24b89ba4ea --type=merge --subresource status --patch 'status: {state: error}'
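Presumably this breaks the deadlock because, once status.state is no longer running, the controller stops attempting the remount gRPC call (see Additional context above) and can proceed to handle the failed share-manager pod.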