democratic-csi: NFS mount in node manual mode ends up stuck when node reboots uncleanly.

(I’ll backfill more details in here, just jotting down the skeleton of the problem.)

Problem

I’ve encountered an issue with the node-manual driver – a few times at this point – where an unclean node reboot leaves NFS-backed PV/PVCs stuck in an inconsistent state, so that the PVCs cannot be mounted to their respective pods when the node comes back online.

Context

My setup involves use of the node-manual driver to mount specific NFS paths with dedicated PV/PVCs on application pods. This is all fine and good when things are working.

When one of the aforementioned node reboots happens, the pods are naturally still hanging around in the API. Eventually the node comes back and tries to bring these pods up again, as it is still set as being responsible for them.

When this occurs, describing the pod shows an error (which I’ve lost at this point, unfortunately) that contains the string “staging path is not mounted” and refers to a path on the node ending in globalmount. This seems to be related to the two-step process where a volume is staged on a node and then “published” so it can actually be used by a workload? Checking the node in question, the path it was referring to indeed didn’t exist, although the directory it referenced did.
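For anyone trying to confirm this state on a node, here is a minimal sketch of a check for whether a staging path is actually a mount point, based on the contents of /proc/self/mountinfo. The function names and the example paths are hypothetical and not part of democratic-csi; the only assumption about the format is that the fifth whitespace-separated field of each mountinfo line is the mount point, with special characters octal-escaped.

```python
import os
import re


def mount_targets(mountinfo_text):
    """Return the set of mount points found in /proc/self/mountinfo-style text."""
    targets = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        # The fifth field is the mount point; characters such as spaces
        # are written as octal escapes (for example \040).
        target = re.sub(
            r"\\([0-7]{3})",
            lambda m: chr(int(m.group(1), 8)),
            fields[4],
        )
        targets.add(target)
    return targets


def is_staged(staging_path, mountinfo_text):
    """True if staging_path (hypothetical name) appears as a mount point."""
    return os.path.normpath(staging_path) in mount_targets(mountinfo_text)
```

On the node itself this could be used by reading /proc/self/mountinfo into a string and passing it in together with the globalmount path from the error message. A False result for a directory that exists would match the “staging path is not mounted” symptom described above.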

I ended up deleting the PV/PVCs (which were stuck in terminating, until…) and then deleting the pods using the stuck volumes. That cleared everything out and allowed me to recreate the pods, which were then able to mount the volumes properly.

About this issue

  • State: open
  • Created 7 months ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

Didn’t specify attachRequired explicitly but it seems to be set to true by default?

toby@consigliera:~/src/catdad-science-infra$ k get csidriver
NAME                             ATTACHREQUIRED   PODINFOONMOUNT   STORAGECAPACITY   TOKENREQUESTS   REQUIRESREPUBLISH   MODES        AGE
driver.longhorn.io               true             true             false             <unset>         false               Persistent   124d
org.democratic-csi.node-manual   true             true             false             <unset>         false               Persistent   24h

@tobz have you tried the updated node-manual config example in https://github.com/democratic-csi/democratic-csi/issues/324#issuecomment-1963220446 ?

I’ve been using this without issue for a bit now, no issues with unclean reboots.

I hit this on the freenas-nfs driver (not manual) recently as well. It was similar to the node-manual one - the mountpoints were in the mount dir, no files.

So I can now confirm this issue is not exclusive to the node-manual driver.
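The “mountpoints were in the mount dir, no files” symptom can be looked for with a small scan. The sketch below is an assumption about how one might do it rather than anything democratic-csi provides: given a directory that holds per-volume mount directories (the root argument is hypothetical and depends on the kubelet layout), it lists subdirectories that exist, are empty, and are not mount points.

```python
import os


def empty_unmounted_dirs(root):
    """List immediate subdirectories of root that are empty and not mounted."""
    suspects = []
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if not os.path.isdir(path):
            continue  # only directories can be mount targets here
        if os.path.ismount(path):
            continue  # something is mounted, so this one looks healthy
        if not os.listdir(path):
            suspects.append(path)  # directory exists, nothing mounted, no files
    return suspects
```

An empty result does not prove the volumes are healthy (an NFS export can legitimately be empty), so this is only a quick first check before digging into the node logs.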