longhorn: [BUG] Unable to backup volume after NFS server IP change
Describe the bug (🐛 if you encounter this issue)
I stopped the NFS server VM instance on AWS EC2 and started it again, and the instance's public IP changed to a new one. After changing the backup target to the new IP, the volume couldn't be backed up: the volume icon turned gray, and the backup page showed an error.
To Reproduce
Steps to reproduce the behavior:
Pre-requisite
To set up an external NFS server as the backup store, perform the following steps:
- Install the nfs-kernel-server package using the following command:
sudo zypper install nfs-kernel-server
- Enable and start the rpcbind.service and nfsserver.service services using the following commands:
systemctl enable rpcbind.service
systemctl start rpcbind.service
systemctl enable nfsserver.service
systemctl start nfsserver.service
- Create a directory to export and change its ownership to nobody:nogroup using the following commands:
mkdir /var/nfs
chown nobody:nogroup /var/nfs
- Edit the /etc/exports file and add the following line:
/var/nfs *(rw,no_root_squash,no_subtree_check)
- Run the following command to export the directory:
exportfs -a
- In the Longhorn UI, go to Setting -> Backup Target and set it to nfs://(NFS server IP):/var/nfs (see the sketch below).
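For illustration only (this is not Longhorn's actual parsing code): a minimal Go sketch showing that the NFS server IP is embedded in the backup target URL itself, which is why the target has to be updated whenever the server's public IP changes.

```go
// Hypothetical sketch: split an nfs:// backup target into the NFS server
// address and export path that would be used as the mount source.
package main

import (
	"fmt"
	"log"
	"net/url"
)

func main() {
	target := "nfs://203.0.113.10:/var/nfs" // example backup target; the IP is illustrative

	u, err := url.Parse(target)
	if err != nil || u.Scheme != "nfs" {
		log.Fatalf("invalid NFS backup target %q: %v", target, err)
	}

	// The NFS server address and export path are both encoded in the target,
	// so a changed public IP means a new backup target and a new mount.
	fmt.Printf("server: %s, export: %s\n", u.Hostname(), u.Path)
	fmt.Printf("mount source: %s:%s\n", u.Hostname(), u.Path)
}
```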
Note: To simulate network disconnection, download the network_down.sh script from the following link: https://github.com/longhorn/longhorn/files/4864127/network_down.sh.zip
The test steps
- I prepared one NFS server.
- Set up the backup target.
- Created a volume and took a backup.
- Stopped the NFS server instance on AWS EC2 and started it again; the instance's public IP changed to a new one.
- Updated the backup target to the new IP.
- Tried to do a backup again.
- The volume couldn't be backed up: the volume icon turned gray, and the backup page showed an error.

Expected behavior
The volume should be backed up, and the backup page should not show an error.
Log or Support bundle
supportbundle_81183726-5b70-4292-a359-3d41e96c9847_2023-05-04T09-29-55Z.zip
Environment
- Longhorn version: v1.4.x/master
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: Ubuntu
- CPU per node: 4 core
- Memory per node: 8 GB
- Disk type(e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster: 1
Additional context
Add any other context about the problem here.
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 24 (22 by maintainers)
I discussed the action items below with @derekbit.
This issue needs to be fixed in v1.4.2.
Yeah, I think the cases are covered. Changing the NFS mount from `hard` to `soft` mode makes error handling more complicated, but it will solve potential performance issues, so it is still worth it. We just need to ensure the coverage without regression.
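For context, a minimal sketch of what mounting the backup target in soft mode could look like, assuming the k8s.io/mount-utils mounter; the mount point path and option values are illustrative, not Longhorn's actual defaults.

```go
// Sketch: mount the NFS backup target with "soft" so that I/O against a dead
// server eventually fails with an error (after timeo/retrans are exhausted)
// instead of hanging indefinitely as the default "hard" mode does.
package main

import (
	"log"

	mount "k8s.io/mount-utils"
)

func main() {
	mounter := mount.New("")

	source := "203.0.113.10:/var/nfs"                                   // NFS server IP + export
	target := "/var/lib/longhorn-backupstore-mounts/nfs"                // hypothetical mount point
	options := []string{"soft", "timeo=30", "retrans=3", "nfsvers=4.1"} // illustrative values

	if err := mounter.Mount(source, target, "nfs", options); err != nil {
		log.Fatalf("failed to mount %s on %s: %v", source, target, err)
	}
	log.Printf("mounted %s on %s", source, target)
}
```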
Yes.
`EnsureMountPoint` checks if the mount point is valid and, if it is invalid, cleans it up. The purpose is to make sure the mount point's filesystem type is the expected one and that the mount point is accessible, so it does not just blindly unmount. For v1.4.x, the filesystem type check is not needed, because only `nfs` is supported. For v1.5+ it is a must, because `nfs` and `cifs` use the same mount point.
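A minimal sketch of such a filesystem type check, using statfs via golang.org/x/sys/unix; the helper name and mount path are hypothetical, not the actual EnsureMountPoint implementation.

```go
// Sketch: confirm that an existing mount point is of the expected filesystem
// type (nfs vs cifs) before reusing it; anything else should be cleaned up.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// checkMountType is a hypothetical helper, not backupstore's actual code.
func checkMountType(mountPoint string, fsType string) error {
	var st unix.Statfs_t
	if err := unix.Statfs(mountPoint, &st); err != nil {
		return fmt.Errorf("mount point %s is not accessible: %v", mountPoint, err)
	}

	switch fsType {
	case "nfs":
		if int64(st.Type) != unix.NFS_SUPER_MAGIC {
			return fmt.Errorf("%s is not an nfs mount (type 0x%x)", mountPoint, st.Type)
		}
	case "cifs":
		if int64(st.Type) != unix.CIFS_MAGIC_NUMBER {
			return fmt.Errorf("%s is not a cifs mount (type 0x%x)", mountPoint, st.Type)
		}
	default:
		return fmt.Errorf("unsupported filesystem type %s", fsType)
	}
	return nil
}

func main() {
	fmt.Println(checkMountType("/var/lib/longhorn-backupstore-mounts/nfs", "nfs"))
}
```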
Yes, it is my fix. But "Don't use `filesystem.GetMnt()`. Use `os.statfs` instead" is still necessary, because we need to check whether the filesystem type is the expected one.
This is due to `filesystem.GetMnt()` in https://github.com/longhorn/backupstore/blob/master/util/util.go#L316. `GetMnt()` iterates over all mount points in the mount table and collects their information. If any mount point is dead, the iteration hangs for a while, so the caller (`backup ls`) eventually runs into a timeout error.
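To illustrate the difference, a hedged sketch that probes only the backup target's mount point via statfs, bounded by a timeout, instead of walking every entry in the mount table; the helper name and path are hypothetical.

```go
// Sketch: statfs a single mount point with a timeout. Iterating the whole
// mount table stalls on any dead NFS mount, which is what pushes the caller
// (e.g. "backup ls") into a timeout; probing just the relevant path, and
// bounding even that probe, keeps the failure fast and explicit.
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// probeMountPoint is a hypothetical helper, not backupstore's actual code.
func probeMountPoint(mountPoint string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		var st unix.Statfs_t
		done <- unix.Statfs(mountPoint, &st)
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("statfs on %s timed out after %s (dead mount?)", mountPoint, timeout)
	}
}

func main() {
	fmt.Println(probeMountPoint("/var/lib/longhorn-backupstore-mounts/nfs", 5*time.Second))
}
```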