longhorn: [BUG] Unable to backup volume after NFS server IP change

Describe the bug (🐛 if you encounter this issue)

I stopped the NFS server VM instance on AWS EC2 and then started it again; the instance's public IP changed to a new one. After updating the backup target to the new IP, the volume could not be backed up: the volume icon turned gray, and the backup page showed an error.

To Reproduce

Steps to reproduce the behavior:

Prerequisite: set up an external NFS server for the backup store by performing the following steps:

  1. Install the nfs-kernel-server package using the following command: sudo zypper install nfs-kernel-server
  2. Enable and start the rpcbind.service and nfsserver.service services using the following commands:
systemctl enable rpcbind.service
systemctl start rpcbind.service
systemctl enable nfsserver.service
systemctl start nfsserver.service
  3. Create a directory to export and change its ownership to nobody:nogroup using the following commands:
mkdir /var/nfs
chown nobody:nogroup /var/nfs
  4. Edit the /etc/exports file and add the following line:
/var/nfs     *(rw,no_root_squash,no_subtree_check)
  5. Run the following command to export the directory: exportfs -a
  6. In the Longhorn UI, go to Setting -> Backup Target and set it to nfs://(NFS server IP):/var/nfs (see the URL-parsing sketch after the note below).

Note: To simulate network disconnection, download the network_down.sh script from the following link: https://github.com/longhorn/longhorn/files/4864127/network_down.sh.zip
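
For reference, the backup target string set in step 6 follows the shape nfs://<server>:/<export>. Below is a minimal, illustrative Go sketch of how such a URL can be split into the NFS server address and the exported path; this is not Longhorn's actual parser, and parseNFSTarget is a made-up helper:

package main

import (
	"fmt"
	"net/url"
)

// parseNFSTarget is a hypothetical helper used only to illustrate the
// expected nfs://<server>:/<export> shape of the backup target setting.
func parseNFSTarget(target string) (server, export string, err error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", "", fmt.Errorf("invalid backup target %q: %w", target, err)
	}
	if u.Scheme != "nfs" {
		return "", "", fmt.Errorf("unsupported scheme %q, expected nfs", u.Scheme)
	}
	// Hostname() drops the trailing ':' that separates the server from the export path.
	server = u.Hostname()
	if server == "" || u.Path == "" {
		return "", "", fmt.Errorf("backup target %q is missing the server or export path", target)
	}
	return server, u.Path, nil
}

func main() {
	server, export, err := parseNFSTarget("nfs://10.0.0.1:/var/nfs")
	fmt.Println(server, export, err) // 10.0.0.1 /var/nfs <nil>
}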

The test steps

  1. Prepared one NFS server.
  2. Set up the backup target.
  3. Created a volume and took a backup.
  4. Stopped the NFS server instance on AWS EC2 and started it again; the instance public IP changed to a new one.
  5. Updated the backup target.
  6. Took a backup again.
  7. The volume could not be backed up: the icon turned gray, and the backup page showed an error (screenshots: Screenshot_20230504_172848, Screenshot_20230504_172950).

Expected behavior

The volume should be backed up, and the backup page should not show an error.

Log or Support bundle

supportbundle_81183726-5b70-4292-a359-3d41e96c9847_2023-05-04T09-29-55Z.zip

Environment

  • Longhorn version: v1.4.x / master
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu
    • CPU per node: 4 cores
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1


About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 24 (22 by maintainers)

Most upvoted comments

I discussed the action items below with @derekbit.

  • When ensuring the mount point, check the specific mount point instead of scanning the whole mount table. https://github.com/longhorn/backupstore/pull/129
  • There are two places where a backup target mount point is created (in the replica and in longhorn-manager), so when reconciling a backup target setting change, the old mount points should also be cleaned up from longhorn-manager. (@ChanYiLin has already added the cleanup for when the backup target is unset, so we can probably consolidate both.) @derekbit will work on this; a rough cleanup sketch follows this comment.

This issue needs to be fixed in v1.4.2.
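
For the second action item, a rough best-effort cleanup could look like the sketch below. This is only an assumption of how longhorn-manager might drop the stale mount point when the backup target changes, not the actual implementation; cleanupOldMountPoint and the example path are made up for illustration:

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// cleanupOldMountPoint lazily detaches and removes a stale mount point
// directory left behind for a previous backup target.
func cleanupOldMountPoint(mountPoint string) {
	// MNT_DETACH performs a lazy unmount so an unreachable NFS server
	// cannot block this call.
	if err := unix.Unmount(mountPoint, unix.MNT_DETACH); err != nil {
		log.Printf("best-effort unmount of %s failed (it may already be unmounted): %v", mountPoint, err)
	}
	// Remove the now-unused directory; ignore "not exist" since this is best effort.
	if err := os.Remove(mountPoint); err != nil && !os.IsNotExist(err) {
		log.Printf("failed to remove stale mount point %s: %v", mountPoint, err)
	}
}

func main() {
	// Hypothetical mount point left behind for the previous backup target IP.
	cleanupOldMountPoint("/var/lib/longhorn-backupstore-mounts/old-target")
}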

Yeah, I think the cases are covered. Changing hard to soft mode makes error handling more complicated.


But it will solve potential performance issues, so it is still worth doing. We just need to ensure test coverage without regression.

Yes.

I still want to clarify a bit. When unmounting, we can just umount it no matter what type of mount it is, right?

EnsureMountPoint checks whether the mount point is valid and cleans it up if it is not. The purpose is to make sure the mount point's filesystem type is the expected one and that the mount point is accessible, so it does not just blindly unmount.

Yes, it is my fix. But "Don't use filesystem.GetMnt(); use os.statfs instead" is still necessary, because we need to check whether the filesystem type is the expected one.

I see, that makes sense. One question, though: right now we only have one backup target, so if we blindly clean up the mount point in this case without checking the filesystem type, will there be any side effects?

The fix in longhorn/backupstore@149c9aa3 is different from this scenario: it cleans up mount points when the backup target is reset, so it can blindly clean up without any checks.

For v1.4.x, the filesystem type check is not needed, because only NFS is supported. But for v1.5+, it is a must, because NFS and CIFS can use the same mount point.
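
A minimal sketch of that filesystem-type check, assuming golang.org/x/sys/unix: stat the single mount point and compare the Statfs magic number instead of walking the whole mount table. The helper name is made up for illustration; the magic constants are the standard Linux values from linux/magic.h:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// Standard Linux filesystem magic numbers (see linux/magic.h).
const (
	nfsSuperMagic  uint32 = 0x6969
	cifsSuperMagic uint32 = 0xFF534D42
)

// mountPointHasType reports whether mountPoint is backed by the filesystem
// identified by wantMagic, using a single statfs call on that path instead of
// scanning the whole mount table.
func mountPointHasType(mountPoint string, wantMagic uint32) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(mountPoint, &st); err != nil {
		return false, fmt.Errorf("failed to statfs %s: %w", mountPoint, err)
	}
	return uint32(st.Type) == wantMagic, nil
}

func main() {
	// Hypothetical mount point path; replace with the real backup store mount.
	ok, err := mountPointHasType("/var/lib/longhorn-backupstore-mounts/nfs", nfsSuperMagic)
	fmt.Println(ok, err)
}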

I will make two improvements.

I mean, can we just clean it up directly, since cleanupMount is best effort?

	// mnt, err := filesystem.GetMount(mountPoint)
	// if err != nil {
	// 	return true, errors.Wrapf(err, "failed to get mount for %v", mountPoint)
	// }

	// if strings.Contains(mnt.FilesystemType, Kind) {
	// 	return true, nil
	// }

	log.Warnf("Cleaning up the mount point %v for %v protocol", mountPoint, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v for %v protocol", mountPoint, Kind)
	}

	return false, nil

Yes, it is my fix. But "Don't use filesystem.GetMnt(); use os.statfs instead" is still necessary, because we need to check whether the filesystem type is the expected one.

This is due to the filesystem.GetMnt() call in https://github.com/longhorn/backupstore/blob/master/util/util.go#L316. GetMnt() iterates over all mount points in the mount table and collects information for each of them. If there is any dead mount point, the iteration hangs for a while, so the caller (backup ls) eventually runs into a timeout error.
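
Even a direct check on the specific mount point can still block if that mount itself is dead and hard-mounted, so one option is to guard the per-path call with a timeout. The sketch below is only illustrative; statfsWithTimeout is a made-up helper, not an existing Longhorn or backupstore API:

package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// statfsWithTimeout stats a single path but returns early if the call blocks,
// which can happen when the path sits on a dead, hard-mounted NFS export.
func statfsWithTimeout(path string, timeout time.Duration) (unix.Statfs_t, error) {
	type result struct {
		st  unix.Statfs_t
		err error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks on send
	go func() {
		var st unix.Statfs_t
		err := unix.Statfs(path, &st)
		ch <- result{st, err}
	}()
	select {
	case r := <-ch:
		return r.st, r.err
	case <-time.After(timeout):
		// The goroutine may stay stuck in the kernel, but the caller is
		// unblocked and can report a clear timeout instead of hanging.
		return unix.Statfs_t{}, fmt.Errorf("timed out after %v while checking %s", timeout, path)
	}
}

func main() {
	_, err := statfsWithTimeout("/var/lib/longhorn-backupstore-mounts/nfs", 5*time.Second)
	fmt.Println(err)
}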