longhorn: [BUG] Unable to backup volume after NFS server IP change

Describe the bug (🐛 if you encounter this issue)

I stopped the NFS server VM instance on AWS EC2 and then started it again; the instance's public IP changed to a new one. After updating the backup target to the new IP, the volume could not be backed up: the volume icon turned gray, and the backup page showed an error.

To Reproduce

Steps to reproduce the behavior:

Prerequisite: set up an external NFS server for the backup store by performing the following steps:

  1. Install the nfs-kernel-server package using the following command: sudo zypper install nfs-kernel-server
  2. Enable and start the rpcbind.service and nfsserver.service services using the following commands:
systemctl enable rpcbind.service
systemctl start rpcbind.service
systemctl enable nfsserver.service
systemctl start nfsserver.service
  3. Create a directory to export and change its ownership to nobody:nogroup using the following commands:
mkdir /var/nfs
chown nobody:nogroup /var/nfs
  4. Edit the /etc/exports file and add the following line:
/var/nfs     *(rw,no_root_squash,no_subtree_check)
  5. Run the following command to export the directory: exportfs -a
  6. In the Longhorn UI, go to Setting -> Backup Target and set it to nfs://(NFS server IP):/var/nfs (see the URL-parsing sketch after the note below).

Note: To simulate network disconnection, download the network_down.sh script from the following link: https://github.com/longhorn/longhorn/files/4864127/network_down.sh.zip
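
For reference, the backup target string set in step 6 follows the shape nfs://<server>:/<export>. Below is a minimal, illustrative Go sketch of how such a URL can be split into the NFS server address and the exported path; this is not Longhorn's actual parser, and parseNFSTarget is a made-up helper:

package main

import (
	"fmt"
	"net/url"
)

// parseNFSTarget is a hypothetical helper used only to illustrate the
// expected nfs://<server>:/<export> shape of the backup target setting.
func parseNFSTarget(target string) (server, export string, err error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", "", fmt.Errorf("invalid backup target %q: %w", target, err)
	}
	if u.Scheme != "nfs" {
		return "", "", fmt.Errorf("unsupported scheme %q, expected nfs", u.Scheme)
	}
	// Hostname() drops the trailing ':' that separates the server from the export path.
	server = u.Hostname()
	if server == "" || u.Path == "" {
		return "", "", fmt.Errorf("backup target %q is missing the server or export path", target)
	}
	return server, u.Path, nil
}

func main() {
	server, export, err := parseNFSTarget("nfs://10.0.0.1:/var/nfs")
	fmt.Println(server, export, err) // 10.0.0.1 /var/nfs <nil>
}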

The test steps

  1. Prepared one NFS server.
  2. Set up the backup target.
  3. Created a volume and took a backup.
  4. Stopped the NFS server instance on AWS EC2 and started it again; the instance public IP changed to a new one.
  5. Updated the backup target.
  6. Took a backup again.
  7. The volume could not be backed up: the icon turned gray, and the backup page showed an error (screenshots: Screenshot_20230504_172848, Screenshot_20230504_172950).

Expected behavior

The volume should be backed up, and the backup page should not show an error.

Log or Support bundle

supportbundle_81183726-5b70-4292-a359-3d41e96c9847_2023-05-04T09-29-55Z.zip

Environment

  • Longhorn version: v1.4.x / master
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu
    • CPU per node: 4 cores
    • Memory per node: 8 GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1


About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 24 (22 by maintainers)

Most upvoted comments

I discussed the action items below with @derekbit.

  • When ensuring the mount point, check the specific mount point instead of scanning the whole mount table. https://github.com/longhorn/backupstore/pull/129
  • There are two places where a backup target mount point is created (in the replica and in longhorn-manager), so when reconciling a backup target setting change, the old mount points should also be cleaned up from longhorn-manager. (@ChanYiLin has already added the cleanup for when the backup target is unset, so we can probably consolidate both.) @derekbit will work on this; a rough cleanup sketch follows this comment.

This issue needs to be fixed in v1.4.2.
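
For the second action item, a rough best-effort cleanup could look like the sketch below. This is only an assumption of how longhorn-manager might drop the stale mount point when the backup target changes, not the actual implementation; cleanupOldMountPoint and the example path are made up for illustration:

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// cleanupOldMountPoint lazily detaches and removes a stale mount point
// directory left behind for a previous backup target.
func cleanupOldMountPoint(mountPoint string) {
	// MNT_DETACH performs a lazy unmount so an unreachable NFS server
	// cannot block this call.
	if err := unix.Unmount(mountPoint, unix.MNT_DETACH); err != nil {
		log.Printf("best-effort unmount of %s failed (it may already be unmounted): %v", mountPoint, err)
	}
	// Remove the now-unused directory; ignore "not exist" since this is best effort.
	if err := os.Remove(mountPoint); err != nil && !os.IsNotExist(err) {
		log.Printf("failed to remove stale mount point %s: %v", mountPoint, err)
	}
}

func main() {
	// Hypothetical mount point left behind for the previous backup target IP.
	cleanupOldMountPoint("/var/lib/longhorn-backupstore-mounts/old-target")
}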

Yeah, I think the cases are covered. Changing hard to soft mode makes error handling more complicated.


But it will solve potential performance issues, so it is still worth doing. We just need to ensure test coverage without regression.

Yes.

I still want to clarify a bit. When unmounting, we can just umount it no matter what type of mount it is, right?

EnsureMountPoint checks whether the mount point is valid and cleans it up if it is not. The purpose is to make sure the mount point's filesystem type is the expected one and that the mount point is accessible, so it does not just blindly unmount.

Yes, it is my fix. But "Don't use filesystem.GetMnt(); use os.statfs instead" is still necessary, because we need to check whether the filesystem type is the expected one.

I see, that makes sense. One question, though: right now we only have one backup target, so if we blindly clean up the mount point in this case without checking the filesystem type, will there be any side effects?

The fix in longhorn/backupstore@149c9aa3 is different from this scenario: it cleans up mount points when the backup target is reset, so it can blindly clean up without any checks.

For v1.4.x, the filesystem type check is not needed, because only NFS is supported. But for v1.5+, it is a must, because NFS and CIFS can use the same mount point.
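
A minimal sketch of that filesystem-type check, assuming golang.org/x/sys/unix: stat the single mount point and compare the Statfs magic number instead of walking the whole mount table. The helper name is made up for illustration; the magic constants are the standard Linux values from linux/magic.h:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// Standard Linux filesystem magic numbers (see linux/magic.h).
const (
	nfsSuperMagic  uint32 = 0x6969
	cifsSuperMagic uint32 = 0xFF534D42
)

// mountPointHasType reports whether mountPoint is backed by the filesystem
// identified by wantMagic, using a single statfs call on that path instead of
// scanning the whole mount table.
func mountPointHasType(mountPoint string, wantMagic uint32) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(mountPoint, &st); err != nil {
		return false, fmt.Errorf("failed to statfs %s: %w", mountPoint, err)
	}
	return uint32(st.Type) == wantMagic, nil
}

func main() {
	// Hypothetical mount point path; replace with the real backup store mount.
	ok, err := mountPointHasType("/var/lib/longhorn-backupstore-mounts/nfs", nfsSuperMagic)
	fmt.Println(ok, err)
}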

I will make two improvements.

I mean, can we just clean it up directly, since cleanupMount is best effort?

	// mnt, err := filesystem.GetMount(mountPoint)
	// if err != nil {
	// 	return true, errors.Wrapf(err, "failed to get mount for %v", mountPoint)
	// }

	// if strings.Contains(mnt.FilesystemType, Kind) {
	// 	return true, nil
	// }

	log.Warnf("Cleaning up the mount point %v for %v protocol", mountPoint, Kind)

	if mntErr := cleanupMount(mountPoint, mounter, log); mntErr != nil {
		return true, errors.Wrapf(mntErr, "failed to clean up mount point %v for %v protocol", mountPoint, Kind)
	}

	return false, nil

Yes, it is my fix. But "Don't use filesystem.GetMnt(); use os.statfs instead" is still necessary, because we need to check whether the filesystem type is the expected one.

This is due to the filesystem.GetMnt() call in https://github.com/longhorn/backupstore/blob/master/util/util.go#L316. GetMnt() iterates over all mount points in the mount table and collects information for each of them. If there is any dead mount point, the iteration hangs for a while, so the caller (backup ls) eventually runs into a timeout error.
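
Even a direct check on the specific mount point can still block if that mount itself is dead and hard-mounted, so one option is to guard the per-path call with a timeout. The sketch below is only illustrative; statfsWithTimeout is a made-up helper, not an existing Longhorn or backupstore API:

package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

// statfsWithTimeout stats a single path but returns early if the call blocks,
// which can happen when the path sits on a dead, hard-mounted NFS export.
func statfsWithTimeout(path string, timeout time.Duration) (unix.Statfs_t, error) {
	type result struct {
		st  unix.Statfs_t
		err error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks on send
	go func() {
		var st unix.Statfs_t
		err := unix.Statfs(path, &st)
		ch <- result{st, err}
	}()
	select {
	case r := <-ch:
		return r.st, r.err
	case <-time.After(timeout):
		// The goroutine may stay stuck in the kernel, but the caller is
		// unblocked and can report a clear timeout instead of hanging.
		return unix.Statfs_t{}, fmt.Errorf("timed out after %v while checking %s", timeout, path)
	}
}

func main() {
	_, err := statfsWithTimeout("/var/lib/longhorn-backupstore-mounts/nfs", 5*time.Second)
	fmt.Println(err)
}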