longhorn: [BUG] After the node reconnects, the restoring volume sometimes becomes faulted.

Describe the bug (šŸ› if you encounter this issue)

In restore-volume-node-down, case 1 step 3:

I simulate a power-off of the volume-attached node by disconnecting its network with a script for 5, 100, 400, and 1200 seconds. After the 400-second disconnection, I found that the restoring volume sometimes became faulted after the node reconnected. I was able to reproduce this issue on v1.4.1, so I am creating this ticket to have the behavior checked and determine whether it is expected or not.

Screenshot_20230327_142347

To Reproduce

Steps to reproduce the behavior:

  1. Create a volume, PV, and PVC named vol-0
  2. Launch a pod with the Longhorn volume: kubectl apply -f pod_mount_1vol.yaml
pod_mount_1vol.yaml
kind: Pod
apiVersion: v1
metadata:
  name: ubuntu-mountvol0
  namespace: default
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
      - mountPath: "/data/"
        name: vol-0
  volumes:
    - name: vol-0
      persistentVolumeClaim:
        claimName: vol-0
  3. Write 1.2 GB of data to the volume and create a backup. kubectl exec -it ubuntu-mountvol0 -- /bin/bash -c "dd if=/dev/urandom of=/data/1.2g bs=1M count=1200 oflag=direct status=progress && md5sum /data/1.2g"
  4. Create a restore volume from the backup.
  5. Disconnect the network of the node that the volume is attached to for 400 seconds during the restore (see the disconnect sketch after this list).
  6. Wait for the Longhorn node to be marked down.
  7. Wait for the restore volume to be reattached and to resume restoring in the Degraded state.
  8. Wait for the restore to complete.
  9. Attach the volume and verify the restored data.
  10. Verify the volume works fine.
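The following is a minimal sketch of one way to simulate the power-off in step 5 by blocking the node's network for a fixed period. It assumes the node uses iptables and that the script is started locally on the node (for example via nohup over SSH) before connectivity is cut; the actual disconnect script used in the test run may differ.

#!/bin/bash
# Sketch: drop all non-loopback traffic for DURATION seconds, then restore it.
# Run as root on the node that the volume is attached to.
DURATION=${1:-400}   # 5, 100, 400, or 1200 in the test runs

# Insert DROP rules at the top of the chains, keeping loopback traffic alive.
iptables -I INPUT 1 -j DROP
iptables -I OUTPUT 1 -j DROP
iptables -I INPUT 1 -i lo -j ACCEPT
iptables -I OUTPUT 1 -o lo -j ACCEPT

sleep "$DURATION"

# Remove the rules added above to restore connectivity.
iptables -D INPUT -i lo -j ACCEPT
iptables -D OUTPUT -o lo -j ACCEPT
iptables -D INPUT -j DROP
iptables -D OUTPUT -j DROP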

Expected behavior

The volume restore should be completed.

Log or Support bundle

longhorn-support-bundle_c15ffec2-0b12-496f-b7c6-aa3f64af6826_2023-03-27T06-23-28Z.zip

Environment

  • Longhorn version: v1.3.3-rc2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version:
    • CPU per node: 4
    • Memory per node: 16 GB
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1

Additional context

https://confluence.suse.com/pages/viewpage.action?spaceKey=LON&title=Negative+Testing https://longhorn.github.io/longhorn-tests/manual/pre-release/backup-and-restore/restore-volume-node-down/

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

@roger-ryao could you help confirm the behavior when restoring (reverting) the existing volume's backup (snapshot)?

Hi @c3y1huang: I was not able to reproduce the issue when restoring (reverting) the existing volume's backup (snapshot).


Steps to reproduce the behavior:

  1. Create a volume, persistent volume (PV), and persistent volume claim (PVC) with the name vol-0
  2. Launch a pod with a Longhorn volume by applying the configuration in the file pod_mount_1vol.yaml using the command kubectl apply -f pod_mount_1vol.yaml.
pod_mount_1vol.yaml
kind: Pod
apiVersion: v1
metadata:
  name: ubuntu-mountvol0
  namespace: default
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      command: ["/bin/sleep", "3650d"]
      volumeMounts:
      - mountPath: "/data/"
        name: vol-0
  volumes:
    - name: vol-0
      persistentVolumeClaim:
        claimName: vol-0
  3. Write 1.2 GB of random data to the volume using the command kubectl exec -it ubuntu-mountvol0 -- /bin/bash -c 'dd if=/dev/urandom of=/data/1.2g bs=1M count=1200 oflag=direct status=progress && md5sum /data/1.2g', and create a backup of the volume.
  4. Write 12 MB of random data to the volume.
  5. Delete the pod to ensure that the volume is detached.
  6. Attach the volume to a node with maintenance mode enabled.
  7. Revert the volume's backup (snapshot) from step 3.
  8. Disconnect the network of the node that the volume is attached to for 400 seconds.
  9. Wait for the Longhorn node to be marked down.
  10. Wait for the volume to be reattached.
  11. Detach the volume.
  12. Launch a pod with the volume and verify that the restored data is present (see the checksum sketch after this list).
  13. Verify the volume works fine.
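A minimal sketch of the data check in steps 3 and 12, assuming the pod name ubuntu-mountvol0 and the file path /data/1.2g from the steps above: record the checksum right after writing the data, then compare it with the checksum computed on the reverted volume.

# Record the checksum of the written file before creating the backup (step 3).
kubectl exec ubuntu-mountvol0 -- md5sum /data/1.2g | awk '{print $1}' > /tmp/md5.before

# ... create the backup, revert, reattach the volume, and relaunch the pod ...

# Compare it with the checksum of the file on the reverted volume (step 12).
kubectl exec ubuntu-mountvol0 -- md5sum /data/1.2g | awk '{print $1}' > /tmp/md5.after
diff /tmp/md5.before /tmp/md5.after && echo "restored data intact"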

So we need to check why it's not salvaged in this case.

> k get volume -o yaml | grep restore
    restoreVolumeRecurringJob: ignored
      message: All replica restore failed and the volume became Faulted
      type: restore
    restoreInitiated: true
    restoreRequired: true

When the engine process fails to communicate with a replica, all replicas are marked ERR and the volume's RestoreRequired status cannot be updated to false, which blocks auto-salvage. By the current design, Longhorn relies on the RestoreRequired flag to determine whether the restore has completed, so the behavior described in this issue is expected.

On the other hand, restoring a backup into an existing volume calls snapshot revert, which is different from the volume backup restore process, so we don't expect to hit this issue with the backup (snapshot) revert.
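For reference, a quick way to inspect the flags mentioned above on the restoring volume (a sketch, assuming the default longhorn-system namespace and the volume name vol-0 from this issue):

kubectl -n longhorn-system get volumes.longhorn.io vol-0 -o yaml \
  | grep -E 'restoreRequired|restoreInitiated|robustness'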

2023-03-27T06:19:46.872651899Z [vol-0-e-d874c274] time="2023-03-27T06:19:46Z" level=error msg="Backend tcp://10.42.1.14:10000 monitoring failed, mark as ERR: ping failure: replica state closed"
2023-03-27T06:19:46.872666912Z time="2023-03-27T06:19:46Z" level=info msg="Setting replica tcp://10.42.1.14:10000 to mode ERR"
2023-03-27T06:19:46.872671904Z time="2023-03-27T06:19:46Z" level=info msg="Monitoring stopped tcp://10.42.1.14:10000"
2023-03-27T06:19:46.883844032Z [vol-0-e-d874c274] time="2023-03-27T06:19:46Z" level=error msg="Backend tcp://10.42.3.12:10000 monitoring failed, mark as ERR: ping failure: replica state closed"
2023-03-27T06:19:46.883857971Z time="2023-03-27T06:19:46Z" level=info msg="Setting replica tcp://10.42.3.12:10000 to mode ERR"

According to the Node disconnection test, this behavior is only expected if auto-salvage is disabled.
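For completeness, the current value of that setting can be checked like this (a sketch, assuming the default longhorn-system namespace and the setting resource name auto-salvage):

kubectl -n longhorn-system get settings.longhorn.io auto-salvage -o jsonpath='{.value}'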

So we need to check why it's not salvaged in this case.

Ref:

@c3y1huang could you update the test cases with the clarified info mentioned above? Then we can close this without any fixes, since this is expected.

https://github.com/longhorn/longhorn-tests/pull/1296

@roger-ryao one more request: is it reproducible in 1.3.2?

Hi @innobead: I was able to reproduce the issue not only on v1.3.2, but also on v1.4.0. Here are the support bundles for your reference.

Version  Support bundle
1.3.2    132-longhorn-support-bundle.zip
1.4.0    140-longhorn-support-bundle.zip

@roger-ryao did this only happen on master/1.4.x? How about 1.3?

I could reproduce this on 1.3.3-RC2 and 1.4.1, but I haven't tested it on master yet.

I was able to reproduce the issue on the master-head. supportbundle_ae12f40a-8c91-42e6-93e0-f5ee6f57b315_2023-03-28T04-28-23Z.zip

@roger-ryao was the volume able to be salvaged and turn back to a healthy state? And did the restored volume actually work as expected without data loss?

Hi @innobead: I didn't disable Automatic Salvage; it was enabled. However, the volume's robustness status remained faulted and did not return to a healthy state.
