longhorn: [BUG] Restore from backup sometimes failed if having high frequent recurring backup job w/ retention

Describe the bug Restoring from a backup is failing.

To Reproduce Steps to reproduce the behavior:

  1. Deploy Longhorn and set the backup target to S3
  2. Create a volume “bak” in Longhorn
  3. From the Longhorn UI, create the PV “bak” and the PVC “myclaim” for volume “bak”
  4. Deploy pod “mypod1”
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
  namespace: default
spec:
  containers:
    - name: testfrontend
      image: nginx
      volumeMounts:
      - mountPath: "/usr/share/nginx/html/"
        name: mypod1
  volumes:
    - name: mypod1
      persistentVolumeClaim:
        claimName: myclaim
  5. In the mypod1 shell, execute sh -c “echo ‘Hello from Kubernetes storage’ > /usr/share/nginx/html/index.html”
  6. Create a recurring backup job: retain 3, concurrency 1, every minute
  7. Create a manual backup
  8. Restore the manual backup to volume v1
  9. The restored volume becomes faulted

Expected behavior The restore should never fail

Log longhorn-support-bundle_31ac3144-3ee9-4f59-86f5-3f9051e009b8_2021-09-22T23-17-01Z.zip

Environment:

  • Longhorn version: v1.2.x-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 17 (15 by maintainers)

Most upvoted comments

Verified on master-head 20220829

  • longhorn master-head (cccabbf)
  • longhorn-manager master-head (76f0fb4)

The test steps

  1. Install Longhorn master-head
  2. Create a volume “bak” in Longhorn
  3. From the Longhorn UI, create the PV (name: bak) and PVC (name: myclaim) for volume “bak”
  4. Deploy pod “mypod1” via the CLI: kubectl apply -f mypod1.yaml
mypod1.yaml
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
  namespace: default
spec:
  containers:
    - name: testfrontend
      image: nginx
      volumeMounts:
      - mountPath: "/usr/share/nginx/html/"
        name: mypod1
  volumes:
    - name: mypod1
      persistentVolumeClaim:
        claimName: myclaim
  5. Enter the mypod1 shell: kubectl exec --stdin --tty mypod1 -- /bin/bash
  6. Execute echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html
  7. Create a recurring backup job: retain 3, concurrency 1, every minute
  8. Create a manual backup
  9. Restore the manual backup to volume-v1
  10. Check whether the error is printed

Result Pass

After the restore, we did not observe the error symptoms. longhorn-support-bundle_a334ca16-9b8c-4e6f-822e-711b32e42777_2022-08-29T06-15-10Z.zip

An update on my current findings:

  1. After discussing with Chris: he ran the test with "echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html", then SSHed to the node and ran dd if=/dev/urandom of=/dev/longhorn/bak status=progress. This dd command overwrites the block device with random data, which destroys the filesystem, so /usr/share/nginx/html/index.html is gone. That is why the file is missing after the volume is restored.
  2. The type 2 lock acquisition failure occurs because the recurring job with retain=3 deletes the oldest backup every minute while the backup is being restored manually. Since the lock is on a per-volume basis, the recurring job's deletion of older backups cannot proceed during the restoration. As a result, the longhorn-manager Pods are flooded with the error message failed lock backupstore/volumes/17/22/bak/locks/lock-49c8c5b6feca4a36.lck type 2 acquisition.
  3. Regarding the error message "Failed to initiate the backup restore, will do revert and cleanup then.": longhorn-engine performs restoration per replica, so with the default of 3 replicas, each replica runs its restore simultaneously and separately. The operations may interleave as replica1 restoration -> deletion -> replica2 restoration -> replica3 restoration. In that case the replica2 and replica3 restorations fail because they cannot acquire the restore lock (type 1 lock).
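
To make situations 2 and 3 easier to follow, here is a toy model of the per-volume lock. It is only a sketch under stated assumptions, not Longhorn's actual backupstore code: the class `VolumeLocks`, the function `simulate`, and the rule "a lock is granted only when no lock of a different type is currently held" are all hypothetical simplifications, and it assumes replica1 has already released its lock before the deletion runs. The type numbers follow the comment above (type 1 for restore, type 2 for deletion).

```python
from dataclasses import dataclass, field

RESTORE_LOCK = 1   # "type 1" in the comment above
DELETION_LOCK = 2  # "type 2" in the comment above


@dataclass
class VolumeLocks:
    """Toy per-volume lock table (hypothetical, not Longhorn's implementation)."""
    held: list = field(default_factory=list)  # entries are (owner, lock_type)

    def acquire(self, owner, lock_type):
        # Assumed rule: refuse the lock if any lock of a different type is held.
        if any(t != lock_type for _, t in self.held):
            return False
        self.held.append((owner, lock_type))
        return True

    def release(self, owner):
        self.held = [(o, t) for o, t in self.held if o != owner]


def simulate(events):
    """Replay (action, owner, lock_type) events; return each acquire outcome."""
    locks = VolumeLocks()
    results = []
    for action, owner, lock_type in events:
        if action == "acquire":
            results.append((owner, locks.acquire(owner, lock_type)))
        else:
            locks.release(owner)
    return results


# Situation 2: a restore is in progress, so the recurring job's deletion is refused.
SITUATION_2 = [
    ("acquire", "replica1", RESTORE_LOCK),
    ("acquire", "recurring-job", DELETION_LOCK),
]

# Situation 3: the deletion slips in after replica1, so replica2 and replica3
# cannot take the restore lock.
SITUATION_3 = [
    ("acquire", "replica1", RESTORE_LOCK),
    ("release", "replica1", None),
    ("acquire", "recurring-job", DELETION_LOCK),
    ("acquire", "replica2", RESTORE_LOCK),
    ("acquire", "replica3", RESTORE_LOCK),
]

if __name__ == "__main__":
    for name, events in (("situation 2", SITUATION_2), ("situation 3", SITUATION_3)):
        print(name, simulate(events))
```

In this model the recurring job is the one refused in situation 2, while in situation 3 the refused acquisitions are the later replica restores, which matches the two different error messages described above.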

So I think Chris hit situation 1 and Kushboo hit situation 3. However, I have not been able to reproduce situation 3 locally.

I don't think so. There is a related issue, https://github.com/longhorn/longhorn/issues/3016, which was reported against version 1.1.2.