longhorn: [BUG] Restore from backup sometimes fails when there is a high-frequency recurring backup job with retention
Describe the bug Restoring from a backup is failing.
To Reproduce Steps to reproduce the behavior:
- Deploy Longhorn and set backup target as S3
- Create volume bak on longhorn
- Using volume bak, create PV "bak" and PVC "myclaim" from the Longhorn UI
- Deploy pod “mypod1”
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
  namespace: default
spec:
  containers:
    - name: testfrontend
      image: nginx
      volumeMounts:
        - mountPath: "/usr/share/nginx/html/"
          name: mypod1
  volumes:
    - name: mypod1
      persistentVolumeClaim:
        claimName: myclaim
- In the mypod1 shell, execute sh -c "echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html"
- Create a recurring backup job: retain 3, concurrency 1, running every minute (a declarative sketch of such a job follows the steps below)
- Create manual backup
- Restore manual backup to volume v1
- Restored volume becomes faulted
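For reference, the recurring backup job from the steps above can also be created declaratively instead of through the UI. This is only a sketch, assuming the RecurringJob CRD available in Longhorn v1.2.x; the job name and group below are illustrative:

apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: backup-every-minute        # illustrative name
  namespace: longhorn-system
spec:
  task: backup                     # take backups, not just snapshots
  cron: "* * * * *"                # every minute
  retain: 3                        # keep the 3 most recent backups
  concurrency: 1                   # at most one job instance at a time
  groups:
    - default                      # apply to volumes in the "default" group
  labels: {}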
Expected behavior The restore should never fail
Log longhorn-support-bundle_31ac3144-3ee9-4f59-86f5-3f9051e009b8_2021-09-22T23-17-01Z.zip
Environment:
- Longhorn version: v1.2.x-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (15 by maintainers)
Verified on master-head 20220829
The test steps:
kubectl apply -f mypod1.yaml
kubectl exec --stdin --tty mypod1 -- /bin/bash
echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html

Result: Pass
After the restore, we didn't observe the error symptoms. longhorn-support-bundle_a334ca16-9b8c-4e6f-822e-711b32e42777_2022-08-29T06-15-10Z.zip
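One way to confirm that a restored volume is healthy and still contains the test data (a sketch only; the restored volume name v1 comes from the steps above, while the PVC myclaim-restore and pod mypod2 are hypothetical names):

# create a PV/PVC (e.g. myclaim-restore, via the Longhorn UI) for the restored volume v1,
# mount it in a pod mypod2 at the same path, then:
kubectl -n longhorn-system get volumes.longhorn.io v1          # robustness should not be "faulted"
kubectl exec -it mypod2 -- cat /usr/share/nginx/html/index.html
# expected output: Hello from Kubernetes storage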
Update on my current findings.

In the first case, execute "echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html", then ssh to the node and run dd if=/dev/urandom of=/dev/longhorn/bak status=progress. This dd command overwrites the whole device and destroys its filesystem, so /usr/share/nginx/html/index.html is gone. That is why the file /usr/share/nginx/html/index.html is missing after the volume is restored.

In the other case, the restore fails with:
failed lock backupstore/volumes/17/22/bak/locks/lock-49c8c5b6feca4a36.lck type 2 acquisition
Failed to initiate the backup restore, will do revert and cleanup then.
longhorn-engine performs restoration per replica, which means that with the default 3 replicas, each replica runs its restore simultaneously and separately. A possible sequence is: replica1 restoration -> deletion -> replica2 restoration -> replica3 restoration. This would cause the replica2 and replica3 restorations to fail, because they hit an error acquiring the restore lock (type 1 lock).

So I think Chris hits situation 1 (the first case above) and Kushboo hits situation 3 (the second case). However, I haven't reproduced situation 3 locally.
I think not. There is a related issue, https://github.com/longhorn/longhorn/issues/3016, which is at version 1.1.2.