longhorn: [BUG] Restore from backup sometimes fails when there is a high-frequency recurring backup job with retention
Describe the bug Restoring from a backup is failing.
To Reproduce Steps to reproduce the behavior:
- Deploy Longhorn and set backup target as S3
- Create volume bak on longhorn
- Using volume bak, create PV "bak" and PVC "myclaim" from the Longhorn UI
- Deploy pod “mypod1”
kind: Pod
apiVersion: v1
metadata:
  name: mypod1
  namespace: default
spec:
  containers:
    - name: testfrontend
      image: nginx
      volumeMounts:
        - mountPath: "/usr/share/nginx/html/"
          name: mypod1
  volumes:
    - name: mypod1
      persistentVolumeClaim:
        claimName: myclaim
- In the mypod1 shell, execute sh -c "echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html"
- Create a recurring backup job: retain 3, concurrency 1, running every minute (a declarative sketch of such a job follows the steps below)
- Create manual backup
- Restore manual backup to volume v1
- Restored volume becomes faulted
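For reference, the recurring backup job from the steps above can also be created declaratively instead of through the UI. This is only a sketch, assuming the RecurringJob CRD available in Longhorn v1.2.x; the job name and group below are illustrative:

apiVersion: longhorn.io/v1beta1
kind: RecurringJob
metadata:
  name: backup-every-minute        # illustrative name
  namespace: longhorn-system
spec:
  task: backup                     # take backups, not just snapshots
  cron: "* * * * *"                # every minute
  retain: 3                        # keep the 3 most recent backups
  concurrency: 1                   # at most one job instance at a time
  groups:
    - default                      # apply to volumes in the "default" group
  labels: {}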
Expected behavior The restore should never fail
Log longhorn-support-bundle_31ac3144-3ee9-4f59-86f5-3f9051e009b8_2021-09-22T23-17-01Z.zip
Environment:
- Longhorn version: v1.2.x-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
About this issue
- State: closed
- Created 3 years ago
- Comments: 17 (15 by maintainers)
Verified on master-head 20220829
The test steps:
kubectl apply -f mypod1.yaml
kubectl exec --stdin --tty mypod1 -- /bin/bash
echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html

Result: Pass
After the restore, we didn't observe the error symptoms. longhorn-support-bundle_a334ca16-9b8c-4e6f-822e-711b32e42777_2022-08-29T06-15-10Z.zip
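One way to confirm that a restored volume is healthy and still contains the test data (a sketch only; the restored volume name v1 comes from the steps above, while the PVC myclaim-restore and pod mypod2 are hypothetical names):

# create a PV/PVC (e.g. myclaim-restore, via the Longhorn UI) for the restored volume v1,
# mount it in a pod mypod2 at the same path, then:
kubectl -n longhorn-system get volumes.longhorn.io v1          # robustness should not be "faulted"
kubectl exec -it mypod2 -- cat /usr/share/nginx/html/index.html
# expected output: Hello from Kubernetes storage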
Update on my current findings.

In the first case, execute "echo 'Hello from Kubernetes storage' > /usr/share/nginx/html/index.html", then ssh to the node and run dd if=/dev/urandom of=/dev/longhorn/bak status=progress. This dd command overwrites the whole device and destroys its filesystem, so /usr/share/nginx/html/index.html is gone. That is why the file /usr/share/nginx/html/index.html is missing after the volume is restored.

In the other case, the restore fails with:
failed lock backupstore/volumes/17/22/bak/locks/lock-49c8c5b6feca4a36.lck type 2 acquisition
Failed to initiate the backup restore, will do revert and cleanup then.
longhorn-engine performs restoration per replica, which means that with the default 3 replicas, each replica runs its restore simultaneously and separately. A possible sequence is: replica1 restoration -> deletion -> replica2 restoration -> replica3 restoration. This would cause the replica2 and replica3 restorations to fail, because they hit an error acquiring the restore lock (type 1 lock).

So I think Chris hits situation 1 (the first case above) and Kushboo hits situation 3 (the second case). However, I haven't reproduced situation 3 locally.
I think not. There is a related issue, https://github.com/longhorn/longhorn/issues/3016, which is at version 1.1.2.