longhorn: [BUG] Self-Hosted Minio-Backupstorage - timeout during Backup
Describe the bug I have Longhorn 1.1.0 on Rancher 2.5.1 and a self-hosted MinIO backupstore. When I run a backup in Longhorn, MinIO sometimes reports a timeout error. The “Snapshots and Backups” view in Longhorn shows that a backup was carried out, but when I click on “Backup” in the Longhorn UI, the last backup is not available.
To Reproduce Run a manual or automatic backup to MinIO.
Expected behavior The backup is created.
Log (docker logs -f <minio-container>):
API: PutObject(bucket=k8s-cluster01, object=backupstore/volumes/79/dd/pvc-dca02b3d-8845-4e35-b4ba-7e004238d70d/blocks/2f/c1/2fc17d80430fbb443f3d6432f3d3565078acb49be1c6eff98a756888fffcc945.blk)
Time: 13:52:46 UTC 01/28/2021
DeploymentID: 60a01f5f-7567-48t6-a9f2-d86b7d8df3c6
RequestID: 165E6977AB3A804E
RemoteHost: XXX.XXX.XXX.XX
Host: minio.domain.de
UserAgent: aws-sdk-go/1.25.16 (go1.14.4; linux; amd64)
Error: Operation timed out (cmd.OperationTimedOut)
3: cmd/fs-v1.go:1100:cmd.(*FSObjects).PutObject()
2: cmd/object-handlers.go:1565:cmd.objectAPIHandlers.PutObjectHandler()
1: net/http/server.go:2042:http.HandlerFunc.ServeHTTP()
Environment:
- Longhorn version: 1.1.0
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 3
- Node config
- OS type and version: Ubuntu 20.04
- CPU per node: 32
- Memory per node: 256
- Disk type (e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes: 1G
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 25
About this issue
- State: closed
- Created 3 years ago
- Comments: 61 (34 by maintainers)
We’re getting the same error in a similar situation, attempting to create disaster recovery volumes. Glad to see a fix is incoming in 1.1.2; hopefully it resolves this adjacent issue.
Edit: v1.1.2-rc1 works well at mitigating this issue in our dev environment.
I tried 1.1.2-rc1 in our dev environment with pvc-dca02b3d … The cronjob now completes successfully, without a timeout! Please only close the issue once the stable version of 1.1.2 is running in our production environment.
The fix for this 10s HTTP timeout issue, as well as the large volume backup optimization (longhorn/longhorn-engine#626), will be released in v1.1.2. After upgrading, we can see if the fixes help you.
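For context, a minimal Go sketch of how a short hard-coded HTTP client timeout produces this failure mode on slow uploads (illustrative only, not the actual Longhorn backupstore code; the uploadBlock helper and URL are made up):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// uploadBlock is a hypothetical helper that PUTs one backup block to the
// object store. With a 10s client timeout, any block that takes longer to
// upload (slow disk, slow network, large block) fails with a timeout error.
func uploadBlock(client *http.Client, url string, data []byte) error {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return err
	}
	resp, err := client.Do(req) // aborted once client.Timeout elapses
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// A short, hard-coded timeout like this is the kind of setting the fix relaxes.
	shortClient := &http.Client{Timeout: 10 * time.Second}
	_ = uploadBlock(shortClient, "https://minio.example.com/bucket/blocks/xx.blk", make([]byte, 2<<20))
}
```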
Possibly the culprit is the default 1-minute gRPC timeout? https://github.com/longhorn/longhorn-engine/blob/6f0b93f/pkg/replica/client/client.go#L469
https://github.com/longhorn/longhorn/issues/2218#issuecomment-775816985
[Updated - 06/16/2021] There are 2 issues that lead to backup creation timeouts for large volumes.
CompareSnapshot() ==> preload() ==> findExtents(). Based on my test, for a volume containing 600G of data (4000+ extents), running the preload function takes around 1 minute. Without the channel, the time is reduced to 10s.
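To illustrate the channel point (a minimal sketch with a made-up Extent type; this is not the longhorn-engine preload/findExtents code), the first variant hands every extent off through an unbuffered channel, which synchronizes producer and consumer per extent, while the second collects extents directly into a slice:

```go
package main

import "fmt"

// Extent is a hypothetical descriptor of one allocated range in a snapshot file.
type Extent struct {
	Offset int64
	Length int64
}

// findExtentsViaChannel streams extents one by one through an unbuffered
// channel, so the walker and the caller hand off every extent synchronously.
func findExtentsViaChannel(total int) []Extent {
	ch := make(chan Extent)
	go func() {
		defer close(ch)
		for i := 0; i < total; i++ {
			ch <- Extent{Offset: int64(i) * 4096, Length: 4096}
		}
	}()
	var extents []Extent
	for e := range ch {
		extents = append(extents, e)
	}
	return extents
}

// findExtentsDirect collects the extents into a slice in one pass,
// avoiding the per-extent synchronization of the channel hand-off.
func findExtentsDirect(total int) []Extent {
	extents := make([]Extent, 0, total)
	for i := 0; i < total; i++ {
		extents = append(extents, Extent{Offset: int64(i) * 4096, Length: 4096})
	}
	return extents
}

func main() {
	fmt.Println(len(findExtentsViaChannel(4000)), len(findExtentsDirect(4000)))
}
```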
@lucky4ever2 Sorry for missing your comment.
This error is different from the issue we discussed above. The error you encountered is caused by multiple backup volume operations happening at the same time, e.g., simultaneous backup creation and deletion for the same volume. I think there are 2 common scenarios that can trigger this:
If your case is neither of the above 2 scenarios and this error is triggered often, please create a new GitHub ticket with the support bundle. I don’t think this is related to the timeout issue.
I installed 1.1.2 in our production environment. The backups have been running for 2 days without any timeouts. This solves the problem, thank you very much!
After discussing with @innobead, we plan to use https://github.com/longhorn/longhorn/issues/2807 to track the following fix.
A note: We also need to address this problem https://github.com/longhorn/longhorn/issues/2785#issuecomment-881106608 as part of this issue.
Manual test for the 2nd cause in https://github.com/longhorn/longhorn/issues/2218#issuecomment-859611023 & PR longhorn/longhorn-engine#626:
Use dd to write data to the volume head file in the only replica directory. (Using dd can avoid generating too many extents in the volume.)
Do you mean extents? Typically we use a syscall in Golang to retrieve it.
@lucky4ever2 What is your volume size? I don’t have your support bundle anymore, hence I don’t remember the details… The combination of the short HTTP timeout and the above issue may lead to the scenario you encountered.
Bumping the hardcoded gRPC timeout to 2 minutes solved the problem in our case, but the correct value for the timeout may also depend on the size of the volumes being backed up.
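For reference, the kind of per-call gRPC deadline being discussed looks roughly like this in Go (a minimal sketch; the gRPC health client stands in for the real replica client, and the address and 2-minute value are illustrative):

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("replica-address:10000", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Per-call deadline: raising this from 1 minute to 2 minutes is the kind
	// of change described above. The call fails with DeadlineExceeded once
	// the context expires, regardless of how far the server has progressed.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	client := healthpb.NewHealthClient(conn)
	if _, err := client.Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
		log.Printf("call failed (possibly DeadlineExceeded): %v", err)
	}
}
```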
@lucky4ever2 Currently, I have found 3 possible root causes besides the above conclusion. I am not sure which one is the culprit: the readiness probe issue, #2590 (I am not sure if this can be the culprit). Unfortunately, there is no workaround for the 3 possible causes… To eliminate the timeout error, we also plan to refactor the backup part (#1761) so that every backup-related operation can be asynced. Hope that we can fix them in v1.2.0.
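As a rough illustration of what asyncing a backup operation could look like (a sketch with a hypothetical startBackup helper and simulated work; not the planned Longhorn design): the operation is launched in a goroutine and the caller polls a status record instead of holding a request open until the backup finishes.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// BackupStatus is a hypothetical status record the caller can poll.
type BackupStatus struct {
	mu       sync.Mutex
	Progress int
	Done     bool
}

// startBackup launches the (simulated) backup in the background and returns
// immediately, so no request has to stay open for the whole backup duration.
func startBackup() *BackupStatus {
	s := &BackupStatus{}
	go func() {
		for p := 0; p <= 100; p += 20 {
			time.Sleep(100 * time.Millisecond) // stands in for uploading blocks
			s.mu.Lock()
			s.Progress = p
			s.mu.Unlock()
		}
		s.mu.Lock()
		s.Done = true
		s.mu.Unlock()
	}()
	return s
}

func main() {
	status := startBackup()
	for {
		status.mu.Lock()
		done, p := status.Done, status.Progress
		status.mu.Unlock()
		fmt.Printf("progress: %d%%\n", p)
		if done {
			break
		}
		time.Sleep(150 * time.Millisecond)
	}
}
```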
There will be fixups related to snapshot creation and backup listing. Hope this helps solve the issue.
We are planning to replicate this problem internally and do additional investigation. We added pagination support for backups and are adding parallelization for the list-object calls in a follow-on PR.
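As a rough sketch of the parallelization idea (the listObjects helper and prefixes are hypothetical; this is not the actual implementation), each per-volume prefix can be listed in its own goroutine, for example with errgroup:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// listObjects is a hypothetical stand-in for a paginated ListObjects call
// against the backupstore for one volume prefix.
func listObjects(prefix string) ([]string, error) {
	return []string{prefix + "/backup_cfg_1.cfg"}, nil
}

func main() {
	prefixes := []string{
		"backupstore/volumes/79/dd/pvc-a",
		"backupstore/volumes/12/34/pvc-b",
	}

	results := make([][]string, len(prefixes))
	var g errgroup.Group
	for i, p := range prefixes {
		i, p := i, p // capture loop variables for the goroutine
		g.Go(func() error {
			objs, err := listObjects(p) // each prefix is listed concurrently
			if err != nil {
				return err
			}
			results[i] = objs
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("list failed:", err)
		return
	}
	for _, objs := range results {
		fmt.Println(objs)
	}
}
```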
@khushboo-rancher @meldafrawi Once you have capacity, can you try to reproduce this issue internally? A good time to reproduce and test it would be while testing the pagination feature implementation. https://github.com/longhorn/longhorn/issues/1904
A good test scenario would be: do a couple of cycles of the below before starting the deletion phase.
dd if=/dev/urandom of=/mnt/test-vol/data
During this whole process, notice any failed operations/timeouts. If you manage to get a timeout during the backup creation call, leave the cluster in that state for further investigation.