longhorn: [BUG] Self-Hosted Minio-Backupstorage - timeout during Backup

Describe the bug I have Longhorn 1.1.0 on Rancher 2.5.1 and a self-hosted MinIO backupstore. When I create a backup in Longhorn, MinIO sometimes reports a timeout error. The “Snapshots and Backups” view in Longhorn shows that a backup was carried out, but when I click on “Backup” in the Longhorn UI, the last backup is not listed.

To Reproduce Create a manual or automatic backup to MinIO.

Expected behavior The backup is created.

Log (docker logs -f <minio-container>):

API: PutObject(bucket=k8s-cluster01, object=backupstore/volumes/79/dd/pvc-dca02b3d-8845-4e35-b4ba-7e004238d70d/blocks/2f/c1/2fc17d80430fbb443f3d6432f3d3565078acb49be1c6eff98a756888fffcc945.blk)
Time: 13:52:46 UTC 01/28/2021
DeploymentID: 60a01f5f-7567-48t6-a9f2-d86b7d8df3c6
RequestID: 165E6977AB3A804E
RemoteHost: XXX.XXX.XXX.XX
Host: minio.domain.de
UserAgent: aws-sdk-go/1.25.16 (go1.14.4; linux; amd64)
Error: Operation timed out (cmd.OperationTimedOut)
       3: cmd/fs-v1.go:1100:cmd.(*FSObjects).PutObject()
       2: cmd/object-handlers.go:1565:cmd.objectAPIHandlers.PutObjectHandler()
       1: net/http/server.go:2042:http.HandlerFunc.ServeHTTP()

Environment:

  • Longhorn version: 1.1.0
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 32
    • Memory per node: 256
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 1G
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 25

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 61 (34 by maintainers)

Most upvoted comments

We’re getting the same error with a similar issue while attempting to create disaster recovery volumes. Glad to see a fix is incoming in 1.1.2; hopefully it resolves this adjacent issue.

Edit: v1.1.2-rc1 works well at mitigating this issue in our dev environment

I tried 1.1.2-rc1 in our dev environment with pvc-dca02b3d … The cron job now completes successfully, without timeouts! Please only close the issue once the stable version of 1.1.2 is running in our production environment.

The fix for this 10s HTTP timeout issue, as well as the large-volume backup optimization (longhorn/longhorn-engine#626), will be released in v1.1.2. We can see if the fixes help you after the upgrade.
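
To make the failure mode concrete, here is a minimal, illustrative aws-sdk-go sketch of how a short client-side HTTP timeout surfaces as PutObject errors like the ones in the MinIO log above. This is not Longhorn's actual backup code; the endpoint, bucket, key, block size, and the place where the timeout is configured are placeholders/assumptions.

```go
// Illustrative only: a client-side HTTP timeout shorter than the time needed
// to upload a backup block over a slow or busy link aborts the PutObject
// request, which then shows up as a timed-out operation on the MinIO side.
// Credentials are taken from the usual AWS environment variables.
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:           aws.String("us-east-1"),
		Endpoint:         aws.String("https://minio.example.com"), // placeholder MinIO endpoint
		S3ForcePathStyle: aws.Bool(true),                          // MinIO uses path-style addressing
		// The timeout covers the whole request, including the body upload.
		// A fixed 10s here is what aborts slow-but-healthy uploads; raising it
		// (or using per-request contexts) avoids that.
		HTTPClient: &http.Client{Timeout: 10 * time.Second},
	}))

	svc := s3.New(sess)
	block := bytes.Repeat([]byte{0xab}, 2*1024*1024) // stand-in for a 2MiB backup block

	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("k8s-cluster01"),               // placeholder bucket
		Key:    aws.String("backupstore/.../example.blk"), // placeholder key
		Body:   bytes.NewReader(block),
	})
	if err != nil {
		log.Fatalf("PutObject failed (e.g. because the client timeout fired): %v", err)
	}
}
```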

@lucky4ever2 What is your volume size? I no longer have your support bundle, hence I don’t remember the details… The combination of the short HTTP timeout and the above issue may lead to the scenario you encountered.

https://github.com/longhorn/longhorn/issues/2218#issuecomment-775816985

[Updated - 06/16/2021] There are 2 issues that lead to backup creation timeouts for large volumes.


  1. One root cause is that Longhorn needs to retrieve all extents of the volume before backup creation. This extent retrieval is time-consuming! It also leads to the unexpectedly slow rebuilding issue #2507. cc @joshimoo: Preparing delta backups ==> CompareSnapshot() ==> preload ==> findExtents

  2. Another root cause is that the channel may need to handle hundreds of millions of entries when a volume contains a large amount of data, which degrades the processing speed a lot.

Based on my test, for a volume containing 600G of data (4000+ extents), running the preload function takes around 1 minute. Without the channel, the time is reduced to about 10s.
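
To illustrate the second root cause, here is a minimal standalone Go sketch (not Longhorn's code) that contrasts handing entries over one by one through an unbuffered channel with iterating a slice directly; the extent struct and the entry count are invented for the demonstration.

```go
// Minimal standalone sketch: streaming N entries one by one through a channel
// adds scheduling/synchronization overhead per entry, which dominates once N
// reaches the hundreds of millions; iterating the slice directly avoids that
// cost entirely.
package main

import (
	"fmt"
	"time"
)

type extent struct{ offset, length int64 }

func viaChannel(extents []extent) int64 {
	ch := make(chan extent) // unbuffered: one handoff per entry
	go func() {
		for _, e := range extents {
			ch <- e
		}
		close(ch)
	}()
	var total int64
	for e := range ch {
		total += e.length
	}
	return total
}

func direct(extents []extent) int64 {
	var total int64
	for _, e := range extents {
		total += e.length
	}
	return total
}

func main() {
	extents := make([]extent, 10_000_000) // scale up to watch the gap grow
	for i := range extents {
		extents[i] = extent{offset: int64(i) * 4096, length: 4096}
	}

	start := time.Now()
	viaChannel(extents)
	fmt.Println("via channel:", time.Since(start))

	start = time.Now()
	direct(extents)
	fmt.Println("direct:     ", time.Since(start))
}
```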

@lucky4ever2 Sorry for missing your comment.

This error is different from the issue we discussed above. The error you encountered is caused by multiple backup volume operations happening at the same time, e.g., simultaneous backup creation and deletion for the same volume. I think there are 2 common scenarios that can trigger it:

  1. There are intensive data writes & deletions on the volume, the bandwidth is limited, and the recurring backup job intervals for the volume are relatively short, so the previous backup operation is still in progress when the new one comes.
  2. Manual backup operations are issued while the recurring backup job for the volume is running.

If neither of the above 2 cases applies to you and this error is triggered often, please create a new GitHub ticket with the support bundle. I don’t think this is related to the timeout issue.

I installed 1.1.2 in our production environment. The backups have been running for 2 days without any timeouts. This solves the problem, thank you very much for that!

A note: We also need to address this problem (https://github.com/longhorn/longhorn/issues/2785#issuecomment-881106608) as part of this issue.

After discussing with @innobead, we plan to use https://github.com/longhorn/longhorn/issues/2807 to track the following fix.

Manual test for the 2nd cause in https://github.com/longhorn/longhorn/issues/2218#issuecomment-859611023 & PR longhorn/longhorn-engine#626:

  1. Create a volume with size 1000Gi (we can use a single-replica volume to reduce the cost)
  2. Write 500~1000Gi of continuous data to the volume (we can use dd to write data to the volume head file in the only replica’s directory; dd avoids generating too many extents in the volume)
  3. Create a backup via the UI. Without the fix, the backup creation call should take around 1 minute or time out. After upgrading the volume to the engine image containing the fix, the call should return in 10~20 seconds.

Do you mean extents? Typically we use a syscall in Golang to retrieve them.
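
For illustration only: Longhorn's engine retrieves extents via a fiemap-based helper, which is not reproduced here. A conceptually similar way to enumerate a sparse file's data extents on Linux is the lseek SEEK_DATA/SEEK_HOLE loop below (the file path is a placeholder); it shows why extent discovery is a syscall loop whose cost grows with the number of extents.

```go
// Conceptual sketch (Linux only): enumerate a sparse file's data extents by
// alternating lseek(SEEK_DATA) / lseek(SEEK_HOLE). Longhorn's engine uses a
// fiemap-based approach instead; the point is that extent discovery is a
// per-file syscall loop, which is what makes the preload step expensive for
// volumes with many extents.
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/path/to/sparse-file") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fd := int(f.Fd())
	var offset int64
	for {
		// Start of the next data extent at or after offset; ENXIO means no more data.
		dataStart, err := unix.Seek(fd, offset, unix.SEEK_DATA)
		if err != nil {
			break
		}
		// End of that data extent, i.e. the start of the next hole (or EOF).
		holeStart, err := unix.Seek(fd, dataStart, unix.SEEK_HOLE)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("extent: offset=%d length=%d\n", dataStart, holeStart-dataStart)
		offset = holeStart
	}
}
```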

Bumping the hardcoded GRPC timeout to 2 minutes solved the problem in our case. But the correct value for the timeout may also depend on the size of volumes that are being backed up.
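
The hardcoded timeout mentioned here lives in Longhorn's own manager/engine code and is not shown; as a generic sketch of the idea (placeholder dial target, no real Longhorn stubs), a per-call gRPC deadline could be scaled with the volume size instead of being fixed.

```go
// Generic illustration (placeholder target, no real Longhorn stubs): derive a
// per-call deadline for a long-running backup RPC from the volume size rather
// than using a fixed short timeout.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
)

// deadlineFor grows the timeout with the volume size, with a floor of the
// 2 minutes that worked for the commenter above.
func deadlineFor(volumeSizeGiB int64) time.Duration {
	return 2*time.Minute + time.Duration(volumeSizeGiB)*time.Second
}

func main() {
	conn, err := grpc.Dial("engine.example:8500", grpc.WithInsecure()) // placeholder target
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), deadlineFor(1000))
	defer cancel()

	if dl, ok := ctx.Deadline(); ok {
		fmt.Println("backup RPC would run with deadline:", dl)
	}
	// With generated stubs the long-running call would be made here, e.g.:
	//   _, err = client.BackupCreate(ctx, req)
}
```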

@lucky4ever2 Currently, I have found 3 possible root causes besides the above conclusion. I am not sure which one is the culprit:

  1. The HTTP timeout config is too short: https://github.com/longhorn/longhorn-manager/pull/887
  2. The readiness probe issue. #2590 (I am not sure if this can be the culprit)
  3. Too many backups in a volume: https://github.com/longhorn/longhorn/issues/2543

Unfortunately, there is no workaround for these 3 possible causes… To eliminate the timeout error, we also plan to refactor the backup part (#1761) so that every backup-related operation can be made asynchronous. Hopefully we can fix them in v1.2.0.
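
As a rough illustration of what "asynced" backup operations mean (a generic pattern, not the actual #1761 refactor): the create call records the request and returns immediately, the slow work runs in a background goroutine, and callers poll a status instead of holding a request open long enough to time out.

```go
// Minimal sketch of the async-operation pattern (not Longhorn's refactor):
// Create records the request and returns at once; a background goroutine does
// the slow work; callers poll Status instead of keeping an HTTP/gRPC call
// open until it times out.
package main

import (
	"fmt"
	"sync"
	"time"
)

type BackupTracker struct {
	mu     sync.Mutex
	status map[string]string // backup name -> "InProgress" | "Completed" | "Error"
}

func NewBackupTracker() *BackupTracker {
	return &BackupTracker{status: make(map[string]string)}
}

// Create returns immediately; the slow upload happens in the background.
func (t *BackupTracker) Create(name string, upload func() error) {
	t.mu.Lock()
	t.status[name] = "InProgress"
	t.mu.Unlock()

	go func() {
		err := upload()
		t.mu.Lock()
		defer t.mu.Unlock()
		if err != nil {
			t.status[name] = "Error"
			return
		}
		t.status[name] = "Completed"
	}()
}

func (t *BackupTracker) Status(name string) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.status[name]
}

func main() {
	tracker := NewBackupTracker()
	tracker.Create("backup-1", func() error {
		time.Sleep(2 * time.Second) // stand-in for a slow upload
		return nil
	})

	for tracker.Status("backup-1") == "InProgress" {
		fmt.Println("still in progress...")
		time.Sleep(500 * time.Millisecond)
	}
	fmt.Println("final status:", tracker.Status("backup-1"))
}
```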

There will be fixes related to snapshot creation and backup listing. Hope this will help solve this issue.

We are planning to replicate this problem internally and do additional investigation. We added pagination support for backups and are adding parallelization for the list-object calls in a follow-on PR.
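
As a generic sketch of what parallelizing the list-object calls could look like (not the actual Longhorn implementation; the endpoint, bucket, prefix layout, and worker count are placeholders based on the log above):

```go
// Generic sketch of parallelizing S3 list calls: page through the top-level
// prefixes under backupstore/volumes/ first, then fan the per-prefix listings
// out over a bounded worker pool so a backupstore with many volumes lists
// faster.
package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:           aws.String("us-east-1"),
		Endpoint:         aws.String("https://minio.example.com"), // placeholder
		S3ForcePathStyle: aws.Bool(true),
	}))
	svc := s3.New(sess)
	bucket := "k8s-cluster01" // placeholder

	// Collect the top-level prefixes under backupstore/volumes/ first.
	var prefixes []string
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket:    aws.String(bucket),
		Prefix:    aws.String("backupstore/volumes/"),
		Delimiter: aws.String("/"),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, p := range page.CommonPrefixes {
			prefixes = append(prefixes, aws.StringValue(p.Prefix))
		}
		return true
	})
	if err != nil {
		log.Fatal(err)
	}

	// List each prefix concurrently, bounded by a small worker pool.
	sem := make(chan struct{}, 8)
	var wg sync.WaitGroup
	for _, prefix := range prefixes {
		wg.Add(1)
		sem <- struct{}{}
		go func(prefix string) {
			defer wg.Done()
			defer func() { <-sem }()
			count := 0
			if err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
				Bucket: aws.String(bucket),
				Prefix: aws.String(prefix),
			}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
				count += len(page.Contents)
				return true
			}); err != nil {
				log.Printf("listing %s: %v", prefix, err)
				return
			}
			fmt.Printf("%s: %d objects\n", prefix, count)
		}(prefix)
	}
	wg.Wait()
}
```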

@khushboo-rancher @meldafrawi Once you have capacity, can you try to reproduce this issue internally? A good time to reproduce and test it internally would be while testing the pagination feature implementation: https://github.com/longhorn/longhorn/issues/1904

A good test scenario would be to do a couple of cycles of the below before starting the deletion phase:

  • create a 200 GB volume.
  • write 200 GB of random data to the volume (either in filesystem or block mode; it makes no difference for the backup process)
  • dd if=/dev/urandom of=/mnt/test-vol/data
  • take a backup
  • take another backup; this will be a delta backup of size 0, but it depends on the blocks of the previous backup
  • write 100 GB of random data to the volume
  • take a backup; this will add 100 GB of new blocks to the backup
  • take another backup; this will be a delta backup of size 0, but it depends on the blocks of the previous backup
  • write 200 GB of random data to the volume
  • take a backup; the backupstore should now be around ~500 GB
  • you can repeat the above until ~1 TB, or start doing lots of smaller backups until you reach 1 TB.
  • then start deleting backups via the UI at random

During this whole process, watch for any failed operations / timeouts. If you manage to get a timeout during the backup creation call, leave the cluster in that state for further investigation.