longhorn: [BUG] Self-Hosted Minio-Backupstorage - timeout during Backup

Describe the bug I have Longhorn 1.1.0 on Rancher 2.5.1 and a self-hosted MinIO backupstore. When I create a backup in Longhorn, MinIO sometimes reports a timeout error. The “Snapshots and Backups” view in Longhorn shows that a backup was carried out, but when I click on “Backup” in the Longhorn UI, the last backup is not listed.

To Reproduce Create a manual or automatic backup to MinIO.

Expected behavior The backup is created.

Log (docker logs -f <minio-container>):

API: PutObject(bucket=k8s-cluster01, object=backupstore/volumes/79/dd/pvc-dca02b3d-8845-4e35-b4ba-7e004238d70d/blocks/2f/c1/2fc17d80430fbb443f3d6432f3d3565078acb49be1c6eff98a756888fffcc945.blk)
Time: 13:52:46 UTC 01/28/2021
DeploymentID: 60a01f5f-7567-48t6-a9f2-d86b7d8df3c6
RequestID: 165E6977AB3A804E
RemoteHost: XXX.XXX.XXX.XX
Host: minio.domain.de
UserAgent: aws-sdk-go/1.25.16 (go1.14.4; linux; amd64)
Error: Operation timed out (cmd.OperationTimedOut)
       3: cmd/fs-v1.go:1100:cmd.(*FSObjects).PutObject()
       2: cmd/object-handlers.go:1565:cmd.objectAPIHandlers.PutObjectHandler()
       1: net/http/server.go:2042:http.HandlerFunc.ServeHTTP()

Environment:

  • Longhorn version: 1.1.0
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 32
    • Memory per node: 256
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes: 1G
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 25

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 61 (34 by maintainers)

Most upvoted comments

We’re getting the same error with a similar issue while attempting to create disaster recovery volumes. Glad to see a fix is incoming in 1.1.2; hopefully it resolves this adjacent issue.

Edit: v1.1.2-rc1 works well at mitigating this issue in our dev environment

I tried 1.1.2-rc1 in our dev environment with pvc-dca02b3d … The cron job now completes successfully, without timeouts! Please only close the issue once the stable version of 1.1.2 is running in our production environment.

The fix for this 10s HTTP timeout issue, as well as the large-volume backup optimization (longhorn/longhorn-engine#626), will be released in v1.1.2. We can see if the fixes help you after the upgrade.
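
To make the failure mode concrete, here is a minimal, illustrative aws-sdk-go sketch of how a short client-side HTTP timeout surfaces as PutObject errors like the ones in the MinIO log above. This is not Longhorn's actual backup code; the endpoint, bucket, key, block size, and the place where the timeout is configured are placeholders/assumptions.

```go
// Illustrative only: a client-side HTTP timeout shorter than the time needed
// to upload a backup block over a slow or busy link aborts the PutObject
// request, which then shows up as a timed-out operation on the MinIO side.
// Credentials are taken from the usual AWS environment variables.
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:           aws.String("us-east-1"),
		Endpoint:         aws.String("https://minio.example.com"), // placeholder MinIO endpoint
		S3ForcePathStyle: aws.Bool(true),                          // MinIO uses path-style addressing
		// The timeout covers the whole request, including the body upload.
		// A fixed 10s here is what aborts slow-but-healthy uploads; raising it
		// (or using per-request contexts) avoids that.
		HTTPClient: &http.Client{Timeout: 10 * time.Second},
	}))

	svc := s3.New(sess)
	block := bytes.Repeat([]byte{0xab}, 2*1024*1024) // stand-in for a 2MiB backup block

	_, err := svc.PutObject(&s3.PutObjectInput{
		Bucket: aws.String("k8s-cluster01"),               // placeholder bucket
		Key:    aws.String("backupstore/.../example.blk"), // placeholder key
		Body:   bytes.NewReader(block),
	})
	if err != nil {
		log.Fatalf("PutObject failed (e.g. because the client timeout fired): %v", err)
	}
}
```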

@lucky4ever2 What is your volume size? I no longer have your support bundle, hence I don’t remember the details… The combination of the short HTTP timeout and the above issue may lead to the scenario you encountered.

https://github.com/longhorn/longhorn/issues/2218#issuecomment-775816985

[Updated - 06/16/2021] There are 2 issues that lead to backup creation timeouts for large volumes.


  1. One root cause is that Longhorn needs to retrieve all extents of the volume before backup creation. This extent retrieval is time-consuming! It also leads to the unexpectedly slow rebuilding issue #2507. cc @joshimoo: Preparing delta backups ==> CompareSnapshot() ==> preload ==> findExtents

  2. Another root cause is that the channel may need to handle hundreds of millions of entries when a volume contains a large amount of data, which degrades the processing speed a lot.

Based on my test, for a volume containing 600G of data (4000+ extents), running the preload function takes around 1 minute. Without the channel, the time is reduced to about 10s.
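
To illustrate the second root cause, here is a minimal standalone Go sketch (not Longhorn's code) that contrasts handing entries over one by one through an unbuffered channel with iterating a slice directly; the extent struct and the entry count are invented for the demonstration.

```go
// Minimal standalone sketch: streaming N entries one by one through a channel
// adds scheduling/synchronization overhead per entry, which dominates once N
// reaches the hundreds of millions; iterating the slice directly avoids that
// cost entirely.
package main

import (
	"fmt"
	"time"
)

type extent struct{ offset, length int64 }

func viaChannel(extents []extent) int64 {
	ch := make(chan extent) // unbuffered: one handoff per entry
	go func() {
		for _, e := range extents {
			ch <- e
		}
		close(ch)
	}()
	var total int64
	for e := range ch {
		total += e.length
	}
	return total
}

func direct(extents []extent) int64 {
	var total int64
	for _, e := range extents {
		total += e.length
	}
	return total
}

func main() {
	extents := make([]extent, 10_000_000) // scale up to watch the gap grow
	for i := range extents {
		extents[i] = extent{offset: int64(i) * 4096, length: 4096}
	}

	start := time.Now()
	viaChannel(extents)
	fmt.Println("via channel:", time.Since(start))

	start = time.Now()
	direct(extents)
	fmt.Println("direct:     ", time.Since(start))
}
```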

@lucky4ever2 Sorry for missing your comment.

This error is different from the issue we discussed above. The error you encountered is caused by multiple backup volume operations happening at the same time, e.g., simultaneous backup creation and deletion for the same volume. I think there are 2 common scenarios that can trigger it:

  1. There are intensive data writes & deletions on the volume, the bandwidth is limited, and the recurring backup job intervals for the volume are relatively short, so the previous backup operation is still in progress when the new one comes.
  2. Manual backup operations are issued while the recurring backup job for the volume is running.

If neither of the above 2 cases applies to you and this error is triggered often, please create a new GitHub ticket with the support bundle. I don’t think this is related to the timeout issue.

I installed 1.1.2 in our production environment. The backups have been running for 2 days without any timeouts. This solves the problem, thank you very much for that!

A note: We also need to address this problem (https://github.com/longhorn/longhorn/issues/2785#issuecomment-881106608) as part of this issue.

After discussing with @innobead, we plan to use https://github.com/longhorn/longhorn/issues/2807 to track the following fix.

Manual test for the 2nd cause in https://github.com/longhorn/longhorn/issues/2218#issuecomment-859611023 & PR longhorn/longhorn-engine#626:

  1. Create a volume with size 1000Gi (we can use a single-replica volume to reduce the cost)
  2. Write 500~1000Gi of continuous data to the volume (we can use dd to write data to the volume head file in the only replica’s directory; dd avoids generating too many extents in the volume)
  3. Create a backup via the UI. Without the fix, the backup creation call should take around 1 minute or time out. After upgrading the volume to the engine image containing the fix, the call should return in 10~20 seconds.

Do you mean extents? Typically we use a syscall in Golang to retrieve them.
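
For illustration only: Longhorn's engine retrieves extents via a fiemap-based helper, which is not reproduced here. A conceptually similar way to enumerate a sparse file's data extents on Linux is the lseek SEEK_DATA/SEEK_HOLE loop below (the file path is a placeholder); it shows why extent discovery is a syscall loop whose cost grows with the number of extents.

```go
// Conceptual sketch (Linux only): enumerate a sparse file's data extents by
// alternating lseek(SEEK_DATA) / lseek(SEEK_HOLE). Longhorn's engine uses a
// fiemap-based approach instead; the point is that extent discovery is a
// per-file syscall loop, which is what makes the preload step expensive for
// volumes with many extents.
package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open("/path/to/sparse-file") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	fd := int(f.Fd())
	var offset int64
	for {
		// Start of the next data extent at or after offset; ENXIO means no more data.
		dataStart, err := unix.Seek(fd, offset, unix.SEEK_DATA)
		if err != nil {
			break
		}
		// End of that data extent, i.e. the start of the next hole (or EOF).
		holeStart, err := unix.Seek(fd, dataStart, unix.SEEK_HOLE)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("extent: offset=%d length=%d\n", dataStart, holeStart-dataStart)
		offset = holeStart
	}
}
```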

Bumping the hardcoded GRPC timeout to 2 minutes solved the problem in our case. But the correct value for the timeout may also depend on the size of volumes that are being backed up.
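
The hardcoded timeout mentioned here lives in Longhorn's own manager/engine code and is not shown; as a generic sketch of the idea (placeholder dial target, no real Longhorn stubs), a per-call gRPC deadline could be scaled with the volume size instead of being fixed.

```go
// Generic illustration (placeholder target, no real Longhorn stubs): derive a
// per-call deadline for a long-running backup RPC from the volume size rather
// than using a fixed short timeout.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"
)

// deadlineFor grows the timeout with the volume size, with a floor of the
// 2 minutes that worked for the commenter above.
func deadlineFor(volumeSizeGiB int64) time.Duration {
	return 2*time.Minute + time.Duration(volumeSizeGiB)*time.Second
}

func main() {
	conn, err := grpc.Dial("engine.example:8500", grpc.WithInsecure()) // placeholder target
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), deadlineFor(1000))
	defer cancel()

	if dl, ok := ctx.Deadline(); ok {
		fmt.Println("backup RPC would run with deadline:", dl)
	}
	// With generated stubs the long-running call would be made here, e.g.:
	//   _, err = client.BackupCreate(ctx, req)
}
```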

@lucky4ever2 Currently, I have found 3 possible root causes besides the above conclusion. I am not sure which one is the culprit:

  1. The HTTP timeout config is too short: https://github.com/longhorn/longhorn-manager/pull/887
  2. The readiness probe issue. #2590 (I am not sure if this can be the culprit)
  3. Too many backups in a volume: https://github.com/longhorn/longhorn/issues/2543

Unfortunately, there is no workaround for these 3 possible causes… To eliminate the timeout error, we also plan to refactor the backup part (#1761) so that every backup-related operation can be made asynchronous. Hopefully we can fix them in v1.2.0.
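
As a rough illustration of what "asynced" backup operations mean (a generic pattern, not the actual #1761 refactor): the create call records the request and returns immediately, the slow work runs in a background goroutine, and callers poll a status instead of holding a request open long enough to time out.

```go
// Minimal sketch of the async-operation pattern (not Longhorn's refactor):
// Create records the request and returns at once; a background goroutine does
// the slow work; callers poll Status instead of keeping an HTTP/gRPC call
// open until it times out.
package main

import (
	"fmt"
	"sync"
	"time"
)

type BackupTracker struct {
	mu     sync.Mutex
	status map[string]string // backup name -> "InProgress" | "Completed" | "Error"
}

func NewBackupTracker() *BackupTracker {
	return &BackupTracker{status: make(map[string]string)}
}

// Create returns immediately; the slow upload happens in the background.
func (t *BackupTracker) Create(name string, upload func() error) {
	t.mu.Lock()
	t.status[name] = "InProgress"
	t.mu.Unlock()

	go func() {
		err := upload()
		t.mu.Lock()
		defer t.mu.Unlock()
		if err != nil {
			t.status[name] = "Error"
			return
		}
		t.status[name] = "Completed"
	}()
}

func (t *BackupTracker) Status(name string) string {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.status[name]
}

func main() {
	tracker := NewBackupTracker()
	tracker.Create("backup-1", func() error {
		time.Sleep(2 * time.Second) // stand-in for a slow upload
		return nil
	})

	for tracker.Status("backup-1") == "InProgress" {
		fmt.Println("still in progress...")
		time.Sleep(500 * time.Millisecond)
	}
	fmt.Println("final status:", tracker.Status("backup-1"))
}
```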

There will be fixes related to snapshot creation and backup listing. Hope this will help solve this issue.

We are planning to replicate this problem internally and do additional investigation. We added pagination support for backups and are adding parallelization for the list-object calls in a follow-on PR.
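
As a generic sketch of what parallelizing the list-object calls could look like (not the actual Longhorn implementation; the endpoint, bucket, prefix layout, and worker count are placeholders based on the log above):

```go
// Generic sketch of parallelizing S3 list calls: page through the top-level
// prefixes under backupstore/volumes/ first, then fan the per-prefix listings
// out over a bounded worker pool so a backupstore with many volumes lists
// faster.
package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{
		Region:           aws.String("us-east-1"),
		Endpoint:         aws.String("https://minio.example.com"), // placeholder
		S3ForcePathStyle: aws.Bool(true),
	}))
	svc := s3.New(sess)
	bucket := "k8s-cluster01" // placeholder

	// Collect the top-level prefixes under backupstore/volumes/ first.
	var prefixes []string
	err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
		Bucket:    aws.String(bucket),
		Prefix:    aws.String("backupstore/volumes/"),
		Delimiter: aws.String("/"),
	}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
		for _, p := range page.CommonPrefixes {
			prefixes = append(prefixes, aws.StringValue(p.Prefix))
		}
		return true
	})
	if err != nil {
		log.Fatal(err)
	}

	// List each prefix concurrently, bounded by a small worker pool.
	sem := make(chan struct{}, 8)
	var wg sync.WaitGroup
	for _, prefix := range prefixes {
		wg.Add(1)
		sem <- struct{}{}
		go func(prefix string) {
			defer wg.Done()
			defer func() { <-sem }()
			count := 0
			if err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
				Bucket: aws.String(bucket),
				Prefix: aws.String(prefix),
			}, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
				count += len(page.Contents)
				return true
			}); err != nil {
				log.Printf("listing %s: %v", prefix, err)
				return
			}
			fmt.Printf("%s: %d objects\n", prefix, count)
		}(prefix)
	}
	wg.Wait()
}
```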

@khushboo-rancher @meldafrawi Once you have capacity, can you try to reproduce this issue internally? A good time to reproduce and test it internally would be while testing the pagination feature implementation: https://github.com/longhorn/longhorn/issues/1904

A good test scenario would be to do a couple of cycles of the below before starting the deletion phase:

  • create a 200 GB volume.
  • write 200 GB of random data to the volume (either in filesystem or block mode; it makes no difference for the backup process)
  • dd if=/dev/urandom of=/mnt/test-vol/data
  • take a backup
  • take another backup; this will be a delta backup of size 0, but it depends on the blocks of the previous backup
  • write 100 GB of random data to the volume
  • take a backup; this will add 100 GB of new blocks to the backup
  • take another backup; this will be a delta backup of size 0, but it depends on the blocks of the previous backup
  • write 200 GB of random data to the volume
  • take a backup; the backupstore should now be around ~500 GB
  • you can repeat the above until ~1 TB, or start doing lots of smaller backups until you reach 1 TB.
  • then start deleting backups via the UI at random

During this whole process, watch for any failed operations / timeouts. If you manage to get a timeout during the backup creation call, leave the cluster in that state for further investigation.