longhorn: [QUESTION] Create Longhorn backup failed

Question

Please use https://github.com/longhorn/longhorn/discussions to ask questions.

Environment

  • Longhorn version: v1.3.0
  • Kubernetes version: v1.19.16
  • Node config
    • OS type and version: Ubuntu 20.04
    • CPU per node: 16
    • Memory per node: 32 GB
    • Disk type: SSD
    • Network bandwidth and latency between the nodes: 10 Gb/s
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Cloud

Additional context

When I create a backup from the Longhorn UI, it fails after reaching about 3% with an error like this: proxyServer=10.42.133.36:8501 destination=10.42.76.202:10001: failed to get backup-0fa2608f97b74fe7 backup status: rpc error: code = Unknown desc = failed to get backup-0fa2608f97b74fe7 backup status on unknown replica tcp://10.42.76.236:10015

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

longhorn-support-bundle@suse.com

support bundle

@tommy04062019 You can refer to https://www.suse.com/support/kb/doc/?id=000020145.

BTW, you can reproduce this issue first and then generate the support bundle. Then, we can check the log details.

@derekbit Thank you, I sent the support bundle to longhorn-support-bundle@suse.com. Please have a look.

@derekbit It is easy to reproduce because the backup fails every time I create one.

  • The backup target is an S3 endpoint (MinIO).
  • I have two volumes that need to be backed up: one holds about 2 GB of data and the other about 28 GB.
  • The backup of the 2 GB volume completed without error, but the backup of the 28 GB volume failed as described above.

One more thing I don’t understand: the volume becomes degraded each time I make a backup, and it takes time to rebuild the replica 😞 Below is the backup log:
time="2022-08-14T14:25:09Z" level=debug msg="Setting allow-recurring-job-while-volume-detached is false"
time="2022-08-14T14:25:09Z" level=debug msg="Get volumes from label recurring-job.longhorn.io/c-wlbo6w=enabled"
time="2022-08-14T14:25:09Z" level=info msg="Found 1 volumes with recurring job c-wlbo6w"
time="2022-08-14T14:25:09Z" level=info msg="Creating job" concurrent=1 groups= job=c-wlbo6w labels="{\"RecurringJob\":\"c-wlbo6w\"}" retain=1 task=backup volume=jenkins
time="2022-08-14T14:25:09Z" level=info msg="job starts running" jobType=backup labels="map[RecurringJob:c-wlbo6w]" namespace=longhorn-system retain=1 snapshotName=c-wlbo6w-65308d1d-1aa3-4f3d-807f-b05dc5f4c037 volumeName=jenkins
time="2022-08-14T14:25:09Z" level=info msg="Running recurring backup for volume jenkins" jobType=backup labels="map[RecurringJob:c-wlbo6w]" namespace=longhorn-system retain=1 snapshotName=c-wlbo6w-65308d1d-1aa3-4f3d-807f-b05dc5f4c037 volumeName=jenkins
time="2022-08-14T14:25:10Z" level=info msg="Created the snapshot c-wlbo6w-65308d1d-1aa3-4f3d-807f-b05dc5f4c037" jobType=backup labels="map[RecurringJob:c-wlbo6w]" namespace=longhorn-system retain=1 snapshotName=c-wlbo6w-65308d1d-1aa3-4f3d-807f-b05dc5f4c037 volumeName=jenkins
time="2022-08-14T14:25:10Z" level=error msg="failed to run job for volume" concurrent=1 error="failed to complete backupAndCleanup for jenkins: Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=fail to delete snapshot: proxyServer=10.42.76.202:8501 destination=10.42.76.202:10001: failed to remove snapshot [c-wlbo6w-c27c5f94-b035-48f1-8b5a-8296f8934953]: rpc error: code = Unknown desc = Can not remove a snapshot because tcp://10.42.76.236:10015 is rebuilding, code=Server Error] from [http://longhorn-backend:9500/v1/volumes/jenkins?action=snapshotDelete]" groups= job=c-wlbo6w labels="{\"RecurringJob\":\"c-wlbo6w\"}" retain=1 task=backup volume=jenkins
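
For anyone triaging a log like the one above, here is a minimal Python sketch that pulls out only the error-level entries. It is not part of Longhorn: the helper names `parse_logfmt` and `error_entries` are hypothetical, and it assumes every line uses the `key=value` / `key="quoted value"` layout seen in this log, which `shlex` can tokenize. The second sample line is abbreviated from the error entry above.

```python
import shlex


def parse_logfmt(line):
    """Split one key=value log line into a dict (hypothetical helper)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def error_entries(lines):
    """Yield the parsed entries whose level field is 'error'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = parse_logfmt(line)
        except ValueError:
            # shlex raises ValueError on unbalanced quotes; skip such lines
            continue
        if entry.get("level") == "error":
            yield entry


if __name__ == "__main__":
    sample = [
        'time="2022-08-14T14:25:09Z" level=info '
        'msg="Found 1 volumes with recurring job c-wlbo6w"',
        # Abbreviated version of the error line in the log above.
        'time="2022-08-14T14:25:10Z" level=error '
        'msg="failed to run job for volume" volume=jenkins',
    ]
    for entry in error_entries(sample):
        print(entry["time"], entry.get("volume"), entry["msg"])
```

Run against the full log, this would surface only the `failed to run job for volume` entry, whose `error` field carries the "Can not remove a snapshot because tcp://10.42.76.236:10015 is rebuilding" message.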