velero: Backup fails with transport is closing
What steps did you take and what happened:
We have a CI/CD job which takes a backup of the cluster and then restores from the backup. Almost half of the time, the backup ends up with this failure:
time="2019-09-06T13:52:30Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:529"
time="2019-09-06T13:52:30Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>]" key=kyma-system/6f27c6d4-1c32-4c43-8e6b-55f213761efa logSource="pkg/controller/backup_controller.go:230"
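For anyone hitting the same error: a gRPC "transport is closing" usually means the process on the other end of the connection went away mid-call, and in this thread the cause turned out to be the Velero pod running out of memory. A quick, hedged way to check for that (the namespace is assumed to be `velero`; substitute your own, e.g. `kyma-system` here):

```shell
# Look for restarts on the Velero pod
kubectl -n velero get pods

# Check whether the previous container terminated with OOMKilled
kubectl -n velero describe pods | grep -A 3 "Last State"
```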
Any idea why this happens and is there anything we can do to prevent this?
Anything else you would like to add:
Here is the backup file we use:
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kyma-backup
  namespace: kyma-system
spec:
  includedNamespaces:
  - '*'
  includedResources:
  - '*'
  includeClusterResources: true
  storageLocation: default
  volumeSnapshotLocations:
  - default
We just deploy this file to the cluster using kubectl apply -f.
Environment:
- Velero version (use velero version): 1.0.0
- Kubernetes version (use kubectl version): 1.13.9-gke.3
- Kubernetes installer & version:
- Cloud provider or hardware configuration: GKE
- OS (e.g. from /etc/os-release):
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 22 (10 by maintainers)
Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now… Thanks a lot for your fast reply!
@vmware-tanzu/velero-maintainers I’m guessing we should lower the value for this setting. I set it at 100MB since that’s the max Azure allows, which means Velero will create the minimum number of chunks, but I think it’s causing Velero to exceed its default limits regularly.
We could probably drop the chunk size down to something significantly smaller and it wouldn’t have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
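To illustrate what tuning the chunk size could look like: later releases of the velero-plugin-for-microsoft-azure expose a `blockSizeInBytes` key in the BackupStorageLocation config (this field is an assumption for the plugin versions discussed in this thread; check your plugin's documentation before relying on it, and note that the container, resource group, and storage account names below are placeholders):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: velero-backups        # placeholder container name
  config:
    resourceGroup: my-rg          # placeholder
    storageAccount: mystorageacct # placeholder
    blockSizeInBytes: "10485760"  # e.g. 10 MiB instead of the 100 MiB maximum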
It looks stable now, yes. We’ll set up an alert on the consumption to be warned in the future.
This has solved my problem…
Velero 1.5.1 Azure plugin: 1.1.0
I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren’t playing nice together. Let us know if that fixes things!
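As a sketch of what that change looks like, here is the `resources` stanza on the Velero server container, using the values that worked for the reporter above (this assumes a standard `velero install` layout, i.e. a Deployment named `velero` in the `velero` namespace that you can reach with `kubectl edit deployment/velero -n velero`):

```yaml
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 1Gi
```

Newer versions of the `velero install` command also accept flags such as `--velero-pod-cpu-limit` and `--velero-pod-mem-limit` to set these at install time; check `velero install --help` for your version.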
Same for me on Azure AKS with advanced networking.
Everything seems to proceed as expected (Backup on Storage Account, Snapshots), just the finalization seems to break.
Just to be complete: Velero: 1.4.0 Azure-Plugin for Velero: 1.1.0
Closing this out as inactive. Feel free to reach out again as needed.
We were using the limits as specified by the velero install command. Here is a shortened version of what we were running. We are running on AWS with a self-managed k8s cluster. I inspected one of our backups and we have around 4,600 resources and 12 volumes with EC2 snapshots. We are not using restic.
Let me know if you need anything else.