velero: Backup fails with transport is closing

What steps did you take and what happened:

We have a CI/CD job that takes a backup of the cluster and then restores from that backup. Almost half of the time, the backup ends up with this failure:

time="2019-09-06T13:52:30Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:529"
time="2019-09-06T13:52:30Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>, rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>]" key=kyma-system/6f27c6d4-1c32-4c43-8e6b-55f213761efa logSource="pkg/controller/backup_controller.go:230"

Any idea why this happens and is there anything we can do to prevent this?

Anything else you would like to add:

Here is the backup file we use:

apiVersion: velero.io/v1
kind: Backup
metadata:
  name: kyma-backup
  namespace: kyma-system
spec:
  includedNamespaces:
  - '*'
  includedResources:
  - '*'
  includeClusterResources: true
  storageLocation: default
  volumeSnapshotLocations: 
  - default

We just deploy this file to the cluster using kubectl apply -f.
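For completeness, this is roughly how we apply it and how the failed backup can be inspected afterwards (the file name backup.yaml is just a placeholder, and the velero CLI commands assume the client is installed and pointed at the same cluster):

# Apply the Backup manifest shown above (placeholder file name)
kubectl apply -f backup.yaml

# Check the backup status and any per-resource errors
velero backup describe kyma-backup --details

# Fetch the server-side logs for this backup
velero backup logs kyma-backup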

Environment:

  • Velero version (use velero version): 1.0.0
  • Kubernetes version (use kubectl version): 1.13.9-gke.3
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release):

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (10 by maintainers)

Most upvoted comments

Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now… Thanks a lot for your fast reply!
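For anyone wanting to reproduce that change, here is a minimal sketch of one way to set those requests/limits at install time; the flag names assume a reasonably recent velero install command and the provider/bucket values are placeholders, so double-check velero install --help on your version:

# Re-run the install with a larger CPU/memory budget for the Velero server pod
velero install \
  --provider <your-provider> \
  --bucket <your-bucket> \
  --velero-pod-cpu-request 500m \
  --velero-pod-mem-request 512Mi \
  --velero-pod-cpu-limit 2 \
  --velero-pod-mem-limit 1Gi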

@vmware-tanzu/velero-maintainers I’m guessing we should lower the value for this setting. I set it at 100MB since that’s the max Azure allows, which means Velero will create the minimum number of chunks, but I think it’s causing Velero to exceed its default limits regularly.

We could probably drop the chunk size down to something significantly smaller and it wouldn’t have much impact on most users since their backups will be way under 100MB; users with very large backups can tune it.
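If I am reading the newer Azure object store plugin docs correctly, the block size eventually became configurable per BackupStorageLocation; treat the blockSizeInBytes key below as an assumption to verify against your plugin version, and the other values as placeholders:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: azure
  objectStorage:
    bucket: <your-blob-container>           # placeholder
  config:
    resourceGroup: <your-resource-group>    # placeholder
    storageAccount: <your-storage-account>  # placeholder
    blockSizeInBytes: "10485760"            # 10MB chunks instead of the 100MB default (assumed option)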

It looks stable now, yes. We’ll set up an alert on the consumption to be warned in the future.
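As a quick way to eyeball that consumption before wiring up a proper alert (assuming metrics-server is available and Velero runs in the velero namespace):

# Point-in-time CPU/memory usage of the Velero pod
kubectl top pod -n velero

# Compare against the configured requests/limits
kubectl -n velero get deployment velero -o jsonpath='{.spec.template.spec.containers[0].resources}'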

On Mon, 23 Sep 2019 at 23.47, Adnan Abdulhussein notifications@github.com wrote:

Really appreciate the info @Crevil. Strange, your backups are smaller than this user's (https://github.com/heptio/velero/issues/94#issuecomment-514561793), though their resource usage was lower.

It’s difficult to come up with a baseline that works for everyone, our best recommendation would be to monitor resource usage and set appropriate reqs/limits for your environment. Has the Pod remained stable since removing the default reqs/limits?

@suleymanakbas91 are you still experiencing this issue?


Went from 1 CPU/256Mi to 2 CPU and 1Gi -> works now… Thanks a lot for your fast reply!

This has solved my problem…

Velero: 1.5.1, Azure plugin: 1.1.0

I would first try increasing the memory limit on the Velero deployment. There may be a couple of defaults that aren’t playing nice together. Let us know if that fixes things!
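A minimal sketch of how to bump those limits in place, assuming Velero runs as the velero deployment in the velero namespace; the values below are just a starting point:

# Raise the Velero server pod's resources; the Deployment rolls the pod automatically
kubectl -n velero set resources deployment/velero \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=2,memory=1Gi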

Same for me on Azure AKS with advanced networking.
Everything seems to proceed as expected (backup to the Storage Account, snapshots); only the finalization step seems to break.

velero time="2020-06-18T15:13:17Z" level=info msg="Backup completed" controller=backup logSource="pkg/controller/backup_controller.go:619"
velero time="2020-06-18T15:13:17Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unavailable desc = transport is closing" key=velero/cpp-qa-test logSource="pkg/controller/backup_controller.go:273"

Just to be complete: Velero 1.4.0, Azure plugin for Velero 1.1.0.

Closing this out as inactive. Feel free to reach out again as needed.

We were using the limits set by the install command. Here is a shortened version of what we were running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: velero
spec:
  template:
    spec:
      containers:
      - name: velero
        resources:
          limits:
            cpu: "1"
            memory: 256Mi
          requests:
            cpu: 500m
            memory: 128Mi

We are running on AWS with a self-managed k8s cluster. I inspected one of our backups and we have around 4600 resources and 12 volumes with EC2 snapshots. We are not using restic.

Let me know if you need anything else.