velero: scheduled backups are failing 50% of the time with "connect: connection refused"

What steps did you take and what happened: Let the schedule try to run a backup, or run this command: velero backup create xxxxx. Either way, roughly half of the backups end up with the status Failed.
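
For reference, a minimal sketch of how the failing backups are triggered and inspected, assuming the velero CLI is pointed at the affected cluster (the backup name is a placeholder, as above):

```sh
# Manual backup with a placeholder name, mirroring the report above
velero backup create xxxxx

# List backups and check their phase; the failing ones show "Failed"
velero backup get

# Inspect a single backup and its logs for the underlying error
velero backup describe xxxxx --details
velero backup logs xxxxx
```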

What did you expect to happen: No backups with the status Failed

The following information will help us better understand what’s going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue. For more options, please refer to velero debug --help.
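
For this particular failure that would be, for example (using the backup name from the error log below):

```sh
velero debug --backup velero-backup6h-20220120171346
```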

If you are using earlier versions:
Please provide the output of the following commands (pasting long output into a GitHub gist or other pastebin is fine):

kubectl logs deployment/velero -n velero | grep level=error | grep velero-backup6h-20220120171346

time="2022-01-20T17:15:44Z" level=error msg="backup failed" controller=backup error="[rpc error: code = Unavailable desc = transport is closing, rpc error: code = Unavailable desc = connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin562653323: connect: connection refused\"]" key=velero/velero-backup6h-20220120171346 logSource="pkg/controller/backup_controller.go:281"
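
As a side note, the "dial unix /tmp/plugin562653323: connect: connection refused" part typically means the Velero server could not reach one of its plugin processes over the local unix socket, i.e. the plugin process went away, which can happen when the pod is starved of resources. A minimal sketch of checks, assuming the default velero deployment in the velero namespace:

```sh
# Check whether the Velero server pod is restarting
kubectl -n velero get pods

# Look at the pod's last termination state (e.g. OOMKilled) and recent events
kubectl -n velero describe pod <velero-pod-name>

# Inspect the resource requests/limits currently set on the Velero deployment
kubectl -n velero get deployment velero \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```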

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version): 1.7.1
  • Velero features (use velero client config get features): features: <NOT SET>
  • Kubernetes version (use kubectl version): 1.20.12
  • Cloud provider or hardware configuration: Azure

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project’s top-voted issues listed here.
Use the “reaction smiley face” up to the right of this comment to vote.

  • 👍 for “I would like to see this bug fixed as soon as possible”
  • 👎 for “There are more important bugs to focus on right now”

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 22 (9 by maintainers)

Most upvoted comments

@blackpiglet It looks like you were right: the pod lacked resources. Because the load spikes were so short, the monitoring system did not capture the actual resource consumption in time. Increasing the resource requests has added stability.
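
In case it helps others, a minimal sketch of bumping those requests/limits, assuming the default velero deployment in the velero namespace (the values are illustrative, not a recommendation):

```sh
# Illustrative values only -- size these to your own workload
kubectl -n velero set resources deployment/velero \
  --requests=cpu=500m,memory=512Mi \
  --limits=cpu=1,memory=1Gi
```

If you install or reinstall with the CLI, velero install also exposes flags for the same settings (e.g. --velero-pod-cpu-request and --velero-pod-mem-request, plus the matching limit flags).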