velero: Backup sometimes fails with 'TLS handshake timeout'

What steps did you take and what happened:

Backup tasks sometimes fail with ‘TLS handshake timeout’ when trying to reach the Kubernetes controller.

What did you expect to happen:

Velero should wait for the Kubernetes controller to be reachable again.

The following information will help us better understand what’s going on:

log output:

time="2023-01-02T07:32:46Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = Get \"https://10.0.0.1:443/api\": net/http: TLS handshake timeout" key=velero/daily0 logSource="pkg/controller/backup_controller.go:298"

Anything else you would like to add:

Maybe there is a retry loop missing?
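
To make the suggestion concrete, here is a minimal sketch of the kind of retry loop meant above. It is not Velero code: retryWithBackoff, isTimeout and timeoutErr are hypothetical names made up for this example. One caveat: the error in the log arrives as an "rpc error" string, so a type-based check like isTimeout may not work at that point, which is why the retry predicate is passed in as a parameter.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// isTimeout reports whether err, or any error it wraps, is a network timeout.
// Assumption: the caller still has the original net/http error, not a
// flattened string.
func isTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// retryWithBackoff (hypothetical helper) calls op up to maxAttempts times.
// It stops at the first success, returns immediately on errors that
// retryable rejects, and doubles the delay after each failed attempt.
func retryWithBackoff(maxAttempts int, delay time.Duration, retryable func(error) bool, op func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if !retryable(err) {
			return err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// timeoutErr is a stand-in for a handshake timeout; it implements net.Error.
type timeoutErr struct{}

func (timeoutErr) Error() string   { return "TLS handshake timeout" }
func (timeoutErr) Timeout() bool   { return true }
func (timeoutErr) Temporary() bool { return true }

func main() {
	calls := 0
	err := retryWithBackoff(5, 10*time.Millisecond, isTimeout, func() error {
		calls++
		if calls < 3 {
			return timeoutErr{} // simulate two failed handshakes
		}
		return nil
	})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

With something along these lines, a single failed handshake would cost a short delay instead of a failed backup.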

Environment:

  • Velero version (use velero version): 1.10.0
  • Velero features (use velero client config get features): EnableCSI
  • Kubernetes version (use kubectl version): 1.24.6
  • Kubernetes installer & version: Azure Kubernetes Service with Terraform
  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.6 LTS

About this issue

  • State: open
  • Created a year ago
  • Reactions: 9
  • Comments: 15 (1 by maintainers)

Most upvoted comments

We are running a couple of AKS clusters, all experiencing the same … regularly failed backups due to timeouts. Changing the schedule to pin each cluster to a separate time frame has not solved the issue so far. We are wondering whether it would be possible to integrate a retry within Velero for such cases.

Also seeing this issue regularly and couldn’t pinpoint a culprit so far.

AKS 1.26.3 / 1.27.3, Velero 1.11.1, Azure Plugin 1.8.1

Out of curiosity:

  • Is everyone experiencing this running AKS? Or is someone experiencing this regularly and not using AKS?
  • Are your schedules always set to the “full hour” or to some odd time?

I’m also having this problem

time="2023-05-22T02:00:44Z" level=error msg="backup failed" controller=backup error="rpc error: code = Unknown desc = Get \"https://10.0.0.1:443/api?timeout=32s\": net/http: TLS handshake timeout" key=velero/generalbackup01backup01-20230522020033 logSource="pkg/controller/backup_controller.go:282"

Yes, of course, every operation needs to time out at some point. What I am missing is for Velero to try again if this happens, or if any other error occurs.
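
To illustrate what I mean by "time out, but try again": a rough sketch in which every attempt has its own timeout and the retries as a whole are bounded by an overall deadline. Nothing here is taken from Velero. retryUntilDeadline and demo are hypothetical names, and the durations are arbitrary values chosen for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retryUntilDeadline (hypothetical helper) gives every attempt its own
// timeout and keeps retrying, pausing between attempts, until op succeeds
// or the overall context ctx is done.
func retryUntilDeadline(ctx context.Context, perAttempt, pause time.Duration, op func(context.Context) error) error {
	for {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		lastErr := op(attemptCtx)
		cancel()
		if lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("overall deadline reached: %w (last error: %v)", ctx.Err(), lastErr)
		case <-time.After(pause):
			// Pause is over, try again.
		}
	}
}

// demo simulates an API server that hangs twice and answers on the third try.
func demo() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	attempts := 0
	err := retryUntilDeadline(ctx, 100*time.Millisecond, 10*time.Millisecond, func(c context.Context) error {
		attempts++
		if attempts < 3 {
			<-c.Done() // hang until the per-attempt timeout fires
			return c.Err()
		}
		return nil
	})
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

The per-attempt timeout keeps a single hung handshake from blocking forever, while the outer deadline still guarantees that the backup eventually fails if the API server stays unreachable.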

I’m not using a proxy. Resources should be sufficient: I had OOM kills in the past, but they vanished once I configured a memory request of 512 MiB for the pod.