velero: Removal of expired backups does not work

What steps did you take and what happened:

In our AWS-based setup, when scheduled backups reach their TTL, the deletion process starts but gets stuck in the Deleting phase. The backup contents in the S3 bucket are deleted properly, while the volume snapshots remain (causing significant extra cost).
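For reference, the leftover snapshots can be confirmed directly in AWS. A rough check, assuming the snapshots carry Velero's velero.io/backup tag (the exact tag keys may differ depending on the plugin version):

  # list snapshots owned by this account that carry a Velero backup tag
  aws ec2 describe-snapshots --owner-ids self \
    --filters "Name=tag-key,Values=velero.io/backup" \
    --query 'Snapshots[].{Id:SnapshotId,Started:StartTime}' \
    --output table

The snapshots created by the expired backups still show up there after the backups have expired.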

What did you expect to happen:

I expect backups to be removed cleanly when their TTL expires, including all backed-up data such as volume snapshots.

The output of the following commands will help us better understand what’s going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
time="2020-11-18T08:23:47Z" level=info msg="Removing existing deletion requests for backup" backup=velero-nightly-backup-20201103034355 controller=backup-deletion logSource="pkg/controller/backup_deletion_controller.go:469" name=velero-nightly-backup-20201103034355-gt6b9 namespace=velero
time="2020-11-18T08:23:50Z" level=error msg="Error in syncHandler, re-adding item to queue" controller=backup-deletion error="error downloading backup: error copying Backup to temp file: rpc error: code = Unknown desc = error getting object backups/velero-nightly-backup-20201103034355/velero-nightly-backup-20201103034355.tar.gz: NoSuchKey: The specified key does not exist.\n\tstatus code: 404, request id: 01DBEB5FABBF40BD, host id: HKe3B0heM0NpUhxXbLEZp7THCXtsfDKJkYdR6Sg0bS3+j0ywshitElmEnG7mPdDNmq6ASEtKT6w=" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/controller/restore_controller.go:558" error.function=github.com/vmware-tanzu/velero/pkg/controller.downloadToTempFile key=velero/velero-nightly-backup-20201103034355-gt6b9 logSource="pkg/controller/generic_controller.go:140"
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
Name:         velero-nightly-backup-20201103034355
Namespace:    velero
Labels:       app.kubernetes.io/instance=velero
              app.kubernetes.io/managed-by=Tiller
              app.kubernetes.io/name=velero
              helm.sh/chart=velero-2.0.3
              velero.io/schedule-name=velero-nightly-backup
              velero.io/storage-location=aws
Annotations:  <none>

Phase:  Deleting

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  aws

Velero-Native Snapshot PVs:  auto

TTL:  360h0m0s

Hooks:  <none>

Backup Format Version:

Started:    2020-11-03 04:43:55 +0100 CET
Completed:  2020-11-03 04:52:17 +0100 CET

Expiration:  2020-11-18 04:43:55 +0100 CET

Velero-Native Snapshots:  2 of 2 snapshots completed successfully (specify --details for more information)

Deletion Attempts:
  2020-11-18 06:30:25 +0100 CET: InProgress
  • velero backup logs <backupname>
Logs for backup "velero-nightly-backup-20201103034355" are not available until it's finished processing. Please wait until the backup has a phase of Completed or Failed and try again.

Anything else you would like to add:

My guess is that this is at least loosely related to https://github.com/vmware-tanzu/velero/pull/2993.
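Regarding the NoSuchKey error in the controller log above: the missing tarball can be confirmed by listing the backup's prefix in object storage (the bucket name below is a placeholder):

  aws s3 ls s3://<your-velero-bucket>/backups/velero-nightly-backup-20201103034355/

In our case that prefix is already empty, which matches the behaviour described above: the S3 contents are deleted, but the snapshot cleanup never completes.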

Environment:

Velero version:

Client:
	Version: v1.5.2
	Git commit: -
Server:
	Version: v1.5.2

Velero features:

features: <NOT SET>

Kubernetes version: 1.18.9

Kubernetes installer & version: kops 1.18.1

Cloud provider or hardware configuration: AWS (with aws plugin)

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the “reaction smiley face” up to the right of this comment to vote.

  • 👍 for “I would like to see this bug fixed as soon as possible”
  • 👎 for “There are more important bugs to focus on right now”

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 9
  • Comments: 18 (9 by maintainers)

Most upvoted comments

Please add a velero backup delete --force parameter. I do not think kubectl delete backup is a good idea.

@billimek For me it was the only way to clean this up, and I have not encountered any problems afterwards so far. But still, it’s just guessing 😉

I think that I’m experiencing this as well. Is it ‘safe’ to manually delete the backups.velero.io objects that seem to be ‘stuck’ deleting (e.g. k delete backups.velero.io -n velero velero-daily-backup-20201212060042)?
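For anyone hitting the same thing, a minimal sketch of the manual cleanup described in these comments, using the stuck DeleteBackupRequest name from the controller log above; this is only a workaround, and the orphaned EBS snapshots still have to be removed by hand (snapshot IDs below are placeholders):

  # remove the stuck deletion request and the backup object itself
  kubectl -n velero delete deletebackuprequests.velero.io velero-nightly-backup-20201103034355-gt6b9
  kubectl -n velero delete backups.velero.io velero-nightly-backup-20201103034355

  # then delete the leftover volume snapshots directly in AWS
  aws ec2 delete-snapshot --snapshot-id <snapshot-id>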
