velero: Non-existent items should not cause backup to be PartiallyFailed

xref #1858

During a backup, if we have, for example, a pod that references a non-existent PVC, the backup will be PartiallyFailed because the PVC doesn’t exist.

Arguably, the non-existence of the PVC should not cause the backup to fail. It’s not Velero’s job to validate the API resources, just to back them up as-is.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

We discussed this in today’s community call and came to a consensus that:

  • non-existent items should not cause a backup to fail or partially fail
  • they should be logged as warnings, so that there is some visibility around them (the backup CR gets updated with a warning count when the backup completes, and it’s displayed in the CLI)
  • it’d be nice to have velero backup describe --details automatically show warnings, similarly to how velero restore describe --details does. This is slightly more complicated than it may sound, because the backends for how backup/restore warnings are stored are different. There’s an issue logged to reconcile them (https://github.com/vmware-tanzu/velero/issues/286).

Some implementation notes:

@cblecker we did not resolve this, no.

I do personally agree that we should not flag these things as errors in Velero, but I’m not sure we had reached consensus as a team.

This would be a great item to bring up in the open discussion at our weekly community meeting, if you’re able to join!

If not, let’s continue discussing here and see if we can get to a resolution.

OK, no worries. Hope to see you back again as a contributor before too long!

@skriss I won’t have time in the short term to, unfortunately. If there is someone else with cycles, that would be great.

Another way we could approach this is rather than doing a blanket ignore of not-found errors when backing up additional items, we could add code to the “add PVC from pod” backup item action that checks if the PVC referred to by the pod actually exists; if it does, return it as an additional item; if not, don’t return any additional item to back up. (ref https://github.com/vmware-tanzu/velero/blob/master/pkg/backup/pod_action.go#L69)

The benefit of this is that it’s more context-aware – we know that in this specific scenario, a “not-found” error is not actually an error. The downside is we’d have to add logic like this to every plugin that could potentially have this issue (I believe some of the RBAC-related item actions have similar issues).

It’s intended to indicate that the user needs to look at the backup and decide if it’s usable or not - so more of a warning than a failure. FWIW, I originally wanted to call it CompletedWithWarnings but got overruled 😃

In this scenario, though, I really don’t see this as an error from the Velero perspective. If a pod references a non-existent PVC, and we back up the pod (including that reference), then not being able to back up a PVC that doesn’t exist isn’t an error in the backup - we backed up what was there, and there was no volume to back up. It is potentially an error in their configuration, but I don’t think that’s Velero’s job to detect?

@carlisia yes #1858 is related too - if we decide that this kind of scenario should result in PartiallyFailed, then the user wants to be able to restore it using the --from-schedule flag, which currently selects the most recent Completed backup. We could go at it from that side too - just didn’t want to add a new API field if we didn’t need to.