velero: Race condition when passing AdditionalItems from restore item action

What steps did you take and what happened: We have a restore plugin which returns resources via AdditionalItems when those items must be restored before the current item. This happens when there is a dependency relationship between two items of the same resource type. In our particular case, for the imagestreamtags.image.openshift.io resource, some imagestreamtags reference other imagestreamtags, and a restore of imageStreamTagA will fail if it references an imageStreamTagB that does not yet exist. In that case the plugin returns imageStreamTagB in the AdditionalItems slice.

In restore.go, after Execute is run on the plugin, restoreItem is called on each of the AdditionalItems it returned.
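To make the intended ordering concrete, here is a toy model of that flow. It is only a sketch: the `item` and `restorer` types are hypothetical stand-ins, not Velero's actual types, and the append to `created` merely plays the role of the resourceClient.Create call.

```go
package main

import "fmt"

// item is a hypothetical stand-in for a backed-up resource. deps lists the
// names a plugin would hand back as AdditionalItems.
type item struct {
	name string
	deps []string
}

// restorer is a toy model of the restore loop, not Velero's implementation.
type restorer struct {
	items   map[string]item
	seen    map[string]bool
	created []string // order in which "Create" was called
}

func newRestorer(items []item) *restorer {
	r := &restorer{items: map[string]item{}, seen: map[string]bool{}}
	for _, it := range items {
		r.items[it.name] = it
	}
	return r
}

// restoreItem restores every additional item first, then the item itself.
func (r *restorer) restoreItem(name string) {
	if r.seen[name] {
		return // already handled; also guards against reference cycles
	}
	r.seen[name] = true
	for _, dep := range r.items[name].deps {
		r.restoreItem(dep)
	}
	// Stand-in for the Create call on the item itself.
	r.created = append(r.created, name)
}

func main() {
	r := newRestorer([]item{
		{name: "imageStreamTagA", deps: []string{"imageStreamTagB"}},
		{name: "imageStreamTagB"},
	})
	r.restoreItem("imageStreamTagA")
	fmt.Println(r.created) // [imageStreamTagB imageStreamTagA]
}
```

In this model the call order is correct by construction. The problem described in this issue sits outside it: a Create call returning does not guarantee that the cluster already treats the referenced resource as existing.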

From the logs, everything seems to happen in the right order. The plugin action runs for the resource that references the other imagestreamtag, it passes the AdditionalItem reference, and then restore is called on that referenced resource, including its restore plugins. However, when I look at the target namespace in the cluster after restore, I see the returned AdditionalItem resource, but the first resource has not been restored at all. I know that if we attempt to restore an imagestreamtag with a non-existent reference, it will fail to restore, which is the expected behavior (e.g. if we have an alias tag for alpine:3.x which is supposed to track alpine 3.2, and 3.2 does not exist, the tracking tag will not be restored).

In this case, however, what seems to be happening is that restoreItem for the referenced resource returns before the resource is fully created in the cluster, so when the resourceClient.Create call is made, the referenced resource does not yet exist, according to the cluster.

I just modified my local velero checkout to add time.Sleep(10 * time.Second) immediately following the restoreItem call (linked below), and everything works as expected. https://github.com/heptio/velero/blob/master/pkg/restore/restore.go#L938

What did you expect to happen: My expectation was that both of the resources would be restored correctly.

The output of the following commands will help us better understand what’s going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

(I don’t really think anything in my logs will give any more useful information than is already in the description above)

Anything else you would like to add: I’m not sure exactly how to resolve this. If there were some kind of “after restore” plugin action, then for cases like this I could have a plugin that blocks until the resource appears in the cluster. Here, the after-restore action for the returned AdditionalItems resource would not return until that resource had been created, so the resourceClient.Create call for the initial imagestreamtag which references it would not be made until the cluster had finished initializing the referenced resource.
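A rough sketch of what such a blocking step could look like, as an alternative to a fixed sleep. Everything here is an assumption for illustration: `waitUntilExists`, `existsFunc` and `simulate` are hypothetical names, not part of Velero's plugin API, and in a real plugin the `exists` callback would wrap a Get against the cluster.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// existsFunc reports whether the resource is visible yet. Hypothetical: a real
// plugin would implement this with a Get call against the cluster.
type existsFunc func(ctx context.Context) (bool, error)

// waitUntilExists polls exists at the given interval until it reports true,
// returns an error, or ctx is cancelled or times out.
func waitUntilExists(ctx context.Context, interval time.Duration, exists existsFunc) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		ok, err := exists(ctx)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// simulate fakes a resource that becomes visible on check number visibleAfter
// and waits for it for at most timeoutMillis milliseconds.
func simulate(visibleAfter, timeoutMillis int) (checks int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	err = waitUntilExists(ctx, 10*time.Millisecond, func(context.Context) (bool, error) {
		checks++
		return checks >= visibleAfter, nil
	})
	return checks, err
}

func main() {
	checks, err := simulate(3, 1000)
	if err != nil {
		fmt.Println("gave up waiting:", err)
		return
	}
	fmt.Println("resource visible after", checks, "checks")
}
```

Bounding the wait with a context means a resource that never appears produces an error rather than a hung restore, which a fixed time.Sleep cannot distinguish.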

Environment:

Velero version (use velero version): Server version is a build off the master branch
Kubernetes version (use kubectl version): 1.12+
Kubernetes installer & version:
Cloud provider or hardware configuration: openshift 4 on aws
OS (e.g. from /etc/os-release):

About this issue

State: closed
Created 5 years ago
Comments: 16 (8 by maintainers)

Most upvoted comments

I’ve created a PR for a design for this here: https://github.com/vmware-tanzu/velero/pull/2867

sseago on Aug 26, 2020

@nrb @sseago I’m pretty sure we can close this out, now that we’ve closed #964 via #1937 - I’m going to close, but please reopen if you disagree with my assessment.

skriss on Feb 13, 2020