azure-sdk-for-go: `containerregistry` `2021-08-01-preview`: AgentPool's `Delete` method returns too early

When trying to delete a containerregistry’s agentpool, the API call returns completion too early, underlying VMSS ip configuration still stays for a few.

Trying to implement support for it in azure provider here.

Destruction during acceptance test fails with:

------- Stdout: -------
=== RUN   TestAccContainerRegistryAgentPool_complete
=== PAUSE TestAccContainerRegistryAgentPool_complete
=== CONT  TestAccContainerRegistryAgentPool_complete
testing_new.go:70: Error running post-test destroy, there may be dangling resources: exit status 1
Error: deleting Subnet: (Name "internal" / Virtual Network Name "acctestvirtnet210713115141799144" / Resource Group "acctestRG-acragent_pool-210713115141799144"): network.SubnetsClient#Delete: Failure sending request: StatusCode=0 -- 
Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet internal is in use by /subscriptions/f9954d5e-6e81-4951-b504-56c712bcfe31/resourceGroups/ap-testacccr210713115141799144-ap210713115141744-394/providers/Microsoft.Network/networkInterfaces/|providers|Microsoft.Compute|virtualMachineScaleSets|testacccr210713115141799144-ap210713115141744-394-0|virtualMachines|0|networkInterfaces|ap-testacccr210713115141799144-ap210713115141744-394-0-nic/ipConfigurations/ap-testacccr210713115141799144-ap210713115141744-394-0-ipconfig and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]
--- FAIL: TestAccContainerRegistryAgentPool_complete (750.65s)

Deleting subnet from portal after 3-5 minutes succeeds without any issues.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 53 (30 by maintainers)

Most upvoted comments

@magodo so good news, we were able to track your issue down on our side and we’re working on a fix

@magodo Thanks for the repro, I should be able to take a deeper dive,

Also a status update from our side that should be able to address some of these issues, cc: @favoretti

We’re in the process of rolling out a new operation status api to our resource provider that internally should allow for a better polling experience. Last week we merged the change and so now it’s just a matter of enabling things in production. This process will take a bit of time to roll out but I’m optimistic we should be able to wrap things up in about a month or two.

@juliusl

So step one would be to fork/clone the azurerm provider. Then set up the following env vars:

export ARM_CLIENT_ID="xxx"
export ARM_CLIENT_SECRET="xxx"
export ARM_CLIENT_ID_ALT="xxx" 
export ARM_CLIENT_SECRET_ALT="xxx"
export ARM_SUBSCRIPTION_ID="xxx" # sub where acceptance tests will create resources
export ARM_TENANT_ID="xxx"
export ARM_ENVIRONMENT="public"
export ARM_TEST_LOCATION="eastus2"
export ARM_TEST_LOCATION_ALT="centralus"
export ARM_TEST_LOCATION_ALT2="northcentralus"

Then check out PR code from https://github.com/hashicorp/terraform-provider-azurerm/pull/14790 and run

TF_ACC=1 go test -v ./azurerm/internal/services/containers -timeout=1000m -run=TestAccContainerRegistryAgentPool_complete

@favoretti That would be much appreciated thanks 🙏

@favoretti got it, so that’s probably why jhendrix might’ve not run into the error.

Well in any case, from my side I’m currently in the process of adding a new api to our RP for the operation result polling. It will take a bit to deploy, but it should fix the 404. I’m crossing my fingers that it will address your issue as well.

For now, have you considered adding a 5-min grace period whenever that code runs into an error deleting the subnet? I know it’s a hack but it should unblock you for now.

I’ll keep updating this issue with my status. Just so you are aware, service changes typically take a while to make it downstream.

Setting env var AZURE_GO_SDK_LOG_LEVEL to debug is the best way to debug LROs as you can see all the HTTP requests and responses.

@juliusl it looks to me that the behavior for delete here is non-conformant.

Here’s the last successful poll on the URL from the Location header (I’ve omitted some content for brevity).

(2022-01-13T14:54:38.9749705-08:00) INFO: REQUEST: GET https://management.azure.com/subscriptions/6d3860f6-8a11-431d-b3fa-1b3c4a8b888a/resourceGroups/jhendrixtest1/providers/Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1/operationResults/92110f9b-3ad7-4132-a6ae-164c77a8e178?api-version=2019-06-01-preview
User-Agent: Go/go1.17 (amd64-windows) go-autorest/v14.2.1 Azure-SDK-For-Go/v61.1.0 containerregistry/2021-08-01-preview
Authorization: **REDACTED**
(2022-01-13T14:54:39.0253957-08:00) INFO: RESPONSE: 202 https://management.azure.com/subscriptions/6d3860f6-8a11-431d-b3fa-1b3c4a8b888a/resourceGroups/jhendrixtest1/providers/Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1/operationResults/92110f9b-3ad7-4132-a6ae-164c77a8e178?api-version=2019-06-01-preview
Retry-After: 10
Location: https://management.azure.com/subscriptions/6d3860f6-8a11-431d-b3fa-1b3c4a8b888a/resourceGroups/jhendrixtest1/providers/Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1/operationResults/92110f9b-3ad7-4132-a6ae-164c77a8e178?api-version=2019-06-01-preview
{"status":"Deleting"}
(2022-01-13T14:54:39.0260859-08:00) INFO: DelayForBackoffWithCap: sleeping for 10s

Note that the LRO status is non-terminal Deleting. Now we poll again.

(2022-01-13T14:54:49.0310997-08:00) INFO: REQUEST: GET https://management.azure.com/subscriptions/6d3860f6-8a11-431d-b3fa-1b3c4a8b888a/resourceGroups/jhendrixtest1/providers/Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1/operationResults/92110f9b-3ad7-4132-a6ae-164c77a8e178?api-version=2019-06-01-preview
User-Agent: Go/go1.17 (amd64-windows) go-autorest/v14.2.1 Azure-SDK-For-Go/v61.1.0 containerregistry/2021-08-01-preview
Authorization: **REDACTED**
(2022-01-13T14:54:49.0705737-08:00) INFO: RESPONSE: 404 https://management.azure.com/subscriptions/6d3860f6-8a11-431d-b3fa-1b3c4a8b888a/resourceGroups/jhendrixtest1/providers/Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1/operationResults/92110f9b-3ad7-4132-a6ae-164c77a8e178?api-version=2019-06-01-preview
{"error":{"code":"ResourceNotFound","message":"The Resource 'Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1' under resource group 'jhendrixtest1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"}}
(2022-01-13T14:54:49.0710431-08:00) ERROR: WaitForCompletionRef: Code="ResourceNotFound" Message="The Resource 'Microsoft.ContainerRegistry/registries/jhendrixacr2/agentPools/jhendrixpool1' under resource group 'jhendrixtest1' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"
(2022-01-13T14:54:49.0717104-08:00) INFO: DelayForBackoffWithCap: sleeping for 30s
...

Looks like the polling URL has been deleted. If the LRO has reached a successful terminal state, it should either return a 200 with a successful terminal state (e.g. Succeeded) or a 204.

@ArcturusZhang Hello, I am from the ACR Service team, I took a look at the sdk code and I noticed a couple of things.

First for clarification, I notice that the client code has a Responder but does not use it in the either the Create/Delete methods, is this intentional?

Second, in ARM Delete is an async operation, meaning the client code needs to poll for the operation status before returning. This is not handled on the service side, so the sdk needs to do this here:

https://github.com/Azure/azure-sdk-for-go/blob/6264636169251df0ea48912167b417c2a9c108ce/services/preview/containerregistry/mgmt/2021-08-01-preview/containerregistry/agentpools.go#L168-L174

before returning to the caller.

For example, you can see in the python sdk we do the polling here:

https://github.com/Azure/azure-cli/blob/dev/src/azure-cli/azure/cli/command_modules/acr/agentpool.py#L99

How are changes like this handled in the go sdk?

This function returns a future which indicates this is a long-running operation. If an operation is a LRO, the caller should call its WaitForCompletionRef method in which the SDK will do the polling until done as python SDK is doing.

The polling logic is not implemented in the SDK but in another library go-autorest which is why you did not see the polling logic in the SDK (it is here if you are interested).

The reason why the Responder of this operation is not called is by-design. This operation does not have a response, therefore the SDK does not use the Responder, because it is not needed. If an LRO is returning a schema (like PUT), the responder will be invoked (example here)

The function WaitForCompletionRef will not return unless it gets the success from the LRO polling. Therefore in this case, if you could confirm you called WaitForCompletionRef in your code @favoretti , the root cause is not in the SDK. The service should not return success in an LRO if not everything is done. This happens occasionally in some APIs, their LRO might return success when they are done things in their own RP, and things from other RPs are still in progress (like in this case, the subnet from NRP is detaching but not finished, the API returns success) @juliusl please double confirm this.

@ArcturusZhang Hello, I am from the ACR Service team, I took a look at the sdk code and I noticed a couple of things.

First for clarification, I notice that the client code has a Responder but does not use it in the either the Create/Delete methods, is this intentional?

Second, in ARM Delete is an async operation, meaning the client code needs to poll for the operation status before returning. This is not handled on the service side, so the sdk needs to do this here:

https://github.com/Azure/azure-sdk-for-go/blob/6264636169251df0ea48912167b417c2a9c108ce/services/preview/containerregistry/mgmt/2021-08-01-preview/containerregistry/agentpools.go#L168-L174

before returning to the caller.

For example, you can see in the python sdk we do the polling here:

https://github.com/Azure/azure-cli/blob/dev/src/azure-cli/azure/cli/command_modules/acr/agentpool.py#L99

How are changes like this handled in the go sdk?

Hi @toddysm do we have some updates on this?