azure-sdk-for-go: `containerregistry` `2021-08-01-preview`: AgentPool's `Delete` method returns too early
When trying to delete a containerregistry’s agentpool, the API call returns completion too early, underlying VMSS ip configuration still stays for a few.
Trying to implement support for it in azure provider here.
Destruction during acceptance test fails with:
------- Stdout: -------
=== RUN TestAccContainerRegistryAgentPool_complete
=== PAUSE TestAccContainerRegistryAgentPool_complete
=== CONT TestAccContainerRegistryAgentPool_complete
testing_new.go:70: Error running post-test destroy, there may be dangling resources: exit status 1
Error: deleting Subnet: (Name "internal" / Virtual Network Name "acctestvirtnet210713115141799144" / Resource Group "acctestRG-acragent_pool-210713115141799144"): network.SubnetsClient#Delete: Failure sending request: StatusCode=0 --
Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet internal is in use by /subscriptions/f9954d5e-6e81-4951-b504-56c712bcfe31/resourceGroups/ap-testacccr210713115141799144-ap210713115141744-394/providers/Microsoft.Network/networkInterfaces/|providers|Microsoft.Compute|virtualMachineScaleSets|testacccr210713115141799144-ap210713115141744-394-0|virtualMachines|0|networkInterfaces|ap-testacccr210713115141799144-ap210713115141744-394-0-nic/ipConfigurations/ap-testacccr210713115141799144-ap210713115141744-394-0-ipconfig and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]
--- FAIL: TestAccContainerRegistryAgentPool_complete (750.65s)
Deleting subnet from portal after 3-5 minutes succeeds without any issues.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 53 (30 by maintainers)
@magodo so good news, we were able to track your issue down on our side and we’re working on a fix
@magodo Thanks for the repro, I should be able to take a deeper dive,
Also a status update from our side that should be able to address some of these issues, cc: @favoretti
We’re in the process of rolling out a new operation status api to our resource provider that internally should allow for a better polling experience. Last week we merged the change and so now it’s just a matter of enabling things in production. This process will take a bit of time to roll out but I’m optimistic we should be able to wrap things up in about a month or two.
@juliusl
So step one would be to fork/clone the azurerm provider. Then set up the following env vars:
Then check out PR code from https://github.com/hashicorp/terraform-provider-azurerm/pull/14790 and run
@favoretti That would be much appreciated thanks 🙏
@favoretti got it, so that’s probably why jhendrix might’ve not run into the error.
Well in any case, from my side I’m currently in the process of adding a new api to our RP for the operation result polling. It will take a bit to deploy, but it should fix the 404. I’m crossing my fingers that it will address your issue as well.
For now, have you considered adding a 5-min grace period whenever that code runs into an error deleting the subnet? I know it’s a hack but it should unblock you for now.
I’ll keep updating this issue with my status. Just so you are aware, service changes typically take a while to make it downstream.
Setting env var
AZURE_GO_SDK_LOG_LEVELtodebugis the best way to debug LROs as you can see all the HTTP requests and responses.@juliusl it looks to me that the behavior for delete here is non-conformant.
Here’s the last successful poll on the URL from the Location header (I’ve omitted some content for brevity).
Note that the LRO status is non-terminal
Deleting. Now we poll again.Looks like the polling URL has been deleted. If the LRO has reached a successful terminal state, it should either return a 200 with a successful terminal state (e.g.
Succeeded) or a 204.@ArcturusZhang Yep, I am calling
WaitForCompletionRefindeed: https://github.com/hashicorp/terraform-provider-azurerm/pull/14790/files#diff-f6764132d0fc067d88469fea293e2010a129184291d121243d34695fe39c2e77R252This function returns a future which indicates this is a long-running operation. If an operation is a LRO, the caller should call its
WaitForCompletionRefmethod in which the SDK will do the polling until done as python SDK is doing.The polling logic is not implemented in the SDK but in another library
go-autorestwhich is why you did not see the polling logic in the SDK (it is here if you are interested).The reason why the Responder of this operation is not called is by-design. This operation does not have a response, therefore the SDK does not use the Responder, because it is not needed. If an LRO is returning a schema (like PUT), the responder will be invoked (example here)
The function
WaitForCompletionRefwill not return unless it gets thesuccessfrom the LRO polling. Therefore in this case, if you could confirm you calledWaitForCompletionRefin your code @favoretti , the root cause is not in the SDK. The service should not return success in an LRO if not everything is done. This happens occasionally in some APIs, their LRO might return success when they are done things in their own RP, and things from other RPs are still in progress (like in this case, the subnet from NRP is detaching but not finished, the API returns success) @juliusl please double confirm this.@ArcturusZhang Hello, I am from the ACR Service team, I took a look at the sdk code and I noticed a couple of things.
First for clarification, I notice that the client code has a Responder but does not use it in the either the Create/Delete methods, is this intentional?
Second, in ARM Delete is an async operation, meaning the client code needs to poll for the operation status before returning. This is not handled on the service side, so the sdk needs to do this here:
https://github.com/Azure/azure-sdk-for-go/blob/6264636169251df0ea48912167b417c2a9c108ce/services/preview/containerregistry/mgmt/2021-08-01-preview/containerregistry/agentpools.go#L168-L174
before returning to the caller.
For example, you can see in the python sdk we do the polling here:
https://github.com/Azure/azure-cli/blob/dev/src/azure-cli/azure/cli/command_modules/acr/agentpool.py#L99
How are changes like this handled in the go sdk?
Hi @toddysm do we have some updates on this?