terraform-provider-azurerm: azurerm_role_definition WaitForState pending state not set correctly

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform v0.14.6

  • provider registry.terraform.io/hashicorp/azurerm v2.36.0

Affected Resource(s)

  • azurerm_role_definition

Terraform Configuration Files

resource "azurerm_role_definition" "example" {
  name        = "my-custom-role"
  scope       = "ROOT MANAGEMENT GROUP"
  description = "This is a custom role created via Terraform"

  permissions {
    actions     = ["*"]
    not_actions = []
  }

  assignable_scopes = [
    "ROOT MANAGEMENT GROUP"
  ]
}

Debug Output

Error: Provider produced inconsistent result after apply

When applying changes to
azurerm_role_definition.custom_roles["workload_contributor.yml"], provider
"registry.terraform.io/hashicorp/azurerm" produced an unexpected new value:
Root resource was present, but now absent.

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Steps to Reproduce

  1. terraform apply

Important Factoids

It seems the WaitForState() function is called with the wrong Pending state, as the API returns 404 while the role is not found yet.

Currently:

	if !d.IsNewResource() {
		id, err := parse.RoleDefinitionId(d.Id())
		if err != nil {
			return err
		}
		stateConf := &resource.StateChangeConf{
			Pending: []string{
				"Pending",
			},
			Target: []string{
				"OK",
			},
			Refresh:                   roleDefinitionUpdateStateRefreshFunc(ctx, client, id.ResourceID),
			MinTimeout:                10 * time.Second,
			ContinuousTargetOccurence: 6,
			Timeout:                   d.Timeout(schema.TimeoutUpdate),
		}

		if _, err := stateConf.WaitForState(); err != nil {
			return fmt.Errorf("waiting for update to Role Definition %q to finish replicating", name)
		}
	}

Should be:

	if !d.IsNewResource() {
		id, err := parse.RoleDefinitionId(d.Id())
		if err != nil {
			return err
		}
		stateConf := &resource.StateChangeConf{
			Pending: []string{
				"NotFound",
			},
			Target: []string{
				"OK",
			},
			Refresh:                   roleDefinitionUpdateStateRefreshFunc(ctx, client, id.ResourceID),
			MinTimeout:                10 * time.Second,
			ContinuousTargetOccurence: 6,
			Timeout:                   d.Timeout(schema.TimeoutUpdate),
		}

		if _, err := stateConf.WaitForState(); err != nil {
			return fmt.Errorf("waiting for update to Role Definition %q to finish replicating", name)
		}
	}
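
For context, the refresh function referenced by both snippets presumably looks something like the sketch below (the client, helper, and parameter names are assumptions based on the provider's conventions, not verbatim provider code): it reports "NotFound" while the Authorization API still returns 404 and "OK" once the definition is readable. Because "Pending" is never a state the refresh function returns, the 404 observed during replication is treated as an unexpected state and WaitForState() fails straight away, which surfaces as the error shown above.

	// Minimal sketch of the refresh function (assumed shape, not the provider's exact code):
	// poll the role definition by ID and map the HTTP outcome to a state string.
	func roleDefinitionUpdateStateRefreshFunc(ctx context.Context, client *authorization.RoleDefinitionsClient, roleDefinitionId string) resource.StateRefreshFunc {
		return func() (interface{}, string, error) {
			resp, err := client.GetByID(ctx, roleDefinitionId)
			if err != nil {
				if utils.ResponseWasNotFound(resp.Response) {
					// Replication has not caught up yet - the API still returns 404.
					return resp, "NotFound", nil
				}
				return resp, "Error", err
			}
			return resp, "OK", nil
		}
	}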

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 36
  • Comments: 25 (5 by maintainers)

Most upvoted comments

This is breaking automation for a lot of people. Personally, I have Terraform that should run automatically on a daily basis but that I now have to intervene in manually every time. Can we get some transparency on a roadmap or timeline for when this issue will be addressed? Coming over from #10442, it seems this issue is not being properly recognized as massively disruptive to many people's workdays.

In order to improve call latencies, we had introduced distributed caching of Role Definitions. As a result of this change, some customers experienced failures when a role definition was read immediately after writing. The root cause of the failure was that the read request was served from the cache, which was not updated immediately after the write. We have disabled the distributed caching of Role Definitions, which has fixed the failures.

The root cause has been well understood and we will take extreme care in the future to prevent repeat occurrences of similar failures for this scenario and other scenarios.

There is still a small probability that read-after-write scenarios can experience this failure, since we cache Role Definitions in memory as well. The in-memory cache is not new and we have had it for a long time, so the probability of this issue happening again is the same as it was before we added the distributed cache.

Thank you all for the feedback which helps make our services better.

This issue is expected to have been fixed in the service backend, which does not require any Terraform changes. People who hit this issue, please give it a try.

I am another customer whose deploys have all started working reliably again for a few days, suggesting this has been fixed.

Hi all, I posted my findings here after seeing things become stable in my local Terraform project involving the role definition story, compared with the previous symptom where I saw a 50% failure rate. That said, I'm sorry but I do not have context on the backend.

Hi BrandonE, I can try contacting the corresponding backend support team to see whether they can provide an explanation, if that would help.

Thank you both. After reading the docs (and also figuring out that my scope is not /subscriptions/00000000-0000-0000-0000-000000000000, but rather /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-prod-rg/providers/Microsoft.Network/virtualNetworks/my-prod-vnet) I managed to implement the import workaround.

@BrandonE & @angelbarrera92

How did you find out the role Azure Resource ID that needs to be used with terraform import?

az role definition list --custom-role-only
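
For anyone applying the import workaround mentioned above: to my understanding, the import ID for azurerm_role_definition is the full role definition resource ID and the scope joined with a pipe, roughly as shown below (the GUIDs and subscription scope are placeholders; use the id returned by the az command above and check the import documentation for your provider version):

terraform import azurerm_role_definition.example "/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/roleDefinitions/11111111-1111-1111-1111-111111111111|/subscriptions/00000000-0000-0000-0000-000000000000"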

@tiwood we’ve noticed a new role definition gets created behind the scenes during both Create and Update and is eventually reconciled on the backend (it’s why #9850 is stalled, since I’ve not had time to dig into it yet, but that instead switches to checking the CreatedAt/UpdatedAt fields) - it could well be this is related actually 🤔

Oops - it's actually only waiting for state if the resource is not new. I'm going to make a PR to also wait if the resource is new, because that's why it failed for me.
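
For illustration, the change being described would presumably look something like the sketch below (hedged, not the actual PR; the create-timeout selection is an assumption): run the replication wait for new resources as well, instead of only when !d.IsNewResource().

	// Sketch only (not the actual PR): wait for replication on create as well as update.
	timeout := d.Timeout(schema.TimeoutUpdate)
	if d.IsNewResource() {
		timeout = d.Timeout(schema.TimeoutCreate)
	}

	stateConf := &resource.StateChangeConf{
		Pending: []string{
			"NotFound",
		},
		Target: []string{
			"OK",
		},
		Refresh:                   roleDefinitionUpdateStateRefreshFunc(ctx, client, id.ResourceID),
		MinTimeout:                10 * time.Second,
		ContinuousTargetOccurence: 6,
		Timeout:                   timeout,
	}

	if _, err := stateConf.WaitForState(); err != nil {
		return fmt.Errorf("waiting for Role Definition %q to finish replicating: %+v", name, err)
	}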

This is working for me now, but I’d really appreciate official word from Azure to understand why this happened and how we can know that this won’t happen in the future.

This issue is expected to have been fixed in the service backend, which does not require any Terraform changes. People who hit this issue, please give it a try.

Do you have documentation on the fix? This issue was catastrophic for me and I want to ensure it never happens again.

The strange thing is we’ve never had any issues creating role definitions in the past, but in the last week it has started failing consistently. Did anything change in the provider?

Same here. Our best workaround is to run it, let it fail, and then import the newly created role definitions into state. After that, everything operates as normal.