terraform-provider-azurerm: azurerm_role_definition WaitForState pending state not set correctly

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform v0.14.6

  • provider registry.terraform.io/hashicorp/azurerm v2.36.0

Affected Resource(s)

  • azurerm_role_definition

Terraform Configuration Files

resource "azurerm_role_definition" "example" {
  name        = "my-custom-role"
  scope       = "ROOT MANAGEMENT GROUP"
  description = "This is a custom role created via Terraform"

  permissions {
    actions     = ["*"]
    not_actions = []
  }

  assignable_scopes = [
    "ROOT MANAGEMENT GROUP"
  ]
}

Debug Output

Error: Provider produced inconsistent result after apply

When applying changes to
azurerm_role_definition.custom_roles["workload_contributor.yml"], provider
"registry.terraform.io/hashicorp/azurerm" produced an unexpected new value:
Root resource was present, but now absent.

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Steps to Reproduce

  1. terraform apply

Important Factoids

It seems the WaitForState() function is called with the wrong Pending state, as the API returns 404 while the role is not found yet.

Currently:

	if !d.IsNewResource() {
		id, err := parse.RoleDefinitionId(d.Id())
		if err != nil {
			return err
		}
		stateConf := &resource.StateChangeConf{
			Pending: []string{
				"Pending",
			},
			Target: []string{
				"OK",
			},
			Refresh:                   roleDefinitionUpdateStateRefreshFunc(ctx, client, id.ResourceID),
			MinTimeout:                10 * time.Second,
			ContinuousTargetOccurence: 6,
			Timeout:                   d.Timeout(schema.TimeoutUpdate),
		}

		if _, err := stateConf.WaitForState(); err != nil {
			return fmt.Errorf("waiting for update to Role Definition %q to finish replicating", name)
		}
	}

Should be:

	if !d.IsNewResource() {
		id, err := parse.RoleDefinitionId(d.Id())
		if err != nil {
			return err
		}
		stateConf := &resource.StateChangeConf{
			Pending: []string{
				"NotFound",
			},
			Target: []string{
				"OK",
			},
			Refresh:                   roleDefinitionUpdateStateRefreshFunc(ctx, client, id.ResourceID),
			MinTimeout:                10 * time.Second,
			ContinuousTargetOccurence: 6,
			Timeout:                   d.Timeout(schema.TimeoutUpdate),
		}

		if _, err := stateConf.WaitForState(); err != nil {
			return fmt.Errorf("waiting for update to Role Definition %q to finish replicating", name)
		}
	}
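
For context, the refresh function referenced by both snippets presumably looks something like the sketch below (the client, helper, and parameter names are assumptions based on the provider's conventions, not verbatim provider code): it reports "NotFound" while the Authorization API still returns 404 and "OK" once the definition is readable. Because "Pending" is never a state the refresh function returns, the 404 observed during replication is treated as an unexpected state and WaitForState() fails straight away, which surfaces as the error shown above.

	// Minimal sketch of the refresh function (assumed shape, not the provider's exact code):
	// poll the role definition by ID and map the HTTP outcome to a state string.
	func roleDefinitionUpdateStateRefreshFunc(ctx context.Context, client *authorization.RoleDefinitionsClient, roleDefinitionId string) resource.StateRefreshFunc {
		return func() (interface{}, string, error) {
			resp, err := client.GetByID(ctx, roleDefinitionId)
			if err != nil {
				if utils.ResponseWasNotFound(resp.Response) {
					// Replication has not caught up yet - the API still returns 404.
					return resp, "NotFound", nil
				}
				return resp, "Error", err
			}
			return resp, "OK", nil
		}
	}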

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 36
  • Comments: 25 (5 by maintainers)

Most upvoted comments

This is breaking automation for a lot of people. Personally, I have Terraform that should run automatically on a daily basis but that I now have to intervene in manually every time. Can we get some transparency on a roadmap or timeline for when this issue will be addressed? Coming over from #10442, it seems this issue is not being properly recognized as massively disruptive to many people's workdays.

In order to improve call latencies, we had introduced distributed caching of Role Definitions. As a result of this change, some customers experienced failures when a role definition was read immediately after writing. The root cause of the failure was that the read request was served from the cache, which was not updated immediately after the write. We have disabled the distributed caching of Role Definitions, which has fixed the failures.

The root cause has been well understood and we will take extreme care in the future to prevent repeat occurrences of similar failures for this scenario and other scenarios.

There is still a small probability that read-after-write scenarios can experience this failure, since we cache Role Definitions in memory as well. The in-memory cache is not new and we have had it for a long time, so the probability of this issue happening again is the same as it was before we added the distributed cache.

Thank you all for the feedback which helps make our services better.

This issue is expected to have been fixed in the service backend, which does not require any Terraform changes. People who hit this issue, please give it a try.

I am another customer whose deploys have all started working reliably again for a few days, suggesting this has been fixed.

Hi all, I posted my findings here after seeing things become stable in my local Terraform project involving the role definition story, compared with the previous symptom where I saw a 50% failure rate. That said, I'm sorry but I do not have context on the backend.

Hi BrandonE, I can try contacting the corresponding backend support team to see whether they can provide an explanation, if that would help.

Thank you both. After reading the docs (and also figuring out that my scope is not /subscriptions/00000000-0000-0000-0000-000000000000, but rather /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-prod-rg/providers/Microsoft.Network/virtualNetworks/my-prod-vnet) I managed to implement the import workaround.

@BrandonE & @angelbarrera92

How did you find out the role Azure Resource ID that needs to be used with terraform import?

az role definition list --custom-role-only
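
For anyone applying the import workaround mentioned above: to my understanding, the import ID for azurerm_role_definition is the full role definition resource ID and the scope joined with a pipe, roughly as shown below (the GUIDs and subscription scope are placeholders; use the id returned by the az command above and check the import documentation for your provider version):

terraform import azurerm_role_definition.example "/subscriptions/00000000-0000-0000-0000-000000000000/providers/Microsoft.Authorization/roleDefinitions/11111111-1111-1111-1111-111111111111|/subscriptions/00000000-0000-0000-0000-000000000000"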

@tiwood we’ve noticed a new role definition gets created behind the scenes during both Create and Update and is eventually reconciled on the backend (it’s why #9850 is stalled, since I’ve not had time to dig into it yet, but that instead switches to checking the CreatedAt/UpdatedAt fields) - it could well be this is related actually 🤔

Oops - it's actually only waiting for state if the resource is not new. I'm going to make a PR to also wait if the resource is new, because that's why it failed for me.
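
For illustration, the change being described would presumably look something like the sketch below (hedged, not the actual PR; the create-timeout selection is an assumption): run the replication wait for new resources as well, instead of only when !d.IsNewResource().

	// Sketch only (not the actual PR): wait for replication on create as well as update.
	timeout := d.Timeout(schema.TimeoutUpdate)
	if d.IsNewResource() {
		timeout = d.Timeout(schema.TimeoutCreate)
	}

	stateConf := &resource.StateChangeConf{
		Pending: []string{
			"NotFound",
		},
		Target: []string{
			"OK",
		},
		Refresh:                   roleDefinitionUpdateStateRefreshFunc(ctx, client, id.ResourceID),
		MinTimeout:                10 * time.Second,
		ContinuousTargetOccurence: 6,
		Timeout:                   timeout,
	}

	if _, err := stateConf.WaitForState(); err != nil {
		return fmt.Errorf("waiting for Role Definition %q to finish replicating: %+v", name, err)
	}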

This is working for me now, but I’d really appreciate official word from Azure to understand why this happened and how we can know that this won’t happen in the future.

This issue is expected to have been fixed in the service backend, which does not require any Terraform changes. People who hit this issue, please give it a try.

Do you have documentation on the fix? This issue was catastrophic for me and I want to ensure it never happens again.

The strange thing is we’ve never had any issues creating role definitions in the past, but in the last week it has started failing consistently. Did anything change in the provider?

Same here. Our best workaround is to run it, let it fail, and then import the newly created role definitions into state. After that, everything operates as normal.