terraform-provider-aws: [Bug]: IPAM allocation fails with "InvalidIpamPoolAllocationId"

Terraform Core Version

1.3.2

AWS Provider Version

4.32.0 and 4.50.0

Affected Resource(s)

aws_vpc_ipam_pool_cidr_allocation

Expected Behavior

I expected the IPAM allocation to be created successfully.

Actual Behavior

In our environment we use a multi-account setup. The IPAM pools are created in one account and shared via RAM with another account. We run into the issue described below when we try to allocate a CIDR in the shared IPAM pool.

To create our IPAM pool allocation we are using this snippet in our code:

resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra" {
  count          = var.cf_subnet_infra_count
  ipam_pool_id   = var.ipam_pool_id
  netmask_length = 27
}

But immediately afterwards we get the following error:

Error: InvalidIpamPoolAllocationId.NotFound: The IPAM pool allocation (ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01) does not exist.
status code: 400, request id: 9683f21c-8972-4c40-8227-72f5c219e5d3

with aws_vpc_ipam_pool_cidr_allocation.vpc-ipam-pool-alloc-cidr-cf-subnet-infra[0],
on ipam_pool_allocations.tf line 1, in resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra":
1: resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra" {

When we run “aws ec2 get-ipam-pool-allocations --ipam-pool-id ipam-pool-0b325bb4efc6dacae”, all IpamPoolAllocations are returned correctly.

  • We have not changed any of the Terraform code related to the IPAM pool allocations.
  • Three different people tried running the setup while assuming the role “workload-terraform-role” and ran into the issue as well.
  • We also tried running aws_vpc_ipam_pool_cidr_allocation in a different AWS account and ran into the same issue.
  • On previous runs a couple of weeks/months ago, the same code correctly created the aws_vpc_ipam_pool_cidr_allocation without throwing the error.

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

resource "aws_vpc_ipam_pool_cidr_allocation" "vpc-ipam-pool-alloc-cidr-cf-subnet-infra" {
  count          = var.cf_subnet_infra_count
  ipam_pool_id   = var.ipam_pool_id
  netmask_length = 27
}

Steps to Reproduce

  1. Create an IPAM pool
  2. Try to allocate a CIDR in the pool

Debug Output

2023-01-16T10:54:34.077+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Action=GetIpamPoolAllocations&IpamPoolAllocationId=ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01&IpamPoolId=ipam-pool-0b325bb4efc6dacae&Version=2016-11-15
2023-01-16T10:54:34.077+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: -----------------------------------------------------
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Response ec2/GetIpamPoolAllocations Details:
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: ---[ RESPONSE ]--------------------------------------
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: HTTP/1.1 400 Bad Request
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Connection: close
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Transfer-Encoding: chunked
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Cache-Control: no-cache, no-store
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Content-Type: text/xml;charset=UTF-8
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Date: Mon, 16 Jan 2023 09:54:33 GMT
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Server: AmazonEC2
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Strict-Transport-Security: max-age=31536000; includeSubDomains
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: Vary: accept-encoding
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: X-Amzn-Requestid: 9683f21c-8972-4c40-8227-72f5c219e5d3
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: 
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: 
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: -----------------------------------------------------
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: [DEBUG] [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: <Response><Errors><Error><Code>InvalidIpamPoolAllocationId.NotFound</Code><Message>The IPAM pool allocation (ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01) does not exist.</Message></Error></Errors><RequestID>9683f21c-8972-4c40-8227-72f5c219e5d3</RequestID></Response>
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5: [DEBUG] [aws-sdk-go] DEBUG: Validate Response ec2/GetIpamPoolAllocations failed, attempt 0/25, error InvalidIpamPoolAllocationId.NotFound: The IPAM pool allocation (ipam-pool-alloc-0f1fe03456e174fea9c82affb5ee35e01) does not exist.
2023-01-16T10:54:34.330+0100 [DEBUG] provider.terraform-provider-aws_v4.32.0_x5:     status code: 400, request id: 9683f21c-8972-4c40-8227-72f5c219e5d3

Panic Output

No response

Important Factoids

No response

References

No response

Would you like to implement a fix?

None



Alexander Barth (alexander.barth@mercedes-benz.com) on behalf of Mercedes-Benz Tech Innovation GmbH, Provider Information

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 23
  • Comments: 18 (5 by maintainers)

Most upvoted comments

Hello Kevin, thanks for putting the effort into the sample code! In the provider we have mechanisms for retrying and waiting (retries and waiters), and our PR guidelines suggest that we follow any existing patterns in the resource being modified.

I have added the retry and wait mechanisms to account for eventual consistency of the read operation, and I’ve added additional acceptance tests to verify cross-region pool CIDR allocation.

I assure you this is being worked on. The provider team does releases each Thursday.

I can successfully reproduce this in an acceptance test. Working on a fix.

We’re seeing similar issues with that resource as well (using 4.50.0). The IPAM pool isn’t shared with RAM in our case; all operations happen in the same AWS account.

First attempt

plan

# aws_vpc_ipam_pool_cidr_allocation.workload[0] will be created
+ resource "aws_vpc_ipam_pool_cidr_allocation" "workload" {
  + cidr                    = (known after apply)
  + description             = "some desc"
  + id                      = (known after apply)
  + ipam_pool_allocation_id = (known after apply)
  + ipam_pool_id            = "ipam-pool-xxxxxxxxxx"
  + netmask_length          = 25
  + resource_id             = (known after apply)
  + resource_owner          = (known after apply)
  + resource_type           = (known after apply)
}

apply

apply fails with the following error message; however, the AWS Console shows that the allocation was created in the IPAM service.

Error: reading IPAM Pool CIDR Allocation (ipam-pool-alloc-xxxxxxxxxxx_ipam-pool-xxxxxxxxxx): couldn't find resource

Second attempt

plan

Resource is shown as tainted.

# aws_vpc_ipam_pool_cidr_allocation.workload[0] is tainted, so must be replaced
-/+ resource "aws_vpc_ipam_pool_cidr_allocation" "workload" {
  + cidr                    = (known after apply)
  ~ id                      = "ipam-pool-alloc-xxxxxxxxxxx_ipam-pool-xxxxxxxxxx" -> (known after apply)
  + ipam_pool_allocation_id = (known after apply)
  + resource_id             = (known after apply)
  + resource_owner          = (known after apply)
  + resource_type           = (known after apply)
    # (3 unchanged attributes hidden)
}

apply

Because the resource is tainted, it is being deleted, but that fails as well.

Error: deleting IPAM Pool CIDR Allocation (ipam-pool-alloc-xxxxxx_ipam-pool-xxxxx): InvalidParameterValue: The CIDR specified :  is not in proper format.

(^ not a typo in error message, a value is missing)
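The empty value in that message suggests the delete call was sent with a blank CIDR, presumably because the failed read never stored one in state. As a rough sketch only (not the provider's actual code; `validateCIDR` is a hypothetical helper), a guard like the following would surface the missing value before any release call is made:

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateCIDR is a hypothetical guard: it rejects an empty or malformed
// CIDR before it would be passed on to a release/delete API call.
func validateCIDR(cidr string) error {
	if cidr == "" {
		return fmt.Errorf("cidr is empty; the allocation was probably never read back into state")
	}
	if _, err := netip.ParsePrefix(cidr); err != nil {
		return fmt.Errorf("cidr %q is not a valid prefix: %w", cidr, err)
	}
	return nil
}

func main() {
	// An empty value (as in the error above) and a well-formed /25.
	for _, c := range []string{"", "10.0.0.0/25"} {
		fmt.Printf("%q -> %v\n", c, validateCIDR(c))
	}
}
```

This only makes the failure easier to diagnose; it does not address the underlying read problem.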

EDIT: We’ve opened a case with AWS support in the meantime, as we believe this is likely an issue with the AWS IPAM service API rather than the provider. We were able to replicate the issue with the AWS CLI as well.

Hi @Tailzip - Can you tell me the CLI steps to reproduce? When I run aws ec2 get-ipam-pool-allocations --ipam-pool-id POOL-ID --ipam-pool-allocation-id ALLOCATION-ID, I consistently get the correct result.

AWS will not provide support unless we show them that their CLI also fails.

A quick clarification: AWS Enterprise Support does offer Third-Party Product support, including open source software such as Terraform. I agree that having a reproducible case in a script using the AWS CLI is certainly helpful, though not required.

AWS works with HashiCorp and the open source community to evaluate and prioritize issues as per the Terraform AWS Provider FAQ.

I can successfully reproduce this in an acceptance test. Working on a fix.

I also just started on this issue and added this code block to ipam_pool_cidr_allocation.go:

	// Handle eventual consistency of the API by retrying the read for up to one minute.
	return resource.Retry(time.Minute, func() *resource.RetryError {
		err = resourceIPAMPoolCIDRAllocationRead(d, meta)

		if err != nil {
			// The allocation may not be visible yet, so NotFound is retryable.
			if tfresource.NotFound(err) {
				return resource.RetryableError(fmt.Errorf("IPAM Pool CIDR Allocation (%s) not yet ready", d.Id()))
			}
			// Any other error is final.
			return resource.NonRetryableError(err)
		}

		return nil
	})
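For readers without the provider source at hand, the same retry-on-NotFound idea can be sketched with only the Go standard library. This is an illustration, not the provider's implementation: `errNotFound`, `retryOnNotFound`, and `demo` are hypothetical names, and the simulated read simply fails twice before succeeding.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound is a hypothetical sentinel standing in for the API's
// InvalidIpamPoolAllocationId.NotFound error.
var errNotFound = errors.New("allocation not found")

// retryOnNotFound calls read until it succeeds, returns an error other
// than errNotFound, or the timeout has elapsed.
func retryOnNotFound(timeout, interval time.Duration, read func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := read()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errNotFound) {
			return err // not an eventual-consistency symptom, so give up
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("still not found after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// demo simulates a read that only becomes visible on the third attempt
// and reports how many reads were needed.
func demo() (int, error) {
	calls := 0
	err := retryOnNotFound(time.Second, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errNotFound
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Printf("reads=%d err=%v\n", calls, err)
}
```

In the simulated run the read succeeds on the third attempt. A real implementation would likely also need to bound the total number of API calls and respect context cancellation.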

We need this change urgently. Will you be working on this within the next few days, or should I open a PR? If the latter, could you share your test code?

@AdamTylerLynch I will do that. I just thought it was worth mentioning that using import was not a workaround for us.

Hi @Tailzip - Can you tell me the CLI steps to reproduce? When I run aws ec2 get-ipam-pool-allocations --ipam-pool-id POOL-ID --ipam-pool-allocation-id ALLOCATION-ID, I consistently get the correct result. AWS will not provide support unless we show them that their CLI also fails.

I’ve been running the following script, and the issue happens randomly after a couple of runs:

script.sh

#!/bin/bash

set -e

export AWS_REGION=eu-central-1
export AWS_DEFAULT_REGION=eu-central-1
export AWS_DEFAULT_OUTPUT=json

IPAM_POOL_ID="ipam-pool-xxxxxxxxxxxxxxxxx"
ALLOCATION_ID="$(aws ec2 allocate-ipam-pool-cidr --ipam-pool-id "$IPAM_POOL_ID" --netmask-length 25 --description 'troubleshoot' | jq -c -r '.IpamPoolAllocation.IpamPoolAllocationId')"

aws ec2 get-ipam-pool-allocations \
    --ipam-pool-id "$IPAM_POOL_ID" \
    --ipam-pool-allocation-id "$ALLOCATION_ID" \
    --no-cli-pager

Interesting. For me the script always runs through without any issues, while creation through Terraform still throws the error InvalidIpamPoolAllocationId.NotFound. The exact same IAM role was used.

Mili Durasovic mili.durasovic@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH Provider Information

We have the exact same problem as @Tailzip. We are referencing the cidr in a local. It seems that the local is being evaluated way too early. The terraform resource might indicate that the allocation is done, but it seems like it’s still ongoing asynchronously in AWS.

We tried to remove the tainted resource and then import it, but that didn’t work. Afterwards the plan showed a completely new resource being created, and applying that results in the same error again (“couldn’t find resource”). The import documentation looks a little odd as well: the text says that the allocation ID is used for the import, but the example only shows the resource.

EDIT: We verified the problem with 4.19.0, 4.46.0 and 4.48.0. It seems to have started sporadically last week, but it now happens consistently.