aws-cdk: OpenSearch: Bug in Describe-Domain API is causing CFN GetAtt "Internal error occurred"

What is the problem?

When you create an OpenSearch Domain in a VPC and then reference its endpoint in the AWS CDK (thereby creating a GetAtt reference in CloudFormation), the Domain is created successfully, but the CloudFormation resource (Fargate) that references the endpoint fails with “Internal error occurred” (see attached screenshot). Additional findings from my research are detailed in “Other information” below.

Reproduction Steps

from aws_cdk import aws_opensearchservice as opensearch

# Runs inside a Stack's __init__; opensearch_params and zone_count are defined elsewhere.
self.opensearch_domain = opensearch.Domain(self, "OpenSearchIndices",
    **opensearch_params,
    version=opensearch.EngineVersion.OPENSEARCH_1_0,
    vpc=self.scope.network_stack.vpc,
    logging={
        "slow_search_log_enabled": True,
        "app_log_enabled": True,
        "slow_index_log_enabled": True
    },
    encryption_at_rest={
        "enabled": True
    },
    zone_awareness=opensearch.ZoneAwarenessConfig(
        enabled=True,
        availability_zone_count=zone_count
    ),
    removal_policy=self.data_resources_removal_policy
)
# Referencing the endpoint is what produces the Fn::GetAtt in the template.
self.opensearch_endpoint = self.opensearch_domain.domain_endpoint

What did you expect to happen?

All resources are created successfully.

What actually happened?

The CloudFormation stack rolls back due to a resource creation failure (same screenshot as above).

CDK CLI Version

2.3

Framework Version

No response

Node.js Version

16.13.1

OS

Mac OS 12.1

Language

Python

Language Version

3.10.1

Other information

I noticed that I didn’t have this problem when creating a public OpenSearch Domain. So I thought it might have something to do with how the API is returning domain endpoints with Domains created in a VPC vs public Domains.

I created a public Domain and then ran aws opensearch describe-domain against both the Domain created with the CDK and the test public Domain. Here were the results:

# Public Domain
~ % aws opensearch describe-domain --domain-name test | jq '.DomainStatus.Endpoint'
"search-test-xxxxxxxxx.us-east-1.es.amazonaws.com"
# Domain in VPC
~ % aws opensearch describe-domain --domain-name dataindic-xxxxxxxxx | jq '.DomainStatus.Endpoint' 
null
~ % aws opensearch describe-domain --domain-name dataindic-xxxxx | jq '.DomainStatus.Endpoints'
{
  "vpc": "vpc-xxxxxxx-yyyyyyy-zzzzzzzz.us-east-1.es.amazonaws.com"
}

As you can see, the Endpoint value is null for Domains in the VPC. Instead, it appears to put that value in a new key called “Endpoints”. It appears that maybe CloudFormation wasn’t updated to support the new “Endpoints” key or OpenSearch should be publishing endpoints for Domains in the VPC.

I understand that this might be a CloudFormation or OpenSearch bug, but until those teams sort it out, it’s obviously a bug in the AWS CDK. And it seems like this is something the CDK could maybe work around for the time being with a custom resource. Example:

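import boto3

opensearch_client = boto3.client('opensearch')
opensearch_domain_details = opensearch_client.describe_domain(
    DomainName=aws_opensearch_domain_name  # name of the Domain, defined elsewhere
)['DomainStatus']
# Public Domains populate 'Endpoint'; VPC Domains populate 'Endpoints' -> 'vpc' instead.
opensearch_endpoint = opensearch_domain_details.get('Endpoint') or opensearch_domain_details.get('Endpoints', {}).get('vpc')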

About this issue

State: closed
Created 2 years ago
Reactions: 6
Comments: 39 (19 by maintainers)

Most upvoted comments

Yes, the service team has gotten back to me and confirmed that, for the stack ARNs provided, all of the failures were due to throttling limits.

I’m working with them on getting the error message improved, to make it clear to the user what is causing the failure.

@peterwoodworth I could imagine this being our issue too, although we only have 5 task definitions using that value. This seems like something CloudFormation should just auto-retry?

automartin5000 on Feb 25, 2022

@peterwoodworth said:

I’ve been told that you will now receive a proper error message in the case of throttling. Can anyone here confirm this is the case?

Most of us currently following here have some form of workaround for this issue in place, and I don’t think any of us will be removing that workaround until this issue is fixed properly. We will not be removing our workarounds, because we cannot expose our stacks to non-deterministic failures. I’ve already described the experience I had where the rollback failed because I hit this error during a non-reversible upgrade. A decent error message would not have helped me get out of the awful position this CloudFormation defect left me in.

A proper error message is better than nothing. However, throttling is an implementation detail of the deployments CloudFormation does, and we as users should not be exposed to it at all. The abstraction is leaking.

SamStephens on Aug 9, 2022

You can work around this issue by using CDK’s dependency mechanism to slow down the API requests. If you have multiple resources (e.g. A, B, C, D) accessing OpenSearch Domain attributes, you can make them execute one after the other (rather than simultaneously) with D.node.addDependency(C); C.node.addDependency(B); B.node.addDependency(A); - see details in the docs.
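For the Python bindings, the chaining described above can be wrapped in a small helper. This is a minimal sketch, not an official CDK utility: the name `chain_dependencies` is hypothetical, and it only assumes that each item exposes `node.add_dependency(...)`, the Python counterpart of `node.addDependency(...)`.

```python
from typing import Any, Sequence


def chain_dependencies(constructs: Sequence[Any]) -> None:
    """Make each construct depend on the one before it in the sequence.

    Assumption: every item exposes `node.add_dependency(...)`, as CDK
    constructs do in Python. The helper itself is hypothetical.
    """
    # Pair each construct with its successor: (A, B), (B, C), (C, D), ...
    for earlier, later in zip(constructs, constructs[1:]):
        later.node.add_dependency(earlier)
```

Calling `chain_dependencies([a, b, c, d])` is equivalent to the three `addDependency` calls above, so CloudFormation creates the resources one after another instead of resolving the Domain attribute for all of them at once. The trade-off is a slower deployment, since the resources are no longer created in parallel.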

You can also attempt more drastic solutions like using a Lambda custom resource or Systems Manager parameters, but I think the dependency mechanism is the simplest way to do it.

Some time ago, @peterwoodworth asked for an update on the tracking ticket but there is no update to share at this time. I don’t work for AWS (hence not involved in the prioritization of issues) but adding a +1 to this issue may help to get a faster resolution.

jwang1048 on Apr 27, 2022

In my case, it was caused by passing the endpoint URL to a Lambda (CustomResource) on each call (which I guess caused the attribute to be resolved multiple times, plus throttling).

I changed it to an environment variable in the Lambda, and it’s solved now.

PS: However, I think this should be fixed on the AWS/CDK side by implementing retries with backoff (or whatever the mechanism). I can’t apply this solution to all our use cases, and it essentially limits the number of resources you can deploy within your CDK…
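Until something like that exists on the service side, a Lambda custom resource that looks up the endpoint itself could apply the same idea. The following is a minimal sketch of retrying with exponential backoff and jitter, not a description of what CloudFormation or the CDK does; `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    is_throttled: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying with exponential backoff while it is throttled."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as error:
            # Give up on non-throttling errors and on the last attempt.
            if not is_throttled(error) or attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to base_delay * 2^attempt.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

With boto3, `operation` would wrap the `describe_domain` call, and `is_throttled` could check whether the raised `ClientError` carries `response["Error"]["Code"] == "ThrottlingException"`, the error code that shows up in the CloudTrail record in this thread. The `sleep` parameter is injectable only so the helper can be exercised without real waiting.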

TinoSM on Feb 28, 2022

Hi, I saw a similar error and opened an AWS support case (but didn’t get my problem resolved there). I am able to reliably reproduce the error with a small snippet of (mostly) vanilla CDK code:

import * as os from "monocdk/aws-opensearchservice";
import * as lambda from "monocdk/aws-lambda";
import {SecurityGroup, SubnetType, Vpc} from "monocdk/aws-ec2";
import {App, RemovalPolicy, Stack, Tags} from "monocdk";
// DeploymentStack and TestStackProps are company-specific and not shown here.
export class TestStack extends DeploymentStack {
    constructor(parent: App, id: string, props: TestStackProps) {
        super(parent, id, { /* company specific boilerplate */ });
        const vpc = new Vpc(this, 'TheVPC', {
            cidr: "10.1.0.0/16",
            maxAzs: 1
        });
        const devDomain = new os.Domain(this, 'Domain', {
            version: os.EngineVersion.ELASTICSEARCH_7_10,
            vpc: vpc,
            enforceHttps: true,
            removalPolicy: RemovalPolicy.DESTROY,
        });
        for (let i = 0; i < 10; i++) {
            const func = new lambda.Function(this, "Func" + i, {
                runtime: lambda.Runtime.PYTHON_3_8,
                handler: "test",
                environment: {
                    "ES_ENDPOINT": devDomain.domainEndpoint,
                },
                vpc: vpc,
                allowPublicSubnet: false,
                code: new lambda.InlineCode(`import os
def test(event, context):
  print(event)
  return os.environ`)
            });
            Tags.of(func).add("endpoint", devDomain.domainEndpoint);
            //Tags.of(func).add("arn", devDomain.domainArn);
        }
    }
}

Looking at the CloudTrail logs, it appears that there are throttling exceptions on the ListTags and DescribeDomain operations. Most likely the failure was caused by throttling of this DescribeDomain request (the last one). I was unable to find any more details about the requests.

{
    "eventVersion": "1.08",
    "userIdentity": {/*removed*/},
    "eventTime": "2022-02-24T13:52:31Z",
    "eventSource": "es.amazonaws.com",
    "eventName": "DescribeDomain",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    "errorCode": "ThrottlingException",
    "errorMessage": "Rate exceeded",
    "requestParameters": null,
    "responseElements": null,
    "requestID": "c70a9982-c6a1-4cfe-861a-53667e144e50",
    "eventID": "ee0d4410-a20c-4903-87f4-d5cda62c000c",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "970187786794",
    "eventCategory": "Management"
}
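To check whether a failed deployment matches this pattern, the CloudTrail records can be scanned for throttled calls made by CloudFormation. This is a minimal Python sketch, assuming the records have already been loaded as dictionaries (for example from the `Records` array of an exported CloudTrail log file). The function name `count_throttled_calls` is hypothetical, and the field names are the ones visible in the event above.

```python
from collections import Counter
from typing import Iterable


def count_throttled_calls(events: Iterable[dict]) -> Counter:
    """Count throttled API calls issued by CloudFormation, keyed by eventName."""
    counts: Counter = Counter()
    for event in events:
        # Match the two fields seen in the record above: the error code
        # and the CloudFormation user agent.
        if (
            event.get("errorCode") == "ThrottlingException"
            and event.get("userAgent") == "cloudformation.amazonaws.com"
        ):
            counts[event.get("eventName", "unknown")] += 1
    return counts
```

A result with several `DescribeDomain` or `ListTags` entries during the deployment window would be consistent with the throttling explanation. It does not prove that a particular throttled call caused the “Internal error occurred” failure.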

Additionally, it does appear to be a throttling issue on API calls, as reducing the number of Lambda functions from 10 to 1 allows the stack to create successfully. I am not entirely sure how CloudFormation does the tagging, but it appears to make an API call for each tag, which could be quite a lot for a Lambda function in a VPC (associated with a security group and IAM role at least).

Stack ID: arn:aws:cloudformation:us-west-2:970187786794:stack/TestStack-beta-us-west-2/cdc86410-94e0-11ec-a667-023be3ac2b21

(Issue is reproducible in us-east-1 as well).

The workaround I’m currently trying with success is reducing the number of Fn::GetAtt calls to the domain resource by eliminating excess tags. Maybe you could also try to “spread out” the API requests by interleaving them with resources that take longer to create using CDK’s dependency mechanism.

jwang1048 on Feb 24, 2022