cloudformation-cli-python-plugin: CloudFormation fails with `Internal Failure` and no CloudWatch Logs

After successfully submitting to the CFN registry with cfn submit, and successfully testing with sam local invoke TestEntrypoint --event test.json (which launches the required resources in my AWS account), creating the resource via CloudFormation fails with the error Internal Failure.

I tried to find related CloudWatch logs, but nothing is available. Any ideas on what I might be missing?

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (5 by maintainers)

Most upvoted comments

I’d recommend removing the looping logic and instead conditionally return

ProgressEvent(status=OperationStatus.IN_PROGRESS, resourceModel=model, message="Workflow creating")

or

ProgressEvent(status=OperationStatus.SUCCESS, resourceModel=model, message="Workflow created")

depending on the response from the API. You can set the deployment ID in the callback context, and it will be included in the next invocation request after the initial creation. You can then branch on it: if the deployment ID is set in the callback context, run the status check that currently lives in your while loop; if it is not (the first invocation), simply try to create.

Our framework will handle reinvoking your resource handler. No need to try to stabilize on your own.
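The flow described above can be sketched without the plugin installed. This is only an illustration: `OperationStatus` and `ProgressEvent` below are simplified stand-ins for the plugin's classes, `FakeDeployApi` is a hypothetical backend, and `drive` merely imitates the framework re-invoking the handler with the previous callback context.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, MutableMapping


class OperationStatus(Enum):  # stand-in for the plugin's enum
    IN_PROGRESS = "IN_PROGRESS"
    SUCCESS = "SUCCESS"


@dataclass
class ProgressEvent:  # simplified stand-in, not the plugin's class
    status: OperationStatus
    resourceModel: Dict[str, Any]
    message: str = ""
    callbackContext: MutableMapping[str, Any] = field(default_factory=dict)


class FakeDeployApi:
    """Hypothetical backend: reports IN_PROGRESS for a few polls, then finishes."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._remaining = polls_until_done

    def start(self, name: str) -> str:
        return f"deploy-{name}"

    def status(self, deploy_id: str) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "IN_PROGRESS"
        return "SUCCEEDED"


def create_handler(api: FakeDeployApi, model: Dict[str, Any],
                   callback_context: MutableMapping[str, Any]) -> ProgressEvent:
    if "DEPLOY_ID" not in callback_context:
        # First invocation: start the deployment and return immediately.
        deploy_id = api.start(model["WorkflowName"])
        return ProgressEvent(OperationStatus.IN_PROGRESS, model,
                             "Workflow creating", {"DEPLOY_ID": deploy_id})
    # Re-invocation: one status check, no loop inside the handler.
    if api.status(callback_context["DEPLOY_ID"]) == "IN_PROGRESS":
        return ProgressEvent(OperationStatus.IN_PROGRESS, model,
                             "Workflow creating", dict(callback_context))
    return ProgressEvent(OperationStatus.SUCCESS, model, "Workflow created")


def drive(api: FakeDeployApi, model: Dict[str, Any]) -> List[ProgressEvent]:
    """Imitates the framework: re-invoke until the handler stops reporting IN_PROGRESS."""
    events: List[ProgressEvent] = []
    context: MutableMapping[str, Any] = {}
    while True:
        event = create_handler(api, model, context)
        events.append(event)
        if event.status is not OperationStatus.IN_PROGRESS:
            return events
        context = event.callbackContext


events = drive(FakeDeployApi(polls_until_done=2), {"WorkflowName": "demo"})
print([e.status.value for e in events])
```

The only loop here belongs to `drive`, the imitation of the framework. The handler itself does one unit of work per invocation and carries its state in the callback context.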

We got this and it turned out that our schema was invalid in a way that the registry wasn’t checking for but that failed deeper down in the service. I believe for us, it was that one of the properties used as primaryIdentifier had a value that was a list. Unfortunately, only the service team can actually find out, as these internal failures swallow the logs, which is a terrible customer experience.
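A local check for this particular pitfall might look like the sketch below. It is based only on the report above that a list-valued identifier caused trouble, not on documented service behaviour. The helper name `non_scalar_identifiers` is hypothetical, and it only handles top-level `/properties/Name` pointers.

```python
from typing import Any, Dict, List, Tuple

NON_SCALAR_TYPES = {"array", "object"}


def non_scalar_identifiers(schema: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (pointer, reason) pairs for primaryIdentifier entries that look risky."""
    properties = schema.get("properties", {})
    problems: List[Tuple[str, str]] = []
    for pointer in schema.get("primaryIdentifier", []):
        name = pointer.rsplit("/", 1)[-1]
        definition = properties.get(name)
        if definition is None:
            problems.append((pointer, "property is not defined"))
            continue
        declared = definition.get("type")
        # JSON Schema allows "type" to be a single string or a list of strings.
        types = set(declared) if isinstance(declared, list) else {declared}
        if types & NON_SCALAR_TYPES:
            problems.append((pointer, "identifier is an array or object"))
    return problems


schema = {
    "properties": {
        "WorkflowName": {"type": "string"},
        "Tags": {"type": "array", "items": {"type": "string"}},
    },
    "primaryIdentifier": ["/properties/WorkflowName", "/properties/Tags"],
}
print(non_scalar_identifiers(schema))
```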

Agree this isn’t great. FWIW, InternalFailure is really just a mask exception for something we haven’t handled specifically. If there are no CloudWatch logs, then it’s unlikely your type is even being invoked. Schema errors are the most likely culprit, but I’d have to look at a stack ARN to know for sure. If you have a support setup, file an issue through that channel first.

I think this will work for you

import json
from typing import Any, MutableMapping, Optional

from cloudformation_cli_python_lib import (
    Action,
    OperationStatus,
    ProgressEvent,
    SessionProxy,
    exceptions,
)

from .models import ResourceHandlerRequest

# resource, LOG, http, CD4AUTO_ML_API and HTTP_REQUEST_HEADER are assumed to be
# defined elsewhere in handlers.py, as in the original handler.


@resource.handler(Action.CREATE)
def create_handler(
    session: Optional[SessionProxy],
    request: ResourceHandlerRequest,
    callback_context: MutableMapping[str, Any],
) -> ProgressEvent:
    model = request.desiredResourceState
    progress: ProgressEvent = ProgressEvent(
        status=OperationStatus.IN_PROGRESS,
        resourceModel=model,
        callbackContext=callback_context if callback_context is not None else {},
    )
    try:
        req_payload = {
            'S3TrainingDataPath': model.S3TrainingDataPath,
            'TargetColumnName': model.TargetColumnName,
            'NotificationEmail': model.NotificationEmail,
            'WorkflowName': model.WorkflowName,
        }
        # auth = HTTPBasicAuth('API_KEY', '')
        LOG.info(f"Creating workflow {model.WorkflowName}")
        # resuming from a long CREATE operation
        if 'DEPLOY_ID' in progress.callbackContext:
            req_payload['DeployId'] = progress.callbackContext['DEPLOY_ID']
        req = http.request(
            'POST', url=CD4AUTO_ML_API, headers=HTTP_REQUEST_HEADER, body=json.dumps(req_payload)
        )
        payload = json.loads(req.data)
        deploy_status = payload['DeployStatus']
        # carry the deployment id over to the next invocation
        progress.callbackContext['DEPLOY_ID'] = payload['DeployId']
        if deploy_status in ('FAILED', 'FAULT', 'STOPPED', 'TIMED_OUT'):
            LOG.error(f"Workflow {model.WorkflowName} creation failed with status {deploy_status}")
            progress.status = OperationStatus.FAILED
        elif deploy_status == 'IN_PROGRESS':
            # still deploying: the framework re-invokes the handler with the callback context
            progress.status = OperationStatus.IN_PROGRESS
        else:
            # assumption: any other status means the deployment has finished
            model.InferenceApi = payload.get('ApiUri', 'MyTestUrl')
            LOG.info(f"Created workflow {model.WorkflowName} successfully")
            progress.status = OperationStatus.SUCCESS
        return progress

    except TypeError as e:
        LOG.error(f"Workflow creation failed with error {e}")
        # exceptions module lets CloudFormation know the type of failure that occurred
        # this can also be done by returning a failed progress event:
        # return ProgressEvent.failed(HandlerErrorCode.InvalidRequest, f"was not expecting type {e}")
        raise exceptions.InternalFailure(f"was not expecting type {e}")

Great! Suggest you cut an issue to the main https://github.com/aws-cloudformation/cloudformation-cli repo showing that regex is a problem. We’ll have to fix the validation.

For the new issue, you should be able to see logs now? This error occurs if your type does not return a primaryIdentifier value back to CloudFormation within 1 minute. The way to fix this is to return a ProgressEvent.progress with your ResourceModel populated with its primaryIdentifier as soon as you have it, and then continue stabilization.
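One way to catch a missing identifier before deploying is a small unit-test helper that compares the model in the handler's first event against the schema's primaryIdentifier list. The sketch below uses plain dicts in place of the generated ResourceModel, and `missing_identifiers` is a hypothetical name, not part of the plugin.

```python
from typing import Any, Dict, List

PREFIX = "/properties/"


def missing_identifiers(schema: Dict[str, Any], model: Dict[str, Any]) -> List[str]:
    """Names of primaryIdentifier properties that are absent or empty in the model."""
    names = [
        pointer[len(PREFIX):]
        for pointer in schema.get("primaryIdentifier", [])
        if pointer.startswith(PREFIX)
    ]
    return [name for name in names if model.get(name) in (None, "")]


schema = {"primaryIdentifier": ["/properties/WorkflowName"]}

# Model as it might appear in the first IN_PROGRESS event of a create handler.
first_event_model = {"WorkflowName": "demo", "InferenceApi": None}
print(missing_identifiers(schema, first_event_model))

# A model that has not been given its identifier yet.
print(missing_identifiers(schema, {"InferenceApi": None}))
```

An empty list for the first event means every identifier is already populated when the handler first returns, which is the condition the comment above describes.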

@rjlohan, I removed the Regex and it seems to work now. However, I get the error: Resource timed out waiting for creation of physical resource.

@rjlohan, thanks. Below is the schema:

{
    "typeName": "CD4AutoML::Workflow::Deploy",
    "description": "Cd4AutoML managed end-to-end workflow with model serving REST API",
    "properties": {
        "S3TrainingDataPath": {
            "description": "S3 Path containing training data. Data format must CSV with headers.",
            "type": "string"
        },
        "TargetColumnName": {
            "description": "The name of the target column to be predicted. This MUST be last column in the CSV data in S3",
            "type": "string"
        },
        "NotificationEmail": {
            "description": "Valid email address to receive email notifications.",
            "type": "string",
            "pattern": "^[\\x20-\\x45]?[\\w-\\+]+(\\.[\\w]+)*@[\\w-]+(\\.[\\w]+)*(\\.[a-z]{2,})$",
            "examples": [
                "test@example.com"
            ]
        },
        "WorkflowName": {
            "description": "Unique Name or Identifier for CD4AutoML workflow",
            "type": "string",
            "maxLength": 15
        },
        "Schedule": {
            "description": "Number of days before retraining AutoML model.",
            "type": "string"
        },
        "InferenceApi": {
            "description": "REST API URI for real time model inference available once workflow is completed",
            "type": "string"
        }
    },
    "additionalProperties": false,
    "required": [
        "S3TrainingDataPath",
        "TargetColumnName",
        "NotificationEmail",
        "WorkflowName"
    ],
    "readOnlyProperties": [
        "/properties/InferenceApi"
    ],
    "primaryIdentifier": [
        "/properties/WorkflowName"
    ],
    "handlers": {
        "create": {
            "permissions": [""]
        },
        "read": {
            "permissions": [""]
        },
        "update": {
            "permissions": [""]
        },
        "delete": {
            "permissions": [""]
        },
        "list": {
            "permissions": [""]
        }
    }
}

What may I be missing?
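Since the thread points at the regex as the problem, a rough local sanity check is to try compiling every `pattern` in the schema. This is a heuristic sketch only: Python's `re` engine is not necessarily the one the registry or the service uses, so a pattern that fails here may be accepted elsewhere, and the reverse. The helper name `uncompilable_patterns` is hypothetical.

```python
import re
from typing import Any, Dict, List, Tuple


def uncompilable_patterns(schema: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (property name, error message) for patterns Python's re cannot compile."""
    problems: List[Tuple[str, str]] = []
    for name, definition in schema.get("properties", {}).items():
        pattern = definition.get("pattern")
        if pattern is None:
            continue
        try:
            re.compile(pattern)
        except re.error as err:
            problems.append((name, str(err)))
    return problems


# The NotificationEmail pattern from the schema above, after JSON unescaping.
schema = {
    "properties": {
        "NotificationEmail": {
            "type": "string",
            "pattern": r"^[\x20-\x45]?[\w-\+]+(\.[\w]+)*@[\w-]+(\.[\w]+)*(\.[a-z]{2,})$",
        },
        "WorkflowName": {"type": "string", "maxLength": 15},
    }
}

for name, error in uncompilable_patterns(schema):
    print(f"{name}: {error}")
```

With Python's engine, the `[\w-\+]` class is rejected because a range cannot start from a character class such as `\w`. Moving the hyphen to the end of the class (for example `[\w+-]`) avoids the ambiguity, though whether that was the exact cause of the failure in the service is not stated in the thread.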