amazon-ecs-agent: Custom AMIs built from latest ECS optimised AMI don't always connect to ECS

Summary

The latest ecs-agent version (1.48.0) contains a feature which interacts badly with some established ways of creating custom AMIs.

Description

We have a system for building custom AMIs from the latest public optimised AMI, i.e. https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html. We observed that instances launched from our newest AMI failed to connect to ECS - apart from one solitary success, they all complained in ecs-agent.log:

Unable to register as a container instance with ECS: InvalidParameterException: Arguments on this idempotent request are inconsistent with arguments used in previous request(s).\n\tstatus code: 400

Background

We create the custom AMI using HashiCorp Packer with the “amazon-ebs” builder, so when it starts an instance from the ECS Optimised AMI to customise it, the ecs-agent starts and runs briefly (without permissions to connect to a valid ECS cluster). Note that a general system update is also executed, which upgraded the ecs-agent to the latest version, 1.48.0.

Previously this didn’t cause any problems, but I suspect the code in https://github.com/aws/amazon-ecs-agent/pull/2708 causes the ecs-agent state to contain a new UUID, which is then persisted into the AMI. That lets exactly one new instance join the intended cluster, but no others (for a while).

We’ve fixed this by improving our AMI customisation scripts to shut down the ecs-agent service and clean its state (rm -rf /var/lib/ecs/data/*), so the agent only starts once it is fully configured on a real instance.
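
For reference, a minimal sketch of the cleanup we now run as the final Packer shell provisioner before the AMI snapshot is taken; the service name and paths match the Amazon Linux 2 ECS-optimized AMI, and the disable step is just one way to keep the agent from starting before it is configured:

    #!/usr/bin/env bash
    # cleanup-ecs-state.sh - illustrative last-step provisioner run before Packer snapshots the AMI.
    set -euo pipefail

    # Stop the agent so nothing rewrites its state directory after we clear it.
    sudo systemctl stop ecs

    # Remove the state persisted while the build instance was running,
    # including the registration token that PR #2708 started storing there.
    sudo rm -rf /var/lib/ecs/data/*

    # Optionally keep the agent disabled in the image; real instances can enable and
    # start it (e.g. via user data) once /etc/ecs/ecs.config is fully populated.
    sudo systemctl disable ecs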

Expected Behavior

Custom AMIs reliably join ECS clusters

Observed Behavior

Custom AMIs now need extra steps to work reliably. I fear a number of other people/companies will be using a similar pattern and so will hit the same issue… it would be nice if the UUID were not created while the ecs-agent is in a vanilla, unconfigured state.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 31
  • Comments: 23 (7 by maintainers)

Most upvoted comments

We hear you; we are currently working on rolling back this change.

I would probably recommend that anyone continue to clear the state file when creating an AMI, though we will work on a solution to the RegisterContainerInstance token issue we were fixing in #2708 that doesn’t affect this use case.

Apologies again for the bug; FWIW, if we had known it would be a breaking change we would have notified and communicated this better. This is not a use case that we test internally, but I can assure you we are considering doing that.

Thanks @awsvikrant for the workaround, but honestly it is not a scalable option. If I have an autoscaling scenario, new instances will never join my cluster unless I remove the data files. I suggest removing the “workaround available” tag and fixing the issue ASAP. @sharanyad

Personally I think you should pull this release, or release a new one with a revert of the whole PR. Anyone who customises the AMI to make another AMI will hit this unless they run a proper Netflix-style bakery, which IMHO is rare.

Luckily we only lost some non-critical services & wasted a small amount of money before we noticed the issues, others might not be so lucky.

@sparrc the very fact that the base AMI with agent version 1.47 worked and 1.48 didn’t is enough to say it’s a breaking change. However, if this is going to be the new normal, can you publish it in the documentation/README/release notes along with common examples, say for Packer (amazon-ebs builder)? It should then be a major version, say 2.x, clearly highlighting the backward-incompatible change.

Please revert this. It caused massive problems for us today.

I don’t think the workaround is suitable to have in our pipelines. I assume most people are under the belief that the ECS agent will always work if the service is on, and aren’t checking for the presence of state data.

We were able to reproduce the issue. We recommend that anyone creating an AMI from the ECS-optimized AMI stop the ECS agent and then remove its data directory contents (sudo systemctl stop ecs && sudo rm -rf /var/lib/ecs/data/*) before creating the AMI. If this workaround is not workable for any of you, could you please be more specific as to why it does not work?

Regardless of the recently merged PR, it is not intended that a single agent state file is persisted across multiple instances, and more fields may be added in the future that could also break using the file across instances. If an AMI is created with a state file in place, then instances launched from that AMI will contain copies of that state file and would be subject to the issue at hand.

One thing to note is that the first instance launched from the AMI with the state file will be able to join the cluster, so this should explain why some may have noticed that the instances do sometimes join the cluster.

EDIT: to clarify, @tyrken is correct: this issue began with PR #2708 and is related to the persistence of the RegisterContainerInstance token in the state file, which is why we are suggesting this state file be cleared before creating the AMI.

This is affecting us as well and has caused a lot of headaches today. Slightly ridiculous that this was released without notifying people that it is a breaking change.

Hi AWS, I think we need to resolve this ASAP.

I think we are among many of the organisations that do scheduled AMI builds and push them to prod after some basic tests. This will affect many prod environments in the next few days if we don’t act promptly.

What’s worse, I’ve been fighting with this for the last two days, and from my experiments, in some cases the AMI built from the ECS base AMI will be able to register and join the correct ECS cluster. This means that some people will only hit this in prod, and that would be a big surprise for them.

My suggestion is to pull this release unless we are confident that a bugfix release is imminent. We also need to release a new ECS base AMI in SSM, as I believe that’s where most people pick up this release.

@an-sush did you just jump the gun? I don’t see it updated yet: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-versions.html#ecs-ami-versions-linux

Probably just a delay in that page being updated. The new AMIs (amzn2-ami-ecs-hvm-2.0.20201125-x86_64-ebs) are already available via the SSM query: aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended
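
For anyone scripting their AMI pipeline off that parameter, a hedged sketch of resolving the recommended image ID (this assumes the AWS CLI is configured and jq is installed; the parameter value is a JSON document that includes an image_id field):

    #!/usr/bin/env bash
    # Sketch: look up the latest recommended ECS-optimized AMI ID from SSM.
    set -euo pipefail

    # The parameter's value is a JSON document describing the recommended AMI.
    ami_id=$(aws ssm get-parameters \
      --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended \
      --query 'Parameters[0].Value' \
      --output text | jq -r '.image_id')

    echo "Latest recommended ECS-optimized AMI: ${ami_id}"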

We have released a new AMI with agent version 1.48.1. It reverts the change from #2708 that caused the issue. Resolving this issue.

@sparrc when I add the workaround to our Packer build process, the EC2 instances are able to register with ECS. This leads to a new problem though, where all of the containers load onto one EC2 instance when they should spread out across the cluster.

@sparrc is there a timeline for the rollback? Shouldn’t it be as easy as reverting the commit and releasing it? Or are we still planning to push a change that isn’t unit tested by your CI process? It caused a massive outage in our prod this week. I can’t imagine Amazon releasing things which are not tested.

+1, I am also facing this issue in a custom AMI created on top of the ECS-optimized AMI (amzn2-ami-ecs-hvm-2.0.20201119-x86_64-ebs). To fix the problem I also had to stop the ecs service and run rm -rf /var/lib/ecs/data/*.