amazon-ecs-agent: Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database
Summary
Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database.
Description
Last week we ran into an issue with ECS where tasks would fail to start. There was no information in the ECS console other than that the task failed to start, but upon checking the ECS agent logs on the EC2 host under /var/log/ecs/ecs-agent.log I saw the following:
level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet\nadd network failed
This was despite IP addresses being available in the subnet. A reboot of the server temporarily fixed the problem.
Looking more into the logs I also saw the following:
level=info time=2023-02-04T00:41:43Z msg="Managed task has reached stopped; waiting for container cleanup" task="fc50c79e949945a8855d63d2f3c27524"
level=info time=2023-02-04T00:41:43Z msg="Cleaning up the network namespace" task="fc50c79e949945a8855d63d2f3c27524" container="~internal~ecs~pause"
level=error time=2023-02-04T00:41:43Z msg="Delete network failed: failed to Statfs \"/host/proc/0/ns/net\": no such file or directory" module=plugin.go
level=error time=2023-02-04T00:41:43Z msg="Unable to cleanup pause container network namespace" task="fc50c79e949945a8855d63d2f3c27524" error="delete network failed: failed to Statfs \"/hosfile or directory"
level=info time=2023-02-04T00:41:43Z msg="IPAM releasing ip for task eni" task="fc50c79e949945a8855d63d2f3c27524"
level=error time=2023-02-04T00:41:43Z msg="Failed to release ip; IPAM error" task="fc50c79e949945a8855d63d2f3c27524" error="required env variables [CNI_CONTAINERID] missing"
This appears to be the cause of the IP address error above: the ECS agent is failing to release IPs from its IPAM database and is not deleting stopped containers' network namespaces.
I tried reading through the code to determine why this became a problem last week, but nothing stood out to me except that the container ID seems to be null or 0, leading to the missing CNI_CONTAINERID error and the odd /host/proc/0/ns/net path (why would it be checking PID 0?). It seems like this newer version fails to get the container ID and then fails to clean up the network resources of the pause container? Or something like that; this is where my knowledge of the ECS agent becomes too hazy.
I also noticed that the ECS agent was updated from version 1.68.0 to 1.68.1, and the changelog specifically noted that the CNI plugins were updated, which is the problematic area. Downgrading the ECS agent to v1.68.0 made the above log errors disappear, so I concluded that the issue must be the CNI plugin update in v1.68.1, which was published to Docker Hub last week.
Expected Behavior
IP addresses should be released and tasks should start
Observed Behavior
IP addresses are not released and tasks will eventually fail to start.
Environment Details
Docker version: 23.0.0
OS: Ubuntu (kernel 5.15.0-1028-aws)
Supporting Log Snippets
level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet\nadd network failed
level=info time=2023-02-04T00:41:43Z msg="Managed task has reached stopped; waiting for container cleanup" task="fc50c79e949945a8855d63d2f3c27524"
level=info time=2023-02-04T00:41:43Z msg="Cleaning up the network namespace" task="fc50c79e949945a8855d63d2f3c27524" container="~internal~ecs~pause"
level=error time=2023-02-04T00:41:43Z msg="Delete network failed: failed to Statfs \"/host/proc/0/ns/net\": no such file or directory" module=plugin.go
level=error time=2023-02-04T00:41:43Z msg="Unable to cleanup pause container network namespace" task="fc50c79e949945a8855d63d2f3c27524" error="delete network failed: failed to Statfs \"/hosfile or directory"
level=info time=2023-02-04T00:41:43Z msg="IPAM releasing ip for task eni" task="fc50c79e949945a8855d63d2f3c27524"
level=error time=2023-02-04T00:41:43Z msg="Failed to release ip; IPAM error" task="fc50c79e949945a8855d63d2f3c27524" error="required env variables [CNI_CONTAINERID] missing"
About this issue
- Original URL
- State: closed
- Created a year ago
- Reactions: 6
- Comments: 18 (13 by maintainers)
1.68.2 Agent and AMIs have been published with the CNI rollback. I confirmed by pulling the latest parameters from us-east-1:
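A minimal sketch of that check, using the standard recommended ECS Optimized Amazon Linux 2 AMI parameter path:

# Query the recommended ECS Optimized AMI parameter in us-east-1 and pretty-print its JSON value;
# the ecs_agent_version field should now read 1.68.2.
aws ssm get-parameters \
  --region us-east-1 \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended \
  --query 'Parameters[0].Value' \
  --output text | python3 -m json.tool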
We have reverted the CNI plugins update in this PR: https://github.com/aws/amazon-ecs-agent/pull/3565 and are currently working to release it.
Agent 1.68.2 will have a fix for this once it’s ready.
While the above release is still pending, you can update multiple instances in your cluster at once (or just one) with the SSM send-command API. This assumes you have the SSM Agent installed and running on your hosts; it is installed by default in the ECS Optimized AMI. You'll want to find all the instance IDs in your ECS cluster. These will be your targets:
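A sketch of that lookup and of the send-command invocation; the cluster name, region, and instance IDs are placeholders, and the commands simply mirror the in-place downgrade steps described further down:

# Resolve the cluster's container instances to EC2 instance IDs (cluster name is a placeholder).
ARNS=$(aws ecs list-container-instances --cluster my-cluster \
  --query 'containerInstanceArns[]' --output text)
aws ecs describe-container-instances --cluster my-cluster --container-instances $ARNS \
  --query 'containerInstances[].ec2InstanceId' --output text

# Run the downgrade on those instances (instance IDs are placeholders).
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --instance-ids i-0123456789abcdef0 i-0fedcba9876543210 \
  --parameters '{"commands":["systemctl stop ecs","docker pull amazon/amazon-ecs-agent:v1.68.0","docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest","systemctl start ecs"]}'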
Check the output of the above invocation to be sure it has succeeded.
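A sketch of that check (the command ID is a placeholder for the CommandId returned by send-command):

# Show the per-instance status of the send-command invocation.
aws ssm list-command-invocations \
  --command-id "11111111-2222-3333-4444-555555555555" \
  --details \
  --query 'CommandInvocations[].{Instance:InstanceId,Status:Status}'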
As a mitigation, you can downgrade your ECS Agent in place. First, stop the ecs service:
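On the ECS Optimized AMI (Amazon Linux 2), where the agent is managed by the ecs systemd unit, that would be:

# Stop the ECS agent so its image can be swapped out.
sudo systemctl stop ecs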
Then you can pull the previous agent image either from Docker Hub or from Public ECR:
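For example, pulling v1.68.0 (the last release before the CNI plugin update) from whichever registry you prefer:

# Pull the previous agent release from Docker Hub...
docker pull amazon/amazon-ecs-agent:v1.68.0
# ...or the same release from Amazon's Public ECR.
docker pull public.ecr.aws/ecs/amazon-ecs-agent:v1.68.0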
Tag that image as latest:
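A sketch, assuming you pulled from Docker Hub and that ecs-init launches the amazon/amazon-ecs-agent:latest image (the default on the ECS Optimized AMI); if you pulled from Public ECR, tag that image instead:

# Re-tag the downgraded image as latest so the ecs service picks it up on start.
docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest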
Now start ecs again and check the running agent version (via the agent introspection API):
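Again assuming a systemd-managed host; the introspection API listens on port 51678 on the instance:

# Start the agent and read the Version field from the introspection metadata endpoint.
sudo systemctl start ecs
curl -s http://localhost:51678/v1/metadata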
This should now show the agent running v1.68.0.