amazon-ecs-agent: Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database

Summary

Tasks fail to start with version 1.68.1 due to failure to release IP addresses from IPAM database.

Description

Last week we ran into an issue with ECS where tasks would fail to start. There was no information in the ECS console other than that the task failed to start, but upon checking the ECS agent logs on the EC2 host under /var/log/ecs/ecs-agent.log I saw the following:

level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet\nadd network failed

This was despite IP addresses being available in the subnet. A reboot of the server temporarily fixed the problem.

Looking more into the logs I also saw the following:

level=info time=2023-02-04T00:41:43Z msg="Managed task has reached stopped; waiting for container cleanup" task="fc50c79e949945a8855d63d2f3c27524"
level=info time=2023-02-04T00:41:43Z msg="Cleaning up the network namespace" task="fc50c79e949945a8855d63d2f3c27524" container="~internal~ecs~pause"
level=error time=2023-02-04T00:41:43Z msg="Delete network failed: failed to Statfs \"/host/proc/0/ns/net\": no such file or directory" module=plugin.go
level=error time=2023-02-04T00:41:43Z msg="Unable to cleanup pause container network namespace" task="fc50c79e949945a8855d63d2f3c27524" error="delete network failed: failed to Statfs \"/hosfile or directory"
level=info time=2023-02-04T00:41:43Z msg="IPAM releasing ip for task eni" task="fc50c79e949945a8855d63d2f3c27524"
level=error time=2023-02-04T00:41:43Z msg="Failed to release ip; IPAM error" task="fc50c79e949945a8855d63d2f3c27524" error="required env variables [CNI_CONTAINERID] missing"

This appears to be the cause of the IP address error above: the ECS agent is failing to release IPs from its IPAM database and is not deleting stopped containers' network namespaces.

I tried reading through the code to determine why this became a problem last week, but nothing stood out to me except that the ContainerID seems to be null or 0, leading to the missing CNI_CONTAINERID error and the odd /host/proc/0/ns/net path (why would it be checking PID 0?). It seems like this newer version has a problem getting the container ID and then fails to clean up the network resources of the pause container, or something like that; this is where my knowledge of the ECS agent becomes too hazy.
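For anyone trying to confirm they are affected, a rough check on a host is to grep the agent log for the two errors above and watch whether Docker's network-namespace directory keeps growing as tasks churn (the /var/run/docker/netns path is an assumption about a default Docker install, not something ECS-specific):

# count IPAM release failures and pause-container cleanup failures in the agent log
sudo grep -c "Failed to release ip" /var/log/ecs/ecs-agent.log
sudo grep -c "Unable to cleanup pause container network namespace" /var/log/ecs/ecs-agent.log

# Docker normally keeps network-namespace bind mounts here; a count that only
# grows while tasks start and stop suggests namespaces are being leaked
sudo ls /var/run/docker/netns | wc -l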

I also noticed that the ECS agent was updated from version 1.68.0 to 1.68.1, and the changelog specifically noted that the CNI plugins were updated, which is exactly the problematic area. Downgrading the ECS agent we are running to v1.68.0 made the above errors disappear from the log, so I concluded that the issue must be with the CNI plugin update in v1.68.1, which was published to Docker Hub last week.
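For reference, this is roughly how I verified which agent version an instance was actually running after the downgrade (the container name ecs-agent is what ecs-init uses by default; adjust it if your setup runs the agent differently):

# image the running agent container was started from
docker inspect ecs-agent --format '{{.Config.Image}}'

# or ask the agent itself via its introspection endpoint
curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool | grep '"Version"'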

Expected Behavior

IP addresses should be released and tasks should start.

Observed Behavior

IP addresses are not released and tasks will eventually fail to start.

Environment Details

Docker version: 23.0.0
OS: Ubuntu (kernel 5.15.0-1028-aws)

Supporting Log Snippets

level=warn time=2023-02-03T19:13:12Z msg="Error starting/provisioning container[%s (Runtime ID: %s)];" runtimeID="cd706915fe88145c0fcc8f9f512514930970beb6dd1aa8c58ebda062151f16cb" error="container resource provisioning: failed to setup network namespace: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet\nadd network failed
level=info time=2023-02-04T00:41:43Z msg="Managed task has reached stopped; waiting for container cleanup" task="fc50c79e949945a8855d63d2f3c27524"
level=info time=2023-02-04T00:41:43Z msg="Cleaning up the network namespace" task="fc50c79e949945a8855d63d2f3c27524" container="~internal~ecs~pause"
level=error time=2023-02-04T00:41:43Z msg="Delete network failed: failed to Statfs \"/host/proc/0/ns/net\": no such file or directory" module=plugin.go
level=error time=2023-02-04T00:41:43Z msg="Unable to cleanup pause container network namespace" task="fc50c79e949945a8855d63d2f3c27524" error="delete network failed: failed to Statfs \"/hosfile or directory"
level=info time=2023-02-04T00:41:43Z msg="IPAM releasing ip for task eni" task="fc50c79e949945a8855d63d2f3c27524"
level=error time=2023-02-04T00:41:43Z msg="Failed to release ip; IPAM error" task="fc50c79e949945a8855d63d2f3c27524" error="required env variables [CNI_CONTAINERID] missing"

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 6
  • Comments: 18 (13 by maintainers)

Most upvoted comments

1.68.2 Agent and AMIs have been published with the CNI rollback. I confirmed by pulling the latest parameters from us-east-1:

aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended --region us-east-1
...
"Value": "{\"ecs_agent_version\":\"1.68.2\",\"ecs_runtime_version\":\"Docker version 20.10.17\"

We have reverted the CNI plugins update in this PR: https://github.com/aws/amazon-ecs-agent/pull/3565 and are currently working to release it.

Agent 1.68.2 will have a fix for this once it’s ready.

While the above release is still pending, you can update multiple instances in your cluster at once (or just one) with the SSM send-command API. This assumes you have the SSM Agent installed and running on your hosts; it is installed by default in the ECS Optimized AMI. You'll want to find all the instance IDs in your ECS cluster (one way to list them is sketched after the check step below); these will be your targets:

aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --targets '[{"Key":"InstanceIds","Values":["instance-id-1", "instance-id-2"]}]' \
    --parameters '{"commands":["#!/bin/bash","systemctl stop ecs","docker pull amazon/amazon-ecs-agent:v1.68.0","docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest","systemctl start ecs","curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool"]}' \
    --output text

Check the output of the above invocation to be sure it has succeeded:

aws ssm list-command-invocations \
    --command-id "<command id from the above send-command>" \
    --details \
    --output text
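If you need to build the --targets list for the send-command above, one way to enumerate the EC2 instance IDs in a cluster (a sketch assuming a cluster named my-cluster and no more than 100 container instances) is:

CLUSTER=my-cluster   # replace with your cluster name

# list the container instance ARNs registered to the cluster, then resolve
# them to the EC2 instance IDs used in the --targets parameter
aws ecs list-container-instances --cluster "$CLUSTER" \
    --query 'containerInstanceArns[]' --output text \
  | xargs aws ecs describe-container-instances --cluster "$CLUSTER" \
      --query 'containerInstances[].ec2InstanceId' --output text \
      --container-instances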

As a mitigation, you can downgrade your ECS Agent in place. First, stop the ecs service:

sudo systemctl stop ecs

Then pull the previous agent image, either from Docker Hub or from Public ECR:

docker pull amazon/amazon-ecs-agent:v1.68.0

or

docker pull public.ecr.aws/ecs/amazon-ecs-agent:v1.68.0

Tag that image as latest:

docker tag amazon/amazon-ecs-agent:v1.68.0 amazon/amazon-ecs-agent:latest

Now start ecs again and check the running agent version (via the agent introspection API):

sudo systemctl start ecs
curl -s 127.0.0.1:51678/v1/metadata | python -mjson.tool

This should now show:

 "Version": "Amazon ECS Agent - v1.68.0