amazon-ecs-agent: Task stuck in pending state

Whenever I push a new task to ECS, the new tasks hang in a PENDING state. If I SSH in and restart the ECS agent, they transition to RUNNING and the previous tasks transition to STOPPED. Subsequent deploys work, but after some period of time deploys cease to work again and the process has to be repeated. In the ECS agent Docker logs I see the following warnings:

Anything relevant in the ECS Agent logs (located at /var/log/ecs):

2017-03-13T17:56:06.823635005Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: 6aac647a4737cc592852b4fa3dea0e2670eba47cac12e6783a8225093be66876
2017-03-13T17:56:06.823646905Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: b367545a0e99cf3e834b446772e2fe3828c1e09150c0b6645dcb15278615a64c
2017-03-13T17:56:06.823650772Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: c723a3bc0745f908e330ac04f6cfe35c5e3dbb25d3706a3a9ac0b94f223de178
2017-03-13T17:56:06.823654194Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: 08f87808b72a246c154b271b3309240abda623330b289f409aa16fafe468e440
2017-03-13T17:56:06.823657545Z 2017-03-13T17:56:06Z [WARN] Error getting cpu stats, err: No data in the queue, container: c3ab3d3cadbed5587ffb2a3a31da89266cbc1dfb74ff211431baa93afc244159
2017-03-13T17:56:06.823660881Z 2017-03-13T17:56:06Z [WARN] Error getting instance metrics: No task metrics to report
2017-03-13T17:56:40.274042585Z 2017-03-13T17:56:40Z [INFO] No eligible images for deletion for this cleanup cycle
2017-03-13T17:56:40.276963819Z 2017-03-13T17:56:40Z [INFO] End of eligible images for deletion

Version Info

{
  "agentVersion": "1.14.0",
  "agentHash": "f88e52e",
  "dockerVersion": "DockerVersion: 1.12.6"
}

The agentConnected value when you call DescribeContainerInstances for that instance: "agentConnected": true,
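If it helps anyone reproduce this check, I looked it up roughly like this (the cluster name and container instance ARN are placeholders):

# Check agent connectivity for the container instance
aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances <container-instance-arn> \
  --query 'containerInstances[0].agentConnected'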

Output of docker info

Containers: 22
 Running: 11
 Paused: 0
 Stopped: 11
Images: 8
Server Version: 1.12.6
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: ext4
 Data file:
 Metadata file:
 Data Space Used: 6.992 GB
 Data Space Total: 23.35 GB
 Data Space Available: 16.36 GB
 Metadata Space Used: 2.765 MB
 Metadata Space Total: 62.91 MB
 Metadata Space Available: 60.15 MB
 Thin Pool Minimum Free Space: 2.335 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.93-RHEL7 (2015-01-28)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge overlay null
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options:
Kernel Version: 4.4.19-29.55.amzn1.x86_64
Operating System: Amazon Linux AMI 2016.09
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: ip-192-168-8-6
ID: 335R:7NKS:FOKJ:ESGT:B3WV:SS5P:P5G6:NPCP:65T3:EZF5:YYH6:WF33
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8

Anything relevant in the Docker daemon log (located at /var/log/docker):

time="2017-03-13T18:17:27.851502907Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:34:48.525604354Z" level=info msg="Container b20a0a06cf43ea625ae2e92b0fcf580a9bca4b163fd519300fde7d4542b99a13 failed to exit within 1200 seconds of signal 15 - using the force"
time="2017-03-13T18:34:48.527962030Z" level=info msg="Container e092600d8e66de0da1255c681fc79827f9439897681ec4178622ab2b2182904f failed to exit within 1200 seconds of signal 15 - using the force"
time="2017-03-13T18:34:48.534819749Z" level=info msg="Container 9a15d3f8504c331af47be01a0af8477acb3d86487694ae2f299b58f275f9ab42 failed to exit within 1200 seconds of signal 15 - using the force"
time="2017-03-13T18:34:48.977125171Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:34:48.977972016Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:34:48.979214737Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:53:39.928005206Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:54:55.906842370Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"
time="2017-03-13T18:54:59.642429960Z" level=error msg="Handler for GET /containers/undefined/json returned error: No such container: undefined"

Output of docker inspect <container> where <container> is the container that appears to be stuck: The Docker container does not show up at all. If I call curl 127.0.0.1:51678/v1/tasks, I see the task in PENDING with a desired state of RUNNING, and the dockerId is null.
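For reference, the agent introspection endpoint can be queried directly; the output below is only an illustrative sketch of what I see for a stuck task (the task ARN, family, and container name are placeholders):

# Query the ECS agent introspection API for task state
curl -s 127.0.0.1:51678/v1/tasks
# Illustrative (placeholder) response for a stuck task:
# {"Tasks":[{"Arn":"arn:aws:ecs:<region>:<account>:task/<task-id>",
#            "DesiredStatus":"RUNNING","KnownStatus":"PENDING",
#            "Family":"my-service","Version":"3",
#            "Containers":[{"DockerId":"","DockerName":"","Name":"app"}]}]}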

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 39 (13 by maintainers)

Most upvoted comments

I am also seeing this issue fairly consistently now with ECS agent v1.14.3 and Docker v17.03.1-ce.

I see that the agent is able to pull the new image, but it doesn’t seem to start it and will not kill the older one.

@aaithal Thank you for the response! I’m sorry I missed that part; burst balance is indeed exhausted on that volume. We use quite a large gp2 volume at the moment to get the throughput, so it’s probably better to try dedicated IOPS. We’ve experimented with various storage drivers over the last few years, all with different issues, and now use overlay2 to avoid inode exhaustion. Thanks again.

@adnxn I’ve sent you the requested items. Thanks in advance!

Hi @coryodaniel, apologies for the late response here. Thank you for sending these logs. To explain in detail what’s happening: the behavior you described in your first comment when you created the issue is how the ECS Agent is supposed to function:

Whenever I push a new task to ECS, the new tasks hang in a PENDING state.

Whenever the ECS Agent receives new tasks to start from the ECS backend (PENDING -> RUNNING), and it also has other tasks to stop, it won’t start the new tasks until those old tasks have stopped. This ensures that resources are actually available to the containers in the new task when they come up; not waiting for containers of tasks in the RUNNING -> STOPPED transition could lead to incorrect allocation of resources on the host.

In the logs that you sent, the agent was told to stop 4 running/pending tasks and start 4 new tasks, and it was waiting for the running tasks to stop before starting new ones. The real issue is that the Docker daemon is taking a very long time to stop these containers, which is why there’s a delay in stopping old tasks and, subsequently, a delay in starting new tasks. It’s not clear from the Docker logs why stopping these containers is so slow. The most common cause of such slowness that we’ve seen is an EBS volume running out of I/O credits; the BurstBalance CloudWatch metric provides visibility into the credit balance of gp2 volumes (see the example below). If that’s not the root cause, then running the Docker daemon in debug mode and sending us those logs when this happens again would be the best course of action to help us diagnose this further.
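As a rough example (the volume ID and time window are placeholders, and this assumes the stock Amazon Linux 1 init scripts), you could check the credit balance and turn on daemon debug logging like this:

# Check the gp2 credit balance of the volume backing /var/lib/docker
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=<volume-id> \
  --start-time 2017-03-13T00:00:00Z --end-time 2017-03-14T00:00:00Z \
  --period 300 --statistics Average

# Enable Docker daemon debug logging, then restart the daemon
# (note: restarting Docker stops the containers running on the instance)
echo 'OPTIONS="${OPTIONS} --debug"' | sudo tee -a /etc/sysconfig/docker
sudo service docker restart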

If I SSH in and restart the ECS agent, they transition to RUNNING and the previous tasks transition to STOPPED.

This is actually a bug in the ECS Agent: new tasks are being started before the tasks that are supposed to be stopped have actually stopped.