amazon-ecs-agent: Tasks get frequently stuck in "PENDING" state since updating from v1.17.0 to v1.17.2
Summary
Tasks are occasionally stuck (~1/7 or ~1/8 roughly) in a PENDING state on the ECS console.
Description
Ever since updating our ECS-optimized AMI from amzn-ami-2017.09.h-amazon-ecs-optimized
to amzn-ami-2017.09.j-amazon-ecs-optimized
(and therefore, the ECS agent from v1.17.0
to v1.17.2
), tasks apparently randomly get stuck in a PENDING state on the ECS console. The task itself that is trying to start does not appear to be relevant, as it happens on multiple unrelated services that were previously running smoothly, and on multiple AWS accounts where the only change in common was the AMI update.
Going onto any instance where a PENDING task is supposed to be starting on and running a docker ps -a
shows no sign of the task having ever started, though other tasks that started up successfully are there.
Manually stopping the task and letting ECS re-create a new task does usually fix the issue, though sometimes the new task can also get stuck, even if it is on a different instance.
Expected Behavior
Tasks are able to move from a “PENDING” state to a “RUNNING” state consistently
Observed Behavior
They do not.
Environment Details
AWS region - eu-west-1
Spot fleet instances of various sizes.
Supporting Log Snippets
The following aws agent logs are related to a stuck task with id: 0341c73f-d125-4920-bbe2-2ceb7b3fe00a
, which the console suggests was created at 2018-03-13 14:45:52 +0000
.
A look into other instances with the new AMI show similar sets of warnings and errors, but I’m happy to provide more if it helps figure out what is going on!
2018-03-13T14:45:42Z [INFO] Connected to TCS endpoint
2018-03-13T14:45:42Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:45:42Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:45:42Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:45:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:45:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:45:56Z [INFO] Saving state! module="statemanager"
2018-03-13T14:45:56Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/0341c73f-d125-4920-bbe2-2ceb7b3fe00a]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2018-03-13T14:45:56Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/0341c73f-d125-4920-bbe2-2ceb7b3fe00a]: waiting for any previous stops to complete. Sequence number: 1200
2018-03-13T14:45:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:46:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:02Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:46:02Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:46:02Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:46:02Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:46:06Z [INFO] Saving state! module="statemanager"
2018-03-13T14:46:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:22Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:46:22Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:46:22Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:46:22Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:46:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:46:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:46:27Z [INFO] Saving state! module="statemanager"
2018-03-13T14:46:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: Cgroup resource set up for task complete
2018-03-13T14:46:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: stopping container [FelixContainer]
2018-03-13T14:46:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:37Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-03-13T14:46:37Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-03-13T14:46:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:46:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:46:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:47:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:47:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:47:27Z [INFO] Saving state! module="statemanager"
2018-03-13T14:47:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: Cgroup resource set up for task complete
2018-03-13T14:47:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: stopping container [FelixContainer]
2018-03-13T14:47:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:39Z [INFO] Connected to TCS endpoint
2018-03-13T14:47:39Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:47:39Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:47:39Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:47:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:48Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/a58439f2-352b-4e7e-b99d-01a0a17521aa]: redundant container state change. ArticleMetadataAPI to RUNNING, but already RUNNING
2018-03-13T14:47:48Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/1beac7a2-93d6-4665-8fb4-5220f654fbc6]: redundant container state change. ElasticContainer to RUNNING, but already RUNNING
2018-03-13T14:47:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:47:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:47:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:47:59Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:47:59Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:47:59Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:47:59Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:48:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:19Z [WARN] Error getting cpu stats, err: No data in the queue, container: 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd
2018-03-13T14:48:19Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c
2018-03-13T14:48:19Z [WARN] Error getting cpu stats, err: No data in the queue, container: b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07
2018-03-13T14:48:19Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2018-03-13T14:48:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:48:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:48:27Z [INFO] Saving state! module="statemanager"
2018-03-13T14:48:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: Cgroup resource set up for task complete
2018-03-13T14:48:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: stopping container [FelixContainer]
2018-03-13T14:48:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:31Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:34Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1]
2018-03-13T14:48:34Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-03-13T14:48:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:41Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:48:51Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:48:57Z [INFO] DockerGoClient: error stopping container b7a94649806e335642cb9756bd31c5b8f4b6e6fddd92617c85e0e609f56c2a07: context deadline exceeded
2018-03-13T14:49:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:49:01Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:49:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:49:11Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:49:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 461e7c97a02968d5fee4da5bbf4daf3043c973496590dabaab6ab5397cb479fd: inactivity time exceeded timeout
2018-03-13T14:49:21Z [INFO] DockerGoClient: Unable to retrieve stats for container 8a4439f914fa9c537871977b911cd807f4c10446f96ad47796d199983b78763c: inactivity time exceeded timeout
2018-03-13T14:49:27Z [INFO] Task engine [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: error transitioning container [FelixContainer] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:49:27Z [INFO] Managed task [arn:aws:ecs:eu-west-1:<ACCOUNT_ID>:task/7cdd5e35-53df-40f3-ab60-d6a2f71ba490]: 'DockerTimeoutError' error stopping container [FelixContainer]. Ignoring state change: Could not transition to stopped; timed out after waiting 1m0s
2018-03-13T14:49:27Z [INFO] Saving state! module="statemanager"
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (7 by maintainers)
Are there any plans to release a new, official AMI that downgrades the docker version but keeps the new ecs agent version until a proper fix can be found?
The latest ECS Optimized AMI (2017.09.j) includes an update to the Docker daemon bumping it to
17.12.0-ce
. The Docker upstream maintainers are aware of the issue and are tracking it in https://github.com/moby/moby/issues/35933 and they are cutting a release to address the bug.Edit: to clarify, the upstream has what looks like a partial fix, we will need to test and confirm.