amazon-ecs-agent: ECS agents stops and starts the tasks

Summary

We are experiencing an issue with one of the ECS cluster deployed using cloud formation in our production network. We have a service which runs two tasks. One of the tasks running in a container instance is stopped by ECS agent and started after some time.

Description

In the log, you can see container is stopped

2019-06-20T18:09:36Z [INFO] Error from tcs; backing off: websocket client: unable to dial ecs-t-6.us-east-1.amazonaws.com response: : write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:38Z [INFO] Connected to ACS endpoint
2019-06-20T18:09:40Z [INFO] Saving state! module="statemanager"
2019-06-20T18:09:40Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: stopping container [DES-desg]

Can you please explain this behavior of ECS?

There is no change in container image definition and there are no errors in our docker container and the container was running fine, but this issue happens suddenly and consistently over the period of time.

Expected Behavior

ECS agent doesnt stop the container

Observed Behavior

ECS stops the container

Environment Details

Docker Server Version: 18.06.1-ce ECS Agent version 1.20.3

Supporting Log Snippets

2019-06-20T18:02:51Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2019-06-20T18:02:51Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2019-06-20T18:02:51Z [WARN] Error publishing metrics: write tcp 10.66.9.203:47724->10.51.177.169:8080: i/o timeout
2019-06-20T18:02:51Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:02:51Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2019-06-20T18:02:52Z [INFO] Establishing a Websocket connection to https://ecs-t-6.us-east-1.amazonaws.com/ws?cluster=DES-E1-USEZ&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a
2019-06-20T18:02:52Z [INFO] Connected to TCS endpoint
2019-06-20T18:02:52Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:03:52Z [ERROR] Error getting message from ws backend: error: [websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel], messageType: [-1] 
2019-06-20T18:04:22Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2019-06-20T18:04:58Z [WARN] Error publishing metrics: write tcp 10.66.9.203:36502->10.51.176.207:8080: i/o timeout
2019-06-20T18:05:26Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: task at steady state: RUNNING
2019-06-20T18:05:44Z [WARN] ACS Connection hasn't had any activity for too long; closing connection
2019-06-20T18:05:59Z [INFO] Disconnected from ACS
2019-06-20T18:06:13Z [INFO] Establishing a Websocket connection to https://ecs-t-6.us-east-1.amazonaws.com/ws?cluster=DES-E1-USEZ&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a
2019-06-20T18:06:35Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.66.9.203:44018: use of closed network connection for https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:06:49Z [ERROR] Stopping redundant reads on closed network connection: https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:06:56Z [WARN] Unable to extend read deadline for ACS connection: set tcp 10.66.9.203:44018: use of closed network connection
2019-06-20T18:07:00Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: redundant container state change. DES-desg to NONE, but already RUNNING
2019-06-20T18:07:00Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.66.9.203:44018: use of closed network connection for https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:07:01Z [WARN] DockerGoClient: inactivity time exceeded timeout while retrieving stats for container 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:07:02Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: task at steady state: RUNNING
2019-06-20T18:07:02Z [ERROR] Stopping redundant reads on closed network connection: https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:07:02Z [INFO] Reconnecting to ACS in: 253.312033ms
2019-06-20T18:07:19Z [INFO] Establishing a Websocket connection to https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:07:56Z [WARN] Error creating a websocket client: dial tcp: i/o timeout
2019-06-20T18:08:02Z [ERROR] Error connecting to ACS: websocket client: unable to dial ecs-a-6.us-east-1.amazonaws.com response: : dial tcp: i/o timeout
2019-06-20T18:08:10Z [INFO] Reconnecting to ACS in: 428.558079ms
2019-06-20T18:08:57Z [INFO] Establishing a Websocket connection to https://ecs-a-6.us-east-1.amazonaws.com/ws?agentHash=0821fbc7&agentVersion=1.25.2&clusterArn=DES-E1-USEZ&containerInstanceArn=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a&dockerVersion=DockerVersion%3A+18.06.1-ce&sendCredentials=false&seqNum=1
2019-06-20T18:09:36Z [WARN] Error creating a websocket client: write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:36Z [ERROR] Error connecting to TCS: websocket client: unable to dial ecs-t-6.us-east-1.amazonaws.com response: : write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:36Z [INFO] Error from tcs; backing off: websocket client: unable to dial ecs-t-6.us-east-1.amazonaws.com response: : write tcp 10.66.9.203:46356->10.51.176.5:8080: i/o timeout
2019-06-20T18:09:38Z [INFO] Connected to ACS endpoint
2019-06-20T18:09:40Z [INFO] Saving state! module="statemanager"
2019-06-20T18:09:40Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: stopping container [DES-desg]
2019-06-20T18:09:40Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/c838234c-e247-418d-9ffe-f3536231c187]: unable to create task state change event []: create task state change event api: status not recognized by ECS: NONE
2019-06-20T18:09:40Z [INFO] Managed task [arn:aws:ecs:us-east-1:361190373704:task/c838234c-e247-418d-9ffe-f3536231c187]: waiting for any previous stops to complete. Sequence number: 5
2019-06-20T18:09:41Z [INFO] Establishing a Websocket connection to https://ecs-t-6.us-east-1.amazonaws.com/ws?cluster=DES-E1-USEZ&containerInstance=arn%3Aaws%3Aecs%3Aus-east-1%3A361190373704%3Acontainer-instance%2F575a2c1d-ff75-4cfc-8643-1ac0481f2f3a
2019-06-20T18:09:43Z [INFO] Connected to TCS endpoint
2019-06-20T18:09:49Z [WARN] DockerGoClient: Unable to decode stats for container 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670: context canceled
2019-06-20T18:09:53Z [INFO] Saving state! module="statemanager"
2019-06-20T18:10:33Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:10:39Z [WARN] Error publishing metrics: stats engine: no task metrics to report
2019-06-20T18:10:50Z [INFO] Task engine [arn:aws:ecs:us-east-1:361190373704:task/63bae6e8-b975-4661-b67e-ae06b9e31327]: error transitioning container [DES-desg] to [STOPPED]: Could not transition to stopped; timed out after waiting 1m0s
2019-06-20T18:10:58Z [INFO] TCS Connection hasn't had any activity for too long; disconnecting
2019-06-20T18:11:06Z [INFO] DockerGoClient: error stopping container 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2019-06-20T18:11:13Z [WARN] Error getting cpu stats, err: No data in the queue, container: 8133d8c0371ab05231a9eefd8c27deeb54311c81be5c131670536cbbcf493670
2019-06-20T18:11:20Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.66.9.203:46394: use of closed network connection for https://ecs-t-6.us-east-1.amazonaws.com/

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 40 (9 by maintainers)

Most upvoted comments

I’m closing this as a duplicate; we are working this in mohit3081989’s ecs-init issue here: https://github.com/aws/amazon-ecs-init/issues/258

@dineshkm Its tricky to determine what is happening from the logs you are providing. There isn’t anything regarding the task lifecycle.

2019-03-19T05:41:09Z [ERROR] Error getting message from ws backend: error: [read tcp 10.72.72.180:50924->10.72.78.46:8080: use of closed network connection], messageType: [-1]

Lines like this typically don’t indicate that something is wrong. The websocket connections close and reset periodically. Agent itself won’t terminate tasks / containers if it loses a connection.

The earlier logs you provided (the connect timeouts) typically mean that there is a security group misconfigured. You said your config hasn’t changed – so we can investigate on our end and see if theres other potential ways to get that same error message.

Also, what about the containers logs? what do they say?

This is another place you should look. See if your application is exiting with any errors. Also, what is the reported exitCode when you call describe task on a stopped task?