amazon-ecs-agent: Error getting message from ws backend

Summary

ECS agent disconnects under heavy load.

Description

When I put my ECS instance under high load, for example by scaling my container instances from 2 to 12, the ECS agent disconnects with the following errors:

2018-03-12T22:58:52Z [DEBUG] ACS activity occurred
2018-03-12T22:58:52Z [WARN] Unable to extend read deadline for ACS connection: set tcp 10.0.2.236:52020: use of closed network connection
2018-03-12T22:58:52Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.0.2.236:52020: use of closed network connection for https://ecs-a-1.eu-west-1.amazonaws.com/ws?agentHash=edc3e260&agentVersion=1.17.2&clusterArn=redacted&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%3A936492824651%3Acontainer-instance%2Fca59a874-33ae-484b-9f51-654c5940b037&dockerVersion=DockerVersion%3A+17.12.0-ce&sendCredentials=true&seqNum=1
2018-03-12T22:58:52Z [ERROR] Error getting message from ws backend: error: [read tcp 10.0.2.236:52020->54.239.32.153:443: use of closed network connection], messageType: [-1]

After that it’s marked as Agent Connected: False in the ECS console until I restart the instance.

I’ve got debug logs if you need more info.
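
For anyone checking their own agent logs for the same symptom, here is a minimal Python sketch that picks out the lines carrying the "use of closed network connection" signature. It assumes the `<timestamp> [LEVEL] message` layout seen in the excerpt above; the function name and the sample list are hypothetical, and this is not part of the agent.

```python
import re

# Assumed log layout, based on the excerpt above: "<timestamp> [LEVEL] message"
LINE_RE = re.compile(r"^(\S+) \[(\w+)\] (.*)$")
SIGNATURE = "use of closed network connection"


def find_disconnects(lines):
    """Return (timestamp, level, message) tuples for lines containing the signature."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and SIGNATURE in match.group(3):
            hits.append(match.groups())
    return hits


# Sample lines copied from the report above
sample = [
    "2018-03-12T22:58:52Z [DEBUG] ACS activity occurred",
    "2018-03-12T22:58:52Z [WARN] Unable to extend read deadline for ACS connection: "
    "set tcp 10.0.2.236:52020: use of closed network connection",
    "2018-03-12T22:58:52Z [ERROR] Error getting message from ws backend: error: "
    "[read tcp 10.0.2.236:52020->54.239.32.153:443: use of closed network connection], "
    "messageType: [-1]",
]

for timestamp, level, message in find_disconnects(sample):
    print(timestamp, level, message[:50])
```

To scan a real log, pass an open file object instead of `sample`; the path to the agent log depends on your setup.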

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 3
  • Comments: 28 (14 by maintainers)

Most upvoted comments

Hi @KIVagant, I have shared the pre-release build of the ECS agent with your account and sent you instructions for using/installing the same over email. Please let us know if that, along with ECS_RESERVED_MEMORY alleviates the issue that you’re running into.

Thanks, Anirudh

100MB seems like a good value. But please note that it’s also dependent on your use case (how many out-of-band daemons/processes are running on your instance, how many resources they consume, etc.) and on the instance type.
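
As a sketch of how that value could be applied: `ECS_RESERVED_MEMORY` is an agent configuration variable whose value is in MiB. The fragment below assumes the agent reads its settings from `/etc/ecs/ecs.config`, as on the ECS-optimized AMI, and that the file is in place before the agent starts. The `100` is simply the figure discussed above, not a general recommendation.

```
# /etc/ecs/ecs.config (assumed location)
# Memory, in MiB, kept back from task placement for processes outside ECS
ECS_RESERVED_MEMORY=100
```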

Hi @combor, @KIVagant, now that #1310 is merged, if you want to deploy an agent build containing this fix in your test cluster to verify whether it helps with the disconnect issue, please send your account IDs and the region where your cluster is deployed to aithal at amazon dot com. We can share a pre-release build of the ECS agent with you, which you can deploy in your test setup to validate the fix.

Thanks, Anirudh

Thanks @sharanyad

A brief background about my setup might also be helpful. The cluster consists of two EC2 instances, and I’ve got 7 services, each with two instances of a task. If I scale one of the services to 12, it starts them, but after a while all services in the entire cluster are reported as unhealthy (by the ALB) and killed. Later they stay in the PENDING state. I’ve got another set of logs, but this time I waited 15 minutes for reconnection. There’s no “reconnection to ACS” message at all, but the error is:

2018-03-13T20:53:04Z [INFO] Error from tcs; backing off: websocket: close 1002 (protocol error): Channel long idle: No message is received, close the channel
2018-03-13T20:53:15Z [WARN] Error disconnecting: write tcp 10.0.0.141:40894->54.239.36.166:443: i/o timeout

and some nice golang printf 😃

2018-03-13T20:59:05Z [DEBUG] Managed task [arn:aws:ecs:eu-west-1:936492824651:task/3df034ea-246c-422b-bce9-a678f427ff39]: handling container change [{NONE { <nil> [] Could not transition to inspecting; timed out after waiting 30s map[] map[] 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC 0001-01-01 00:00:00 +0000 UTC {UNKNOWN <nil> 0 }} ContainerStatusChangeEvent}] for container [productpayments]

Which email shall I use to send logs?