amazon-ecs-agent: Error getting message from ws backend
Summary
ECS agent disconnects under heavy load.
Description
When I put my ECS instance under high load, for example by scaling my container instances from 2 to 12, the ECS agent disconnects with the following errors:
2018-03-12T22:58:52Z [DEBUG] ACS activity occurred
2018-03-12T22:58:52Z [WARN] Unable to extend read deadline for ACS connection: set tcp 10.0.2.236:52020: use of closed network connection
2018-03-12T22:58:52Z [WARN] Unable to set read deadline for websocket connection: set tcp 10.0.2.236:52020: use of closed network connection for https://ecs-a-1.eu-west-1.amazonaws.com/ws?agentHash=edc3e260&agentVersion=1.17.2&clusterArn=redacted&containerInstanceArn=arn%3Aaws%3Aecs%3Aeu-west-1%3A936492824651%3Acontainer-instance%2Fca59a874-33ae-484b-9f51-654c5940b037&dockerVersion=DockerVersion%3A+17.12.0-ce&sendCredentials=true&seqNum=1
2018-03-12T22:58:52Z [ERROR] Error getting message from ws backend: error: [read tcp 10.0.2.236:52020->54.239.32.153:443: use of closed network connection], messageType: [-1]
After that it’s marked as Agent Connected: False in the ECS console until I restart the instance.
I’ve got debug logs if you need more info.
About this issue
- State: closed
- Created 6 years ago
- Reactions: 3
- Comments: 28 (14 by maintainers)
Commits related to this issue
- wsclient:connection closed error in Set*Deadline() Handle "connection closed" error in SetReadDeadline and SetWriteDeadline methods. The strategy is to treat these errors as terminal errors for the c... — committed to aaithal/amazon-ecs-agent by aaithal 6 years ago
Hi @KIVagant, I have shared the pre-release build of the ECS agent with your account and sent you instructions for using/installing the same over email. Please let us know if that, along with ECS_RESERVED_MEMORY, alleviates the issue that you’re running into.
Thanks, Anirudh
100MB seems like a good value. But please note that it’s also dependent on your use case (how many out-of-band daemons/processes are running on your instance, how many resources they are consuming, etc.) and instance type.
Hi @combor, @KIVagant, now that #1310 is merged, if you want to deploy an agent build containing this fix in your test cluster to verify whether it helps with the disconnect issue, please send your account IDs and the region where your cluster is deployed to aithal at amazon dot com. We can share a pre-release build of the ECS agent with you, which you can deploy in your test setup to validate the fix.
Thanks, Anirudh
Thanks @sharanyad
A brief background about my setup might also be helpful. The cluster consists of two EC2 instances, and I’ve got 7 services, each with two instances of a task. If I scale one of the services to 12, it starts them, but after a while all services in the entire cluster are reported as unhealthy (by the ALB) and killed. Later they stay in the PENDING state. I’ve got another set of logs, but this time I waited 15 minutes for reconnection. There’s no "reconnection to ACS" message at all, but the error is:
and some nice golang printf 😃
Which email shall I use to send logs?