amazon-ecs-agent: Unavailable Agent with Broken Pipe in Logs

Summary

We are using the Amazon ECS-Optimized Amazon Linux AMI 2017.09.g with ECS agent 1.17.1, but quite often see Agent Connected: false for instances in the ECS Instances tab of the ECS cluster console.

This obviously causes issues with deployment.

Description

Today I checked the logs for a box with a disconnected ECS agent. When I log on to the server it looks like everything stalled yesterday, and I can't see an obvious reason why.

The agent container is still running according to docker ps and the disks aren't full, but everything appears to have stalled all of a sudden.
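
As a sanity check when this happens, one thing that can be tried locally (a rough sketch, assuming the standard ECS-Optimized Amazon Linux setup where ecs-init manages the agent as an upstart job) is to query the agent's introspection endpoint and, if it hangs, bounce the agent:

# The introspection API normally answers immediately when the agent is healthy
curl -s http://localhost:51678/v1/metadata

# Restart the agent via ecs-init (upstart on Amazon Linux 1)
sudo stop ecs
sudo start ecs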

Expected Behavior

The instance should show "Agent Connected": "true" in the ECS console.

Observed Behavior

The instance shows "Agent Connected": "false" in the ECS console.
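
The same connectivity state can also be read from the ECS API; a minimal sketch, assuming a cluster named my-cluster and a placeholder container instance ID:

aws ecs list-container-instances --cluster my-cluster
aws ecs describe-container-instances \
    --cluster my-cluster \
    --container-instances <container-instance-id> \
    --query 'containerInstances[].[ec2InstanceId,agentConnected]'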

Environment Details


[ec2-user@ip-172-X-X-X ecs]$ docker ps | grep agent
1ab37d4df9dc        amazon/amazon-ecs-agent:latest                             "/agent"                 3 days ago          Up 3 days                                     ecs-agent

[ec2-user@ip-172-X-X-X ecs]$ uname -a
Linux ip-172-X-X-X 4.9.76-3.78.amzn1.x86_64 #1 SMP Fri Jan 12 19:51:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


[ec2-user@ip-172-X-X-X ecs]$ docker info
Containers: 51
 Running: 4
 Paused: 0
 Stopped: 47
Images: 12
Server Version: 17.09.1-ce
Storage Driver: devicemapper
 Pool Name: docker-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 10.74GB
 Backing Filesystem: ext4
 Data file: 
 Metadata file: 
 Data Space Used: 5.718GB
 Data Space Total: 23.33GB
 Data Space Available: 17.61GB
 Metadata Space Used: 3.568MB
 Metadata Space Total: 25.17MB
 Metadata Space Available: 21.6MB
 Thin Pool Minimum Free Space: 2.333GB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.76-3.78.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.455GiB
Name: ip-172-31-3-153
ID: RKXD:IU6K:PRO5:BCQE:QBDE:XDOI:VK5T:YTHH:SY77:3SRX:CZUI:KTND
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Supporting Log Snippets

[ec2-user@ip-172-X-X-X ecs]$ date                             
Thu Mar  8 19:10:49 UTC 2018

[ec2-user@ip-172-X-X-X ecs]$ ls -ltr /var/log/ecs
total 1552
-rw-r--r-- 1 root root    215 Feb  5 21:38 ecs-init.log.2018-02-05-21
-rw-r--r-- 1 root root   3451 Feb 21 19:26 ecs-init.log.2018-02-21-19
-rw-r--r-- 1 root root   2791 Mar  5 13:53 ecs-init.log
-rw-r--r-- 1 root root  38947 Mar  6 12:58 ecs-agent.log.2018-03-06-12
-rw-r--r-- 1 root root  35704 Mar  6 13:58 ecs-agent.log.2018-03-06-13
-rw-r--r-- 1 root root  30675 Mar  6 14:58 ecs-agent.log.2018-03-06-14
-rw-r--r-- 1 root root  57038 Mar  6 15:58 ecs-agent.log.2018-03-06-15
-rw-r--r-- 1 root root  29341 Mar  6 16:58 ecs-agent.log.2018-03-06-16
-rw-r--r-- 1 root root  64408 Mar  6 17:59 ecs-agent.log.2018-03-06-17
-rw-r--r-- 1 root root  19333 Mar  6 18:59 ecs-agent.log.2018-03-06-18
-rw-r--r-- 1 root root  69912 Mar  6 19:59 ecs-agent.log.2018-03-06-19
-rw-r--r-- 1 root root  57046 Mar  6 20:59 ecs-agent.log.2018-03-06-20
-rw-r--r-- 1 root root  70424 Mar  6 21:59 ecs-agent.log.2018-03-06-21
-rw-r--r-- 1 root root  39473 Mar  6 22:59 ecs-agent.log.2018-03-06-22
-rw-r--r-- 1 root root  49076 Mar  6 23:59 ecs-agent.log.2018-03-06-23
-rw-r--r-- 1 root root  40659 Mar  7 00:59 ecs-agent.log.2018-03-07-00
-rw-r--r-- 1 root root  27373 Mar  7 01:59 ecs-agent.log.2018-03-07-01
-rw-r--r-- 1 root root  49209 Mar  7 02:59 ecs-agent.log.2018-03-07-02
-rw-r--r-- 1 root root  26913 Mar  7 03:59 ecs-agent.log.2018-03-07-03
-rw-r--r-- 1 root root  36195 Mar  7 04:59 ecs-agent.log.2018-03-07-04
-rw-r--r-- 1 root root  50496 Mar  7 05:59 ecs-agent.log.2018-03-07-05
-rw-r--r-- 1 root root  27239 Mar  7 06:59 ecs-agent.log.2018-03-07-06
-rw-r--r-- 1 root root  52714 Mar  7 07:59 ecs-agent.log.2018-03-07-07
-rw-r--r-- 1 root root  81417 Mar  7 08:59 ecs-agent.log.2018-03-07-08
-rw-r--r-- 1 root root  35518 Mar  7 09:59 ecs-agent.log.2018-03-07-09
-rw-r--r-- 1 root root  92999 Mar  7 10:58 ecs-agent.log.2018-03-07-10
-rw-r--r-- 1 root root  98361 Mar  7 11:59 ecs-agent.log.2018-03-07-11
-rw-r--r-- 1 root root 254023 Mar  7 12:32 ecs-agent.log.2018-03-07-12

[ec2-user@ip-172-X-X-X ecs]$ tail -n 3 ecs-agent.log.2018-03-07-12
2018-03-07T12:32:11Z [INFO] Task engine [arn:aws:ecs:eu-west-1:XXXXXXXXXXX:task/fe77ee7c-6a94-407f-ae55-XXXXXXXXXXX]: created docker container for task: cr-python -> fb2196d2b40de440df1779c8fcdc13d78c07fc6db4d4d24bb63f79538c5f1765
2018-03-07T12:32:11Z [INFO] Managed task [arn:aws:ecs:eu-west-1:XXXXXXXXXXX:task/fe77ee7c-6a94-407f-ae55-XXXXXXXXXXX]: unable to create task state change event []: create task state change event api: status not recognized by ECS: CREATED
2018-03-07T12:32:11Z [INFO] Managed task [arn:aws:ecs:eu-west-1:XXXXXXXXXXX:task/fe77ee7c-6a94-407f-ae55-XXXXXXXXXXX]: Cgroup resource set up for task complete



[ec2-user@ip-172-X-X-X ecs]$ tail -n 10 /var/log/docker 
time="2018-03-05T13:53:36.769029714Z" level=error msg="attach failed with error: write unix /var/run/docker.sock->@: write: broken pipe" 
time="2018-03-06T10:58:49.063197272Z" level=info msg="Container fc8a093d9651861aef89f0af1af11f12c236f609594be800f72954b539222ec7 failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T10:16:01.716471056Z" level=info msg="Container f35e107904a68a8751bda367ebac743504efbdf51516f086817ed37ee878e75b failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T11:57:57.865250713Z" level=info msg="Container c725342bcc3428862e4cebcbdf6963c6baea04aeefd529058a445f0845df21d1 failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T11:58:11.185701623Z" level=info msg="Container 9a85c76d894cd88b4979ee84fe643865799404f930be60f3f7a10563968d05e4 failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T12:00:27.962312001Z" level=info msg="Container 33261909ea5f975c275ef20efa109f152b53e76dbe331e21618f355c5bd6d5d6 failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T12:00:49.711890043Z" level=info msg="Container 6b7411701c14d10c8d27008dcbc24e67650ea4ffb5bc48ed43f5adc00643f3b5 failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T12:00:49.711901581Z" level=info msg="Container 4950f58c0cd263163f9e15e72b44234c3edf52bc400b2dbdbc8b53f0c3b3d73c failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T12:00:49.713797412Z" level=info msg="Container b75d5b1d76de9fa000da8569c3de9657c551417633eedc69f14648fb08b9beb6 failed to exit within 30 seconds of signal 15 - using the force" 
time="2018-03-07T12:00:49.714540086Z" level=info msg="Container 1c10e69babd43e8963a9e00b648e0eaae5fce9c496f785de99b1640c24aaefd4 failed to exit within 30 seconds of signal 15 - using the force" 



[ec2-user@ip-172-X-X-X ecs]$ sudo lvs
  LV          VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool docker twi-aot--- 21.73g             24.51  14.18          

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 16 (2 by maintainers)

Most upvoted comments

Can you confirm that BurstBalance on EBS is the metric I need to monitor?

Confirming that that’s the one you’d need to monitor.
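
For reference, here's a minimal sketch of pulling that metric with the AWS CLI (the volume ID and time window below are placeholders for your Docker data volume and the period in question):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS \
    --metric-name BurstBalance \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2018-03-06T00:00:00Z \
    --end-time 2018-03-09T00:00:00Z \
    --period 3600 \
    --statistics Minimum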

Also, thanks for getting back to us confirming the hypothesis. I’ll close this issue for now. Feel free to create a new one if you have any followup questions/comments.

Thanks, Anirudh

Hi @andy-plutoflume, thanks for sending those logs. I looked at the utilization statistics of your instance and it seems like you're running out of EBS IO credits. Looking at the past 15 days, it seems like it's happened on 3 distinct occasions (all times are UTC):

  1. Between 03/07 00:00 and 03/09 09:00
  2. Between 03/15 03:00 and 03/15 17:00
  3. Between 03/21 06:00 and 03/21 14:00

If you're running IO-intensive workloads, can you please make sure that you are either using larger volumes or provisioned IOPS so as to not run into this issue? Here's a write-up of how you can look at these metrics, with some recommendations on remediating them: New – Burst Balance Metric for EC2's General Purpose SSD (gp2) Volumes.
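
To make the arithmetic concrete (a rough sketch; the volume ID and sizes below are placeholders): a gp2 volume has a baseline of 3 IOPS per GiB with a 100 IOPS floor, so a ~22 GiB Docker data volume like the one shown above sits at the 100 IOPS floor, and any sustained IO beyond that drains the burst-credit bucket until BurstBalance hits 0. Growing the volume raises the baseline, or the volume can be moved to provisioned IOPS:

# Grow the gp2 data volume, e.g. to 500 GiB for a 3 x 500 = 1500 IOPS baseline
# (the LVM thin pool/filesystem then also needs to be extended to use the new space)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 500

# ...or switch the volume to provisioned IOPS (io1), staying within the 50:1 IOPS-to-GiB limit
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type io1 --iops 1000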

Not being able to do any IO cripples the Docker daemon and any container or process that depends on it. Please let us know if this helps with your issue.

Thanks, Anirudh