amazon-ecs-agent: Unavailable Agent with Broken Pipe in Logs
Summary
We are using the Amazon ECS-Optimized Amazon Linux AMI 2017.09.g with ECS agent 1.17.1, but quite often an instance shows Agent Connected: false in the ECS Instances tab of the ECS cluster dashboard. This obviously causes issues with deployments.
Description
Today I checked the logs on a box where the agent shows as disconnected. When I log on to the server, it looks like everything stalled yesterday. The agent container is still running according to docker ps and the disks aren't full; it seems to have stalled all of a sudden, and I can't see an obvious reason why.
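One quick local check (just a sketch, assuming the agent is running with its default introspection port, 51678) is to query the agent's own metadata endpoint and see whether it responds at all:

# Ask the local ECS agent introspection API for its view of the instance.
# A healthy agent returns the cluster name, container instance ARN, and agent version.
curl -s http://localhost:51678/v1/metadata

If the call hangs or errors, the agent process itself is stuck, rather than just its connection to the ECS backend.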
Expected Behavior
The instance should show Agent Connected: true in the ECS Instances panel.
Observed Behavior
The instance shows Agent Connected: false in the ECS Instances panel.
Environment Details
[ec2-user@ip-172-X-X-X ecs]$ docker ps | grep agent
1ab37d4df9dc amazon/amazon-ecs-agent:latest "/agent" 3 days ago Up 3 days ecs-agent
[ec2-user@ip-172-X-X-X ecs]$ uname -a
Linux ip-172-X-X-X 4.9.76-3.78.amzn1.x86_64 #1 SMP Fri Jan 12 19:51:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[ec2-user@ip-172-X-X-X ecs]$ docker info
Containers: 51
Running: 4
Paused: 0
Stopped: 47
Images: 12
Server Version: 17.09.1-ce
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 10.74GB
Backing Filesystem: ext4
Data file:
Metadata file:
Data Space Used: 5.718GB
Data Space Total: 23.33GB
Data Space Available: 17.61GB
Metadata Space Used: 3.568MB
Metadata Space Total: 25.17MB
Metadata Space Available: 21.6MB
Thin Pool Minimum Free Space: 2.333GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.76-3.78.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.455GiB
Name: ip-172-31-3-153
ID: RKXD:IU6K:PRO5:BCQE:QBDE:XDOI:VK5T:YTHH:SY77:3SRX:CZUI:KTND
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Supporting Log Snippets
[ec2-user@ip-172-X-X-X ecs]$ date
Thu Mar 8 19:10:49 UTC 2018
[ec2-user@ip-172-X-X-X ecs]$ ls -ltr /var/log/ecs
total 1552
-rw-r--r-- 1 root root 215 Feb 5 21:38 ecs-init.log.2018-02-05-21
-rw-r--r-- 1 root root 3451 Feb 21 19:26 ecs-init.log.2018-02-21-19
-rw-r--r-- 1 root root 2791 Mar 5 13:53 ecs-init.log
-rw-r--r-- 1 root root 38947 Mar 6 12:58 ecs-agent.log.2018-03-06-12
-rw-r--r-- 1 root root 35704 Mar 6 13:58 ecs-agent.log.2018-03-06-13
-rw-r--r-- 1 root root 30675 Mar 6 14:58 ecs-agent.log.2018-03-06-14
-rw-r--r-- 1 root root 57038 Mar 6 15:58 ecs-agent.log.2018-03-06-15
-rw-r--r-- 1 root root 29341 Mar 6 16:58 ecs-agent.log.2018-03-06-16
-rw-r--r-- 1 root root 64408 Mar 6 17:59 ecs-agent.log.2018-03-06-17
-rw-r--r-- 1 root root 19333 Mar 6 18:59 ecs-agent.log.2018-03-06-18
-rw-r--r-- 1 root root 69912 Mar 6 19:59 ecs-agent.log.2018-03-06-19
-rw-r--r-- 1 root root 57046 Mar 6 20:59 ecs-agent.log.2018-03-06-20
-rw-r--r-- 1 root root 70424 Mar 6 21:59 ecs-agent.log.2018-03-06-21
-rw-r--r-- 1 root root 39473 Mar 6 22:59 ecs-agent.log.2018-03-06-22
-rw-r--r-- 1 root root 49076 Mar 6 23:59 ecs-agent.log.2018-03-06-23
-rw-r--r-- 1 root root 40659 Mar 7 00:59 ecs-agent.log.2018-03-07-00
-rw-r--r-- 1 root root 27373 Mar 7 01:59 ecs-agent.log.2018-03-07-01
-rw-r--r-- 1 root root 49209 Mar 7 02:59 ecs-agent.log.2018-03-07-02
-rw-r--r-- 1 root root 26913 Mar 7 03:59 ecs-agent.log.2018-03-07-03
-rw-r--r-- 1 root root 36195 Mar 7 04:59 ecs-agent.log.2018-03-07-04
-rw-r--r-- 1 root root 50496 Mar 7 05:59 ecs-agent.log.2018-03-07-05
-rw-r--r-- 1 root root 27239 Mar 7 06:59 ecs-agent.log.2018-03-07-06
-rw-r--r-- 1 root root 52714 Mar 7 07:59 ecs-agent.log.2018-03-07-07
-rw-r--r-- 1 root root 81417 Mar 7 08:59 ecs-agent.log.2018-03-07-08
-rw-r--r-- 1 root root 35518 Mar 7 09:59 ecs-agent.log.2018-03-07-09
-rw-r--r-- 1 root root 92999 Mar 7 10:58 ecs-agent.log.2018-03-07-10
-rw-r--r-- 1 root root 98361 Mar 7 11:59 ecs-agent.log.2018-03-07-11
-rw-r--r-- 1 root root 254023 Mar 7 12:32 ecs-agent.log.2018-03-07-12
[ec2-user@ip-172-X-X-X ecs]$ tail -n 3 ecs-agent.log.2018-03-07-12
2018-03-07T12:32:11Z [INFO] Task engine [arn:aws:ecs:eu-west-1:XXXXXXXXXXX:task/fe77ee7c-6a94-407f-ae55-XXXXXXXXXXX]: created docker container for task: cr-python -> fb2196d2b40de440df1779c8fcdc13d78c07fc6db4d4d24bb63f79538c5f1765
2018-03-07T12:32:11Z [INFO] Managed task [arn:aws:ecs:eu-west-1:XXXXXXXXXXX:task/fe77ee7c-6a94-407f-ae55-XXXXXXXXXXX]: unable to create task state change event []: create task state change event api: status not recognized by ECS: CREATED
2018-03-07T12:32:11Z [INFO] Managed task [arn:aws:ecs:eu-west-1:XXXXXXXXXXX:task/fe77ee7c-6a94-407f-ae55-XXXXXXXXXXX]: Cgroup resource set up for task complete
[ec2-user@ip-172-X-X-X ecs]$ tail -n 10 /var/log/docker
time="2018-03-05T13:53:36.769029714Z" level=error msg="attach failed with error: write unix /var/run/docker.sock->@: write: broken pipe"
time="2018-03-06T10:58:49.063197272Z" level=info msg="Container fc8a093d9651861aef89f0af1af11f12c236f609594be800f72954b539222ec7 failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T10:16:01.716471056Z" level=info msg="Container f35e107904a68a8751bda367ebac743504efbdf51516f086817ed37ee878e75b failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T11:57:57.865250713Z" level=info msg="Container c725342bcc3428862e4cebcbdf6963c6baea04aeefd529058a445f0845df21d1 failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T11:58:11.185701623Z" level=info msg="Container 9a85c76d894cd88b4979ee84fe643865799404f930be60f3f7a10563968d05e4 failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T12:00:27.962312001Z" level=info msg="Container 33261909ea5f975c275ef20efa109f152b53e76dbe331e21618f355c5bd6d5d6 failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T12:00:49.711890043Z" level=info msg="Container 6b7411701c14d10c8d27008dcbc24e67650ea4ffb5bc48ed43f5adc00643f3b5 failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T12:00:49.711901581Z" level=info msg="Container 4950f58c0cd263163f9e15e72b44234c3edf52bc400b2dbdbc8b53f0c3b3d73c failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T12:00:49.713797412Z" level=info msg="Container b75d5b1d76de9fa000da8569c3de9657c551417633eedc69f14648fb08b9beb6 failed to exit within 30 seconds of signal 15 - using the force"
time="2018-03-07T12:00:49.714540086Z" level=info msg="Container 1c10e69babd43e8963a9e00b648e0eaae5fce9c496f785de99b1640c24aaefd4 failed to exit within 30 seconds of signal 15 - using the force"
[ec2-user@ip-172-X-X-X ecs]$ sudo lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
docker-pool docker twi-aot--- 21.73g 24.51 14.18
About this issue
- State: closed
- Created 6 years ago
- Comments: 16 (2 by maintainers)
Confirming that that’s the one you’d need to monitor.
Also, thanks for getting back to us confirming the hypothesis. I’ll close this issue for now. Feel free to create a new one if you have any followup questions/comments.
Thanks, Anirudh
Hi @andy-plutoflume, thanks for sending those logs. I looked at the utilization statistics of your instance and it seems like you're running out of EBS IO credits. Looking at the past 15 days, it seems like it's happened on 3 distinct occasions (all times are UTC):
If you're running IO-intensive workloads, can you please make sure that you are either using larger volumes or provisioned IOPS so that you don't run into this issue? Here's a write-up of how you can look at these metrics, with some recommendations on remediating the issue: New – Burst Balance Metric for EC2's General Purpose SSD (gp2) Volumes.
Not being able to do any IO cripples the Docker daemon and any container/process depending on it. Please let us know if this helps with your issue.
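For reference, a minimal way to check this yourself (assuming you know the volume ID of the instance's root EBS volume; the ID below is a placeholder) is to pull the BurstBalance metric with the AWS CLI:

# Fetch the gp2 BurstBalance (percent of burst credits remaining) for one volume.
# A value at or near 0 means the volume has exhausted its IO burst credits.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2018-03-06T00:00:00Z \
  --end-time 2018-03-08T00:00:00Z \
  --period 3600 \
  --statistics Minimum

You can also put a CloudWatch alarm on this metric so you're notified before the credits run out, rather than after the agent drops off the cluster.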
Thanks, Anirudh