amazon-ecs-agent: Stopping fluent-bit (in firelens configuration) hangs the application container
Summary
The essential application container does not stop when the fluent-bit container in the same task definition is killed
Description
I am running an ECS service with 2 containers in the task: the main nginx container and a fluent-bit log-collector. Both containers are marked essential. When the log-collector container is stopped manually, the nginx container does not stop. Checking the ECS agent and Docker logs shows that SIGTERM and then SIGKILL are sent to the main nginx container, and the nginx process on the host machine does stop. However, docker ps still shows the nginx container as Up, and the task still shows as RUNNING.
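To confirm the mismatch between Docker's view and the host, a quick check along these lines can help (a minimal sketch; the placeholders are mine, not from the original report):

# What Docker believes about the stuck container: its status and the PID it thinks is running
docker inspect --format '{{.State.Status}} pid={{.State.Pid}}' <app-container-id>
# Check on the host whether that PID actually still exists
ps -p <pid-from-above> -o pid,comm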
Steps to reproduce - use the following task definition:
{
  "family": "firelens-http",
  "taskRoleArn": "arn:aws:iam::919027951404:role/ECSTaskRoleForCW",
  "containerDefinitions": [
    {
      "name": "continuelogs",
      "image": "httpd",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "ap-southeast-2",
          "log_group_name": "firelens-fluent-bit",
          "auto_create_group": "true",
          "log_stream_prefix": "from-fluent-bit"
        }
      },
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ],
      "memory": 50,
      "cpu": 102
    },
    {
      "name": "log_router",
      "essential": true,
      "image": "amazon/aws-for-fluent-bit:latest",
      "firelensConfiguration": {
        "type": "fluentbit"
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "firelens-container",
          "awslogs-region": "ap-southeast-2",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "memoryReservation": 50
    }
  ]
}
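For reference, registering and launching this task definition could look roughly like the AWS CLI sketch below; the file name and cluster name are placeholders, not from the original report.

# Register the task definition saved as firelens-http.json, then run it on a cluster
aws ecs register-task-definition --cli-input-json file://firelens-http.json
aws ecs run-task --cluster <your-cluster> --task-definition firelens-http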
I then ran curl against the httpd container in a loop so that it generates a constant stream of logs, using the following bash script:
while [ 1 ]; do curl <ECS instance IP>; done
Then I stopped the log router container with docker stop. The other container got stuck in the same weird state as the customer’s.
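On an ECS container instance the agent generates the Docker container names, so one way to locate and stop the FireLens container is to filter on the container-definition name (a minimal sketch, assuming the task definition above):

# Find the fluent-bit container started by the ECS agent for this task
docker ps --filter "name=log_router"
# Stop it by ID, using the ID printed by the previous command
docker stop <container-id>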
Expected Behavior
Both containers should stop. docker ps should not show any running containers.
Observed Behavior
Only the fluent-bit container stops. The application container (nginx) is still shown as running by docker ps and even on the ECS console.
Environment Details
Supporting Log Snippets
Logs from the dockerd daemon -
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.746807408Z" level=info msg="Container 5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be failed to exit within 1500 seconds of signal 15 - using the force"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.746844525Z" level=debug msg="Sending kill signal 9 to container 5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.750416996Z" level=debug msg="Running health check for container a1bea68e06e37cd33500158048bddda96c4f9d96ced56b5fcc3ac4a275c7aa56 ..."
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.750616958Z" level=debug msg="starting exec command 2670ffe75a8b9ae084b0894c9367c86fe0687d4e6660c37a76280c6da8852946 in container a1bea68e06e37cd33500158048bddda96c4f9d96ced56b5fcc3ac4a275c7aa56"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.752485060Z" level=debug msg="attach: stdout: begin"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.752526920Z" level=debug msg="attach: stderr: begin"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.802146328Z" level=debug msg="Client context cancelled, stop sending events"
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal dockerd[32588]: time="2020-09-18T00:34:06.847525798Z" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
Sep 18 00:34:06 ip-10-75-67-189.us-west-2.compute.internal containerd[2972]: time="2020-09-18T00:34:06.864594387Z" level=info msg="shim reaped" id=5979c532a446d0256e7a809a9d137a46dfd8cd32c70aa1ddce01d1cee8f494be
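For anyone reproducing this, the same dockerd entries can be pulled from the instance journal and narrowed to the stuck container; a sketch using the container ID from the snippet above:

# Dump Docker daemon logs and filter for the container that failed to exit
journalctl -u docker --no-pager | grep 5979c532a446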
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 30 (23 by maintainers)
@singholt Your comment says 20.10.3 but I think you mean 20.10.13
Yay they finally released it! https://docs.docker.com/engine/release-notes/#201013
I’m attempting to get the Docker maintainers to agree to backport this to the 20.10 branch, which means it could be released much sooner: https://github.com/moby/moby/pull/43147
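Once a patched Engine ships, confirming what is actually running on a container instance is a one-liner (sketch; compare the output against whichever version the linked release notes say contains the fix):

# Print the Docker Engine (server) version on the ECS container instance
docker version --format '{{.Server.Version}}'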
@fschollmeyer Fluent bit failing to start sounds like a separate issue?
My understanding of this issue is that it’s only triggered once the app container and the FireLens/fluent-bit container are connected via the fluentd Docker log driver and logs start to flow, and then the fluent-bit container is stopped. @fenxiong / anyone - is my understanding correct here?
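If that understanding is right, the FireLens side of this should be reproducible outside ECS with just the fluentd Docker log driver; a minimal, untested sketch (the image, port, and fluent-bit arguments are my assumptions, not taken from this issue):

# Start a stand-in log router with a forward input, similar to what FireLens configures
docker run -d --name logrouter -p 24224:24224 fluent/fluent-bit -i forward -o stdout
# Start an app container whose logs flow to it via the fluentd log driver
docker run -d --name app --log-driver fluentd --log-opt fluentd-address=localhost:24224 httpd
# Stop the log router first, then try to stop the app container
docker stop logrouter
docker stop app   # on affected Docker versions this is where the hang shows up
docker ps         # "app" may still be listed as Up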
If you’re having trouble with Fluent Bit, and you use the AWS Distro, please open an issue here and we will help you: https://github.com/aws/aws-for-fluent-bit
This confuses me a bit - can you elaborate? If you are running in ECS, the EC2 instance itself is not what triggers container restarts; the ECS service can re-schedule tasks if they fail to start. It sounds like that’s what’s happening in your case? Or are you seeing that the task is restarted, but the main container from each task never stops, so you end up with more and more instances of the app container?