aws-for-fluent-bit: Stops sending logs after connection/TLS failure
Describe the question/issue
The aws-for-fluent-bit log router stops sending logs to Loki through an HTTPS proxy after a connection/TLS failure. The container sometimes exits shortly afterward with nothing in its log to indicate why. This restarts the entire ECS task, because I run the log router container with essential=true so that we don't lose logs for a long period of time.
I have searched the issues here and in the fluent-bit repo. I have also searched the Grafana and Fluent Slack communities.
Configuration
Deployment:
- ECS cluster (not Fargate)
- Primary container is using awsfirelens to route logs to the aws-for-fluent-bit container
- The aws-for-fluent-bit container is routing logs to a Loki task in the same cluster
- The Loki task is using S3 for storage
Relevant parts of the ECS task definition. The first container is the web app and the second is the log router:
{
"containerDefinitions": [
{
"logConfiguration": {
"logDriver": "awsfirelens",
"options": {
"Name": "loki",
"host": "loki-qa.elimuinformatics.com",
"port": "443",
"tls": "On",
"http_user": "loki",
"http_passwd": "<hidden>",
"net.keepalive": "false",
"workers": "1",
"Retry_Limit": "5",
"labels": "env=qa,service=sapphire-web",
"label_keys": "$ecs_task_definition,$ec2_instance_id",
"remove_keys": "container_id,container_name,ecs_task_arn",
"line_format": "key_value"
}
},
"memory": 4096,
"memoryReservation": 256,
"dependsOn": [
{
"containerName": "sapphire-web-log-router",
"condition": "HEALTHY"
}
],
"essential": true,
"name": "sapphire-web-qa"
},
{
"logConfiguration": {
"logDriver": "awslogs",
"secretOptions": null,
"options": {
"awslogs-group": "/aws/ecs/firelens/sapphire-qa",
"awslogs-region": "us-west-2",
"awslogs-create-group": "true",
"awslogs-stream-prefix": "firelens"
}
},
"environment": [
{
"name": "FLB_LOG_LEVEL",
"value": "debug"
}
],
"memoryReservation": 50,
"firelensConfiguration": {
"type": "fluentbit",
"options": {
"config-file-type": "file",
"enable-ecs-log-metadata": "true",
"config-file-value": "/extra.conf"
}
},
"healthCheck": {
"retries": 2,
"command": [
"CMD-SHELL",
"curl -f http://127.0.0.1:2020/api/v1/uptime || exit 1"
],
"timeout": 5,
"interval": 10,
"startPeriod": 30
},
"essential": true,
"name": "sapphire-web-log-router"
}
]
}
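For reference, FireLens translates the logConfiguration options above into an [OUTPUT] section in the Fluent Bit config it generates. A sketch of roughly what that generated section should look like (the Match pattern follows FireLens's <container name>-firelens* tagging convention; the exact generated output can differ by image version):

[OUTPUT]
    Name          loki
    Match         sapphire-web-qa-firelens*
    host          loki-qa.elimuinformatics.com
    port          443
    tls           On
    http_user     loki
    http_passwd   <hidden>
    net.keepalive false
    workers       1
    Retry_Limit   5
    labels        env=qa,service=sapphire-web
    label_keys    $ecs_task_definition,$ec2_instance_id
    remove_keys   container_id,container_name,ecs_task_arn
    line_format   key_value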
The extra.conf file contains:
[INPUT]
Name forward
unix_path /var/run/fluent.sock
Mem_Buf_Limit 2MB
[SERVICE]
Flush 5
Grace 30
# Health check setup
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
Health_Check On
HC_Errors_Count 5
HC_Retry_Failure_Count 5
HC_Period 30
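Note that the Health_Check settings above control Fluent Bit's /api/v1/health endpoint, while the task definition's health check polls /api/v1/uptime, which only reports how long the process has been running. If the goal is for repeated flush failures (such as the TLS errors below) to mark the container unhealthy, the health check command would need to hit the health endpoint instead, along these lines:

curl -f http://127.0.0.1:2020/api/v1/health || exit 1

With Health_Check On, that endpoint returns HTTP 500 once the error and retry-failure counts exceed HC_Errors_Count / HC_Retry_Failure_Count within HC_Period, so ECS would restart the log router rather than leave it running in a failed state.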
Fluent Bit Log Output
Here's a partial log file: the error starts, the container fails to send any further logs (even on the retries), and then exits, killing the entire task because I have essential=true.
| timestamp | message |
|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <snip> |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [task] created task=0x7f35e0a4a200 id=0 OK |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660308990296 | [2022/08/12 12:56:30] [error] [tls] error: unexpected EOF |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443 |
| 1660308990296 | [2022/08/12 12:56:30] [error] [output:loki:loki.1] no upstream connections available |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [retry] new retry created for task_id=0 attempts=1 |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [out flush] cb_destroy coro_id=326 |
| 1660308990296 | [2022/08/12 12:56:30] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 6 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660308991749 | [2022/08/12 12:56:31] [debug] [input chunk] update output instances with new chunk size diff=638 |
| 1660308992333 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660308992689 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=672 |
| 1660308993279 | [2022/08/12 12:56:33] [debug] [input chunk] update output instances with new chunk size diff=695 |
| 1660308995289 | [2022/08/12 12:56:35] [debug] [task] created task=0x7f35e0a4a430 id=1 OK |
| 1660308996289 | [2022/08/12 12:56:36] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660308996295 | [2022/08/12 12:56:36] [error] [tls] error: unexpected EOF |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443 |
| 1660308996295 | [2022/08/12 12:56:36] [error] [output:loki:loki.1] no upstream connections available |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [out flush] cb_destroy coro_id=327 |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [retry] re-using retry for task_id=0 attempts=2 |
| 1660308996295 | [2022/08/12 12:56:36] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 15 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660308999947 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=657 |
| 1660308999978 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=651 |
| 1660309000289 | [2022/08/12 12:56:40] [debug] [task] created task=0x7f35e0a4a4a0 id=2 OK |
| 1660309011289 | [2022/08/12 12:56:51] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660309011300 | [2022/08/12 12:56:51] [error] [tls] error: unexpected EOF |
| 1660309011300 | [2022/08/12 12:56:51] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443 |
| 1660309011301 | [2022/08/12 12:56:51] [error] [output:loki:loki.1] no upstream connections available |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [out flush] cb_destroy coro_id=328 |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [retry] re-using retry for task_id=0 attempts=3 |
| 1660309011301 | [2022/08/12 12:56:51] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 11 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660309016156 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309016364 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309020289 | [2022/08/12 12:57:00] [debug] [task] created task=0x7f35e0a4a510 id=3 OK |
| 1660309022289 | [2022/08/12 12:57:02] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660309022299 | [2022/08/12 12:57:02] [error] [tls] error: unexpected EOF |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443 |
| 1660309022299 | [2022/08/12 12:57:02] [error] [output:loki:loki.1] no upstream connections available |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [out flush] cb_destroy coro_id=329 |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [retry] re-using retry for task_id=0 attempts=4 |
| 1660309022299 | [2022/08/12 12:57:02] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 66 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660309022809 | [2022/08/12 12:57:02] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309025290 | [2022/08/12 12:57:05] [debug] [task] created task=0x7f35e0a4a580 id=4 OK |
| 1660309046143 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309046357 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309050289 | [2022/08/12 12:57:30] [debug] [task] created task=0x7f35e0a4a5f0 id=5 OK |
| 1660309053057 | [2022/08/12 12:57:33] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309055289 | [2022/08/12 12:57:35] [debug] [task] created task=0x7f35e0a4a660 id=6 OK |
| 1660309076181 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309076364 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309080289 | [2022/08/12 12:58:00] [debug] [task] created task=0x7f35e0a4a6d0 id=7 OK |
| 1660309080963 | [2022/08/12 12:58:00] [debug] [input chunk] update output instances with new chunk size diff=546 |
| 1660309082603 | [2022/08/12 12:58:02] [debug] [input chunk] update output instances with new chunk size diff=651 |
| 1660309083323 | [2022/08/12 12:58:03] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309085289 | [2022/08/12 12:58:05] [debug] [task] created task=0x7f35e0a4a740 id=8 OK |
| 1660309088289 | [2022/08/12 12:58:08] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660309088298 | [2022/08/12 12:58:08] [error] [tls] error: unexpected EOF |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443 |
| 1660309088298 | [2022/08/12 12:58:08] [error] [output:loki:loki.1] no upstream connections available |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [out flush] cb_destroy coro_id=330 |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [retry] re-using retry for task_id=0 attempts=5 |
| 1660309088298 | [2022/08/12 12:58:08] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 131 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660309088644 | [2022/08/12 12:58:08] [debug] [input chunk] update output instances with new chunk size diff=972 |
| 1660309090290 | [2022/08/12 12:58:10] [debug] [task] created task=0x7f35e0a4a7b0 id=9 OK |
| 1660309106169 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309106381 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309110289 | [2022/08/12 12:58:30] [debug] [task] created task=0x7f35e0a4a820 id=10 OK |
| 1660309113599 | [2022/08/12 12:58:33] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309115289 | [2022/08/12 12:58:35] [debug] [task] created task=0x7f35e0a4a890 id=11 OK |
| 1660309136181 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309136423 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309140289 | [2022/08/12 12:59:00] [debug] [task] created task=0x7f35e0a4a900 id=12 OK |
| 1660309144188 | [2022/08/12 12:59:04] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309145290 | [2022/08/12 12:59:05] [debug] [task] created task=0x7f35e0a4a970 id=13 OK |
| 1660309166174 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309166398 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309170289 | [2022/08/12 12:59:30] [debug] [task] created task=0x7f35e0a4a9e0 id=14 OK |
| 1660309174493 | [2022/08/12 12:59:34] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309175290 | [2022/08/12 12:59:35] [debug] [task] created task=0x7f35e0a4aa50 id=15 OK |
| 1660309196186 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309196409 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497 |
| 1660309200289 | [2022/08/12 13:00:00] [debug] [task] created task=0x7f35e0a4aac0 id=16 OK |
| 1660309204764 | [2022/08/12 13:00:04] [debug] [input chunk] update output instances with new chunk size diff=477 |
| 1660309205289 | [2022/08/12 13:00:05] [debug] [task] created task=0x7f35e0a4ab30 id=17 OK |
| 1660309219293 | [2022/08/12 13:00:19] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0 |
| 1660309219298 | [2022/08/12 13:00:19] [error] [tls] error: unexpected EOF |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443 |
| 1660309219298 | [2022/08/12 13:00:19] [error] [output:loki:loki.1] no upstream connections available |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [out flush] cb_destroy coro_id=331 |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] task_id=0 reached retry-attempts limit 5/5 |
| 1660309219298 | [2022/08/12 13:00:19] [ warn] [engine] chunk '1-1660308985.805729984.flb' cannot be retried: task_id=0, input=forward.3 > output=loki.1 |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] destroy task=0x7f35e0a4a200 (task_id=0) |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=1 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=2 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=3 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=4 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=5 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=6 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=7 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=8 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=9 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=10 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=11 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=12 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=13 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=14 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=15 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=16 assigned to thread #0 |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=17 assigned to thread #0 |
| 1660309220465 | AWS for Fluent Bit Container Image Version 2.27.0 |
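The "reached retry-attempts limit 5/5" and "chunk cannot be retried" lines match the Retry_Limit of 5 set in the task definition: each retry is scheduled with a randomized, growing backoff, and once a chunk exhausts its retries it is discarded rather than re-queued. A minimal sketch of settings that keep data around longer during an outage, assuming the documented behavior of Fluent Bit's retry scheduler and filesystem storage (values are illustrative):

[SERVICE]
    # bounds for the randomized retry backoff window, in seconds
    scheduler.base    5
    scheduler.cap     300
    # buffer chunks on disk so an outage or restart does not drop them
    storage.path      /var/log/flb-storage/

[OUTPUT]
    Name              loki
    # same Loki options as in the task definition above, plus:
    # retry indefinitely instead of discarding after 5 attempts
    Retry_Limit       no_limits
    # pair disk buffering with a cap on how much this output can queue
    storage.type      filesystem
    storage.total_limit_size 50M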
Fluent Bit Version Info
We are running the current stable version (the startup banner in the log above shows AWS for Fluent Bit Container Image Version 2.27.0).
Cluster Details
- ECS cluster using EC2 (not Fargate)
- Load balancer
- Sidecar deployment
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 21 (18 by maintainers)
Any updates? How can I help to move this forward? We are still using the debug container because the non-debug one fails intermittently.
@wick02 Yeah, I think a different issue is needed for this. Also explain in it why the upstream loki output doesn't work for you: https://docs.fluentbit.io/manual/pipeline/outputs/loki