aws-for-fluent-bit: Stops sending logs after connection/tls failure

Describe the question/issue

The aws-for-fluent-bit log router stops sending logs to Loki through an HTTPS proxy after a connection/TLS failure. The container sometimes exits shortly afterwards, with nothing in its log to indicate why. This restarts the entire ECS task, because the log router container is marked essential=true so that we don’t lose logs for a long period of time.

I have searched the issues here and in the fluent-bit repo. I have also searched the Grafana and Fluent slack communities.

Configuration

Deployment:

ECS cluster (not Fargate)
Primary container is using awsfirelens to route logs to the aws-for-fluent-bit container
The aws-for-fluent-bit container is routing logs to a loki task in the same cluster
The loki task is using S3 for storage

Relevant parts of the ECS task definition. The first container is the web app and the second is the log router:

{
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "loki",
          "host": "loki-qa.elimuinformatics.com",
          "port": "443",
          "tls": "On",
          "http_user": "loki",
          "http_passwd": "<hidden>",
          "net.keepalive": "false",
          "workers": "1",
          "Retry_Limit": "5",
          "labels": "env=qa,service=sapphire-web",
          "label_keys": "$ecs_task_definition,$ec2_instance_id",
          "remove_keys": "container_id,container_name,ecs_task_arn",
          "line_format": "key_value"
        }
      },
      "memory": 4096,
      "memoryReservation": 256,
      "dependsOn": [
        {
          "containerName": "sapphire-web-log-router",
          "condition": "HEALTHY"
        }
      ],
      "essential": true,
      "name": "sapphire-web-qa"
    },
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/aws/ecs/firelens/sapphire-qa",
          "awslogs-region": "us-west-2",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "environment": [
        {
          "name": "FLB_LOG_LEVEL",
          "value": "debug"
        }
      ],
      "memoryReservation": 50,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "enable-ecs-log-metadata": "true",
          "config-file-value": "/extra.conf"
        }
      },
      "healthCheck": {
        "retries": 2,
        "command": [
          "CMD-SHELL",
          "curl -f http://127.0.0.1:2020/api/v1/uptime || exit 1"
        ],
        "timeout": 5,
        "interval": 10,
        "startPeriod": 30
      },
      "essential": true,
      "name": "sapphire-web-log-router"
    }
  ]
}

The extra.conf file contains:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    Mem_Buf_Limit 2MB

[SERVICE]
    Flush 5
    Grace 30
    # Health check setup
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_PORT 2020
    Health_Check On 
    HC_Errors_Count 5 
    HC_Retry_Failure_Count 5 
    HC_Period 30

Fluent Bit Log Output

Here’s a partial log, starting where the error first appears. The container fails to send any more logs (even on the retries) and then exits, killing the entire task because the log router is marked essential=true.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                message                                                                                 |
|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <snip>                                                                                                                                                                                 |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [task] created task=0x7f35e0a4a200 id=0 OK                                                                                               |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660308990296 | [2022/08/12 12:56:30] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660308990296 | [2022/08/12 12:56:30] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [retry] new retry created for task_id=0 attempts=1                                                                                       |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [out flush] cb_destroy coro_id=326                                                                                                       |
| 1660308990296 | [2022/08/12 12:56:30] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 6 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)   |
| 1660308991749 | [2022/08/12 12:56:31] [debug] [input chunk] update output instances with new chunk size diff=638                                                                       |
| 1660308992333 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660308992689 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=672                                                                       |
| 1660308993279 | [2022/08/12 12:56:33] [debug] [input chunk] update output instances with new chunk size diff=695                                                                       |
| 1660308995289 | [2022/08/12 12:56:35] [debug] [task] created task=0x7f35e0a4a430 id=1 OK                                                                                               |
| 1660308996289 | [2022/08/12 12:56:36] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660308996295 | [2022/08/12 12:56:36] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660308996295 | [2022/08/12 12:56:36] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [out flush] cb_destroy coro_id=327                                                                                                       |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [retry] re-using retry for task_id=0 attempts=2                                                                                          |
| 1660308996295 | [2022/08/12 12:56:36] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 15 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660308999947 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=657                                                                       |
| 1660308999978 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=651                                                                       |
| 1660309000289 | [2022/08/12 12:56:40] [debug] [task] created task=0x7f35e0a4a4a0 id=2 OK                                                                                               |
| 1660309011289 | [2022/08/12 12:56:51] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309011300 | [2022/08/12 12:56:51] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309011300 | [2022/08/12 12:56:51] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309011301 | [2022/08/12 12:56:51] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [out flush] cb_destroy coro_id=328                                                                                                       |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [retry] re-using retry for task_id=0 attempts=3                                                                                          |
| 1660309011301 | [2022/08/12 12:56:51] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 11 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660309016156 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309016364 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309020289 | [2022/08/12 12:57:00] [debug] [task] created task=0x7f35e0a4a510 id=3 OK                                                                                               |
| 1660309022289 | [2022/08/12 12:57:02] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309022299 | [2022/08/12 12:57:02] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309022299 | [2022/08/12 12:57:02] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [out flush] cb_destroy coro_id=329                                                                                                       |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [retry] re-using retry for task_id=0 attempts=4                                                                                          |
| 1660309022299 | [2022/08/12 12:57:02] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 66 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660309022809 | [2022/08/12 12:57:02] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309025290 | [2022/08/12 12:57:05] [debug] [task] created task=0x7f35e0a4a580 id=4 OK                                                                                               |
| 1660309046143 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309046357 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309050289 | [2022/08/12 12:57:30] [debug] [task] created task=0x7f35e0a4a5f0 id=5 OK                                                                                               |
| 1660309053057 | [2022/08/12 12:57:33] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309055289 | [2022/08/12 12:57:35] [debug] [task] created task=0x7f35e0a4a660 id=6 OK                                                                                               |
| 1660309076181 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309076364 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309080289 | [2022/08/12 12:58:00] [debug] [task] created task=0x7f35e0a4a6d0 id=7 OK                                                                                               |
| 1660309080963 | [2022/08/12 12:58:00] [debug] [input chunk] update output instances with new chunk size diff=546                                                                       |
| 1660309082603 | [2022/08/12 12:58:02] [debug] [input chunk] update output instances with new chunk size diff=651                                                                       |
| 1660309083323 | [2022/08/12 12:58:03] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309085289 | [2022/08/12 12:58:05] [debug] [task] created task=0x7f35e0a4a740 id=8 OK                                                                                               |
| 1660309088289 | [2022/08/12 12:58:08] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309088298 | [2022/08/12 12:58:08] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309088298 | [2022/08/12 12:58:08] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [out flush] cb_destroy coro_id=330                                                                                                       |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [retry] re-using retry for task_id=0 attempts=5                                                                                          |
| 1660309088298 | [2022/08/12 12:58:08] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 131 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660309088644 | [2022/08/12 12:58:08] [debug] [input chunk] update output instances with new chunk size diff=972                                                                       |
| 1660309090290 | [2022/08/12 12:58:10] [debug] [task] created task=0x7f35e0a4a7b0 id=9 OK                                                                                               |
| 1660309106169 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309106381 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309110289 | [2022/08/12 12:58:30] [debug] [task] created task=0x7f35e0a4a820 id=10 OK                                                                                              |
| 1660309113599 | [2022/08/12 12:58:33] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309115289 | [2022/08/12 12:58:35] [debug] [task] created task=0x7f35e0a4a890 id=11 OK                                                                                              |
| 1660309136181 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309136423 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309140289 | [2022/08/12 12:59:00] [debug] [task] created task=0x7f35e0a4a900 id=12 OK                                                                                              |
| 1660309144188 | [2022/08/12 12:59:04] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309145290 | [2022/08/12 12:59:05] [debug] [task] created task=0x7f35e0a4a970 id=13 OK                                                                                              |
| 1660309166174 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309166398 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309170289 | [2022/08/12 12:59:30] [debug] [task] created task=0x7f35e0a4a9e0 id=14 OK                                                                                              |
| 1660309174493 | [2022/08/12 12:59:34] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309175290 | [2022/08/12 12:59:35] [debug] [task] created task=0x7f35e0a4aa50 id=15 OK                                                                                              |
| 1660309196186 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309196409 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309200289 | [2022/08/12 13:00:00] [debug] [task] created task=0x7f35e0a4aac0 id=16 OK                                                                                              |
| 1660309204764 | [2022/08/12 13:00:04] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309205289 | [2022/08/12 13:00:05] [debug] [task] created task=0x7f35e0a4ab30 id=17 OK                                                                                              |
| 1660309219293 | [2022/08/12 13:00:19] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309219298 | [2022/08/12 13:00:19] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309219298 | [2022/08/12 13:00:19] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [out flush] cb_destroy coro_id=331                                                                                                       |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] task_id=0 reached retry-attempts limit 5/5                                                                                        |
| 1660309219298 | [2022/08/12 13:00:19] [ warn] [engine] chunk '1-1660308985.805729984.flb' cannot be retried: task_id=0, input=forward.3 > output=loki.1                                |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] destroy task=0x7f35e0a4a200 (task_id=0)                                                                                           |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=1 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=2 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=3 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=4 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=5 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=6 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=7 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=8 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=9 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=10 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=11 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=12 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=13 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=14 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=15 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=16 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=17 assigned to thread #0                                                                                    |
| 1660309220465 | AWS for Fluent Bit Container Image Version 2.27.0                                                                                                                      |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
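
To make the retry pattern in a log like this easier to read, the scheduled retries can be pulled out with a short script. This is only a sketch: it assumes the lines follow the format shown above, and the names `RETRY_RE` and `retry_schedule` are made up for illustration.

```python
import re
from datetime import datetime

# Matches the engine warning that Fluent Bit logs when a flush fails
# and a retry is scheduled (format assumed from the log excerpt above).
RETRY_RE = re.compile(
    r"\[(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\] \[ ?warn\] \[engine\] "
    r"failed to flush chunk '(?P<chunk>[^']+)', retry in (?P<delay>\d+) seconds: "
    r"task_id=(?P<task>\d+)"
)


def retry_schedule(lines):
    """Return (timestamp, task_id, delay_seconds) for each scheduled retry."""
    events = []
    for line in lines:
        m = RETRY_RE.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y/%m/%d %H:%M:%S")
            events.append((ts, int(m.group("task")), int(m.group("delay"))))
    return events


# Three lines copied from the log above (table padding trimmed).
sample = [
    "| 1660308990296 | [2022/08/12 12:56:30] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 6 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |",
    "| 1660308996295 | [2022/08/12 12:56:36] [error] [tls] error: unexpected EOF |",
    "| 1660308996295 | [2022/08/12 12:56:36] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 15 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |",
]

for ts, task_id, delay in retry_schedule(sample):
    print(f"{ts:%H:%M:%S} task_id={task_id} retry in {delay}s")
```

Run against the full log, this would list each retry scheduled for task_id=0 between 12:56:30 and 12:58:08, before the retry limit is reached at 13:00:19.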

Fluent Bit Version Info

We are running the current stable version; the log above reports AWS for Fluent Bit Container Image Version 2.27.0.

Cluster Details

ECS cluster using EC2 (not Fargate)
Load balancer
Sidecar deployment

About this issue

State: open
Created 2 years ago
Reactions: 1
Comments: 21 (18 by maintainers)

Most upvoted comments

Any updates? How can I help to move this forward? We are still using the debug container because the non-debug one fails intermittently.

hankwallace on Jan 15, 2024

@wick02 Yeah, I think a different issue is needed for this. In it, please also explain why the upstream Loki output doesn’t work for you: https://docs.fluentbit.io/manual/pipeline/outputs/loki

PettitWesley on Mar 3, 2023