aws-for-fluent-bit: Stops sending logs after connection/tls failure

Describe the question/issue

The aws-for-fluent-bit log router stops sending logs to Loki through an HTTPS proxy after a connection/TLS failure. The container sometimes exits shortly afterward, with nothing in its log to indicate why. This restarts the entire ECS task, because the log router container is marked essential=true so that we don’t lose logs for a long period of time.

I have searched the issues here and in the fluent-bit repo. I have also searched the Grafana and Fluent slack communities.

Configuration

Deployment:

  • ECS cluster (not Fargate)
  • Primary container is using awsfirelens to route logs to the aws-for-fluent-bit container
  • The aws-for-fluent-bit container is routing logs to a loki task in the same cluster
  • The loki task is using S3 for storage

Relevant parts of the ECS task definition. The first container is the web app and the second is the log router:

{
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "loki",
          "host": "loki-qa.elimuinformatics.com",
          "port": "443",
          "tls": "On",
          "http_user": "loki",
          "http_passwd": "<hidden>",
          "net.keepalive": "false",
          "workers": "1",
          "Retry_Limit": "5",
          "labels": "env=qa,service=sapphire-web",
          "label_keys": "$ecs_task_definition,$ec2_instance_id",
          "remove_keys": "container_id,container_name,ecs_task_arn",
          "line_format": "key_value"
        }
      },
      "memory": 4096,
      "memoryReservation": 256,
      "dependsOn": [
        {
          "containerName": "sapphire-web-log-router",
          "condition": "HEALTHY"
        }
      ],
      "essential": true,
      "name": "sapphire-web-qa"
    },
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/aws/ecs/firelens/sapphire-qa",
          "awslogs-region": "us-west-2",
          "awslogs-create-group": "true",
          "awslogs-stream-prefix": "firelens"
        }
      },
      "environment": [
        {
          "name": "FLB_LOG_LEVEL",
          "value": "debug"
        }
      ],
      "memoryReservation": 50,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "config-file-type": "file",
          "enable-ecs-log-metadata": "true",
          "config-file-value": "/extra.conf"
        }
      },
      "healthCheck": {
        "retries": 2,
        "command": [
          "CMD-SHELL",
          "curl -f http://127.0.0.1:2020/api/v1/uptime || exit 1"
        ],
        "timeout": 5,
        "interval": 10,
        "startPeriod": 30
      },
      "essential": true,
      "name": "sapphire-web-log-router"
    }
  ]
}
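
A note on the healthCheck command above: it polls /api/v1/uptime, which as far as I understand answers whenever the built-in HTTP server is up. With Health_Check On (set in extra.conf), Fluent Bit also serves /api/v1/health, which should return HTTP 200 while healthy and HTTP 500 once the configured thresholds are exceeded. The Python sketch below shows a probe against that endpoint. The function name probe and the injectable fetch parameter are illustrative assumptions, not part of Fluent Bit or the aws-for-fluent-bit image.

```python
import urllib.request


def probe(url, timeout=5.0, fetch=urllib.request.urlopen):
    """Return True only if `url` answers with HTTP 200.

    `fetch` is a hypothetical hook so the function can be exercised
    without a running Fluent Bit; by default it is urllib's urlopen.
    """
    try:
        with fetch(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urlopen raises HTTPError/URLError (both OSError subclasses) for
        # non-2xx responses, refused connections, and timeouts.
        return False
```

Calling probe("http://127.0.0.1:2020/api/v1/health") from inside the log router container would mirror the existing curl check while pointing at the health endpoint instead of the uptime endpoint.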

The extra.conf file contains:

[INPUT]
    Name forward
    unix_path /var/run/fluent.sock
    Mem_Buf_Limit 2MB

[SERVICE]
    Flush 5
    Grace 30
    # Health check setup
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_PORT 2020
    Health_Check On
    HC_Errors_Count 5
    HC_Retry_Failure_Count 5
    HC_Period 30
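
To make the health check settings above concrete, here is a minimal Python sketch of how I understand the thresholds from the Fluent Bit monitoring documentation: the instance counts as unhealthy when either the output error count or the retry failure count exceeds its limit within HC_Period. The sliding-window counting and every name in the code are assumptions for illustration; the real implementation may bucket time differently. The offsets are the seconds between the "failed to flush chunk" warnings in the log output in this report.

```python
from dataclasses import dataclass


@dataclass
class HealthConfig:
    hc_errors_count: int = 5         # HC_Errors_Count from extra.conf
    hc_retry_failure_count: int = 5  # HC_Retry_Failure_Count from extra.conf
    hc_period: int = 30              # HC_Period from extra.conf, in seconds


def is_unhealthy(errors, retry_failures, cfg):
    """Assumed rule: unhealthy if either count exceeds its threshold."""
    return (errors > cfg.hc_errors_count
            or retry_failures > cfg.hc_retry_failure_count)


def max_events_in_window(seconds, period):
    """Largest number of events inside any window shorter than `period`."""
    events = sorted(seconds)
    best, start = 0, 0
    for end, t in enumerate(events):
        # Drop events that are `period` seconds or more older than `t`.
        while t - events[start] >= period:
            start += 1
        best = max(best, end - start + 1)
    return best


# Flush failures at 12:56:30, 12:56:36, 12:56:51, 12:57:02 and 12:58:08,
# expressed as seconds after the first failure.
FLUSH_FAILURES = [0, 6, 21, 32, 98]

cfg = HealthConfig()
worst = max_events_in_window(FLUSH_FAILURES, cfg.hc_period)
unhealthy = is_unhealthy(worst, 0, cfg)
```

With these values the busiest 30-second window holds 3 failures, which stays under the limit of 5, so under this reading the health status would not flip to unhealthy during the outage shown.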

Fluent Bit Log Output

Here’s a partial log covering the point where the error starts: the container fails to send any more logs (even on retries) and then exits, killing the entire task because the log router is marked essential=true.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   timestamp   |                                                                                message                                                                                 |
|---------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <snip>                                                                                                                                                                                 |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [task] created task=0x7f35e0a4a200 id=0 OK                                                                                               |
| 1660308990289 | [2022/08/12 12:56:30] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660308990296 | [2022/08/12 12:56:30] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660308990296 | [2022/08/12 12:56:30] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [retry] new retry created for task_id=0 attempts=1                                                                                       |
| 1660308990296 | [2022/08/12 12:56:30] [debug] [out flush] cb_destroy coro_id=326                                                                                                       |
| 1660308990296 | [2022/08/12 12:56:30] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 6 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)   |
| 1660308991749 | [2022/08/12 12:56:31] [debug] [input chunk] update output instances with new chunk size diff=638                                                                       |
| 1660308992333 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660308992689 | [2022/08/12 12:56:32] [debug] [input chunk] update output instances with new chunk size diff=672                                                                       |
| 1660308993279 | [2022/08/12 12:56:33] [debug] [input chunk] update output instances with new chunk size diff=695                                                                       |
| 1660308995289 | [2022/08/12 12:56:35] [debug] [task] created task=0x7f35e0a4a430 id=1 OK                                                                                               |
| 1660308996289 | [2022/08/12 12:56:36] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660308996295 | [2022/08/12 12:56:36] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660308996295 | [2022/08/12 12:56:36] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [out flush] cb_destroy coro_id=327                                                                                                       |
| 1660308996295 | [2022/08/12 12:56:36] [debug] [retry] re-using retry for task_id=0 attempts=2                                                                                          |
| 1660308996295 | [2022/08/12 12:56:36] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 15 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660308999947 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=657                                                                       |
| 1660308999978 | [2022/08/12 12:56:39] [debug] [input chunk] update output instances with new chunk size diff=651                                                                       |
| 1660309000289 | [2022/08/12 12:56:40] [debug] [task] created task=0x7f35e0a4a4a0 id=2 OK                                                                                               |
| 1660309011289 | [2022/08/12 12:56:51] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309011300 | [2022/08/12 12:56:51] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309011300 | [2022/08/12 12:56:51] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309011301 | [2022/08/12 12:56:51] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [out flush] cb_destroy coro_id=328                                                                                                       |
| 1660309011301 | [2022/08/12 12:56:51] [debug] [retry] re-using retry for task_id=0 attempts=3                                                                                          |
| 1660309011301 | [2022/08/12 12:56:51] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 11 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660309016156 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309016364 | [2022/08/12 12:56:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309020289 | [2022/08/12 12:57:00] [debug] [task] created task=0x7f35e0a4a510 id=3 OK                                                                                               |
| 1660309022289 | [2022/08/12 12:57:02] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309022299 | [2022/08/12 12:57:02] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309022299 | [2022/08/12 12:57:02] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [out flush] cb_destroy coro_id=329                                                                                                       |
| 1660309022299 | [2022/08/12 12:57:02] [debug] [retry] re-using retry for task_id=0 attempts=4                                                                                          |
| 1660309022299 | [2022/08/12 12:57:02] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 66 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1)  |
| 1660309022809 | [2022/08/12 12:57:02] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309025290 | [2022/08/12 12:57:05] [debug] [task] created task=0x7f35e0a4a580 id=4 OK                                                                                               |
| 1660309046143 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309046357 | [2022/08/12 12:57:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309050289 | [2022/08/12 12:57:30] [debug] [task] created task=0x7f35e0a4a5f0 id=5 OK                                                                                               |
| 1660309053057 | [2022/08/12 12:57:33] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309055289 | [2022/08/12 12:57:35] [debug] [task] created task=0x7f35e0a4a660 id=6 OK                                                                                               |
| 1660309076181 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309076364 | [2022/08/12 12:57:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309080289 | [2022/08/12 12:58:00] [debug] [task] created task=0x7f35e0a4a6d0 id=7 OK                                                                                               |
| 1660309080963 | [2022/08/12 12:58:00] [debug] [input chunk] update output instances with new chunk size diff=546                                                                       |
| 1660309082603 | [2022/08/12 12:58:02] [debug] [input chunk] update output instances with new chunk size diff=651                                                                       |
| 1660309083323 | [2022/08/12 12:58:03] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309085289 | [2022/08/12 12:58:05] [debug] [task] created task=0x7f35e0a4a740 id=8 OK                                                                                               |
| 1660309088289 | [2022/08/12 12:58:08] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309088298 | [2022/08/12 12:58:08] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309088298 | [2022/08/12 12:58:08] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [out flush] cb_destroy coro_id=330                                                                                                       |
| 1660309088298 | [2022/08/12 12:58:08] [debug] [retry] re-using retry for task_id=0 attempts=5                                                                                          |
| 1660309088298 | [2022/08/12 12:58:08] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 131 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |
| 1660309088644 | [2022/08/12 12:58:08] [debug] [input chunk] update output instances with new chunk size diff=972                                                                       |
| 1660309090290 | [2022/08/12 12:58:10] [debug] [task] created task=0x7f35e0a4a7b0 id=9 OK                                                                                               |
| 1660309106169 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309106381 | [2022/08/12 12:58:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309110289 | [2022/08/12 12:58:30] [debug] [task] created task=0x7f35e0a4a820 id=10 OK                                                                                              |
| 1660309113599 | [2022/08/12 12:58:33] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309115289 | [2022/08/12 12:58:35] [debug] [task] created task=0x7f35e0a4a890 id=11 OK                                                                                              |
| 1660309136181 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309136423 | [2022/08/12 12:58:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309140289 | [2022/08/12 12:59:00] [debug] [task] created task=0x7f35e0a4a900 id=12 OK                                                                                              |
| 1660309144188 | [2022/08/12 12:59:04] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309145290 | [2022/08/12 12:59:05] [debug] [task] created task=0x7f35e0a4a970 id=13 OK                                                                                              |
| 1660309166174 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309166398 | [2022/08/12 12:59:26] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309170289 | [2022/08/12 12:59:30] [debug] [task] created task=0x7f35e0a4a9e0 id=14 OK                                                                                              |
| 1660309174493 | [2022/08/12 12:59:34] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309175290 | [2022/08/12 12:59:35] [debug] [task] created task=0x7f35e0a4aa50 id=15 OK                                                                                              |
| 1660309196186 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309196409 | [2022/08/12 12:59:56] [debug] [input chunk] update output instances with new chunk size diff=497                                                                       |
| 1660309200289 | [2022/08/12 13:00:00] [debug] [task] created task=0x7f35e0a4aac0 id=16 OK                                                                                              |
| 1660309204764 | [2022/08/12 13:00:04] [debug] [input chunk] update output instances with new chunk size diff=477                                                                       |
| 1660309205289 | [2022/08/12 13:00:05] [debug] [task] created task=0x7f35e0a4ab30 id=17 OK                                                                                              |
| 1660309219293 | [2022/08/12 13:00:19] [debug] [output:loki:loki.1] task_id=0 assigned to thread #0                                                                                     |
| 1660309219298 | [2022/08/12 13:00:19] [error] [tls] error: unexpected EOF                                                                                                              |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [upstream] connection #75 failed to loki-qa.elimuinformatics.com:443                                                                     |
| 1660309219298 | [2022/08/12 13:00:19] [error] [output:loki:loki.1] no upstream connections available                                                                                   |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [out flush] cb_destroy coro_id=331                                                                                                       |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] task_id=0 reached retry-attempts limit 5/5                                                                                        |
| 1660309219298 | [2022/08/12 13:00:19] [ warn] [engine] chunk '1-1660308985.805729984.flb' cannot be retried: task_id=0, input=forward.3 > output=loki.1                                |
| 1660309219298 | [2022/08/12 13:00:19] [debug] [task] destroy task=0x7f35e0a4a200 (task_id=0)                                                                                           |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=1 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=2 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=3 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=4 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=5 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=6 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=7 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=8 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=9 assigned to thread #0                                                                                     |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=10 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=11 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=12 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=13 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=14 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=15 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=16 assigned to thread #0                                                                                    |
| 1660309220289 | [2022/08/12 13:00:20] [debug] [output:loki:loki.1] task_id=17 assigned to thread #0                                                                                    |
| 1660309220465 | AWS for Fluent Bit Container Image Version 2.27.0                                                                                                                      |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
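
For anyone trying to compare their own logs against this one, the following Python sketch pulls the retry schedule and the dropped chunks out of a log in this format. It is only a reading aid under stated assumptions: the regular expressions are written against the exact message wording shown above, and the names summarize_retries and SAMPLE are hypothetical. The sample lines are copied from the log above with the table padding trimmed.

```python
import re
from datetime import datetime

# "[YYYY/MM/DD HH:MM:SS] [level] message" as printed by Fluent Bit above.
LINE_RE = re.compile(
    r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\] \[\s*(\w+)\] (.*)")
RETRY_RE = re.compile(r"failed to flush chunk '([^']+)', retry in (\d+) seconds")
DROP_RE = re.compile(r"chunk '([^']+)' cannot be retried")


def summarize_retries(lines):
    """Return (retries, dropped).

    retries maps chunk name -> list of (timestamp, delay_seconds);
    dropped is a list of (timestamp, chunk name).
    """
    retries, dropped = {}, []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # table borders, <snip>, image version banner, etc.
        ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
        msg = m.group(3)
        r = RETRY_RE.search(msg)
        if r:
            retries.setdefault(r.group(1), []).append((ts, int(r.group(2))))
            continue
        d = DROP_RE.search(msg)
        if d:
            dropped.append((ts, d.group(1)))
    return retries, dropped


SAMPLE = [
    "| 1660308990296 | [2022/08/12 12:56:30] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 6 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |",
    "| 1660308996295 | [2022/08/12 12:56:36] [ warn] [engine] failed to flush chunk '1-1660308985.805729984.flb', retry in 15 seconds: task_id=0, input=forward.3 > output=loki.1 (out_id=1) |",
    "| 1660309219298 | [2022/08/12 13:00:19] [ warn] [engine] chunk '1-1660308985.805729984.flb' cannot be retried: task_id=0, input=forward.3 > output=loki.1 |",
]
```

Running summarize_retries over the full log groups the delays per chunk and lists which chunks hit the Retry_Limit of 5 and were discarded.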

Fluent Bit Version Info

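We are running the current stable version. The startup line at the end of the log above reports AWS for Fluent Bit Container Image Version 2.27.0.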

Cluster Details

  • ECS cluster using EC2 (not Fargate)
  • Load balancer
  • Sidecar deployment

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 21 (18 by maintainers)

Most upvoted comments

Any updates? How can I help to move this forward? We are still using the debug container because the non-debug one fails intermittently.

@wick02 Yeah, I think a different issue is needed for this. In it, please also explain why the upstream Loki output doesn’t work for you: https://docs.fluentbit.io/manual/pipeline/outputs/loki