fluent-bit: Fluent Bit stops sending logs to Loki when any of the outputs goes offline

Bug Report

Describe the bug
When any one of the Elasticsearch output plugins goes offline, no more logs are available to query from the Loki instance (running in a container). When the offline output plugin is commented out and Fluent Bit is restarted, log delivery resumes.

To Reproduce

  • Use a configuration similar to the one below.
  • Initially some logs are produced for a few minutes; after 20-30 minutes, no new logs are seen.
[SERVICE]
    Flush 5
    Daemon Off
    Log_Level info
    Parsers_File parsers-eos.conf
    Plugins_File plugins.conf

    HTTP_Server Off
    HTTP_Listen 0.0.0.0
    HTTP_Port   2020

[INPUT]
    Name        forward
    Listen      0.0.0.0
    Port        24224
    Chunk_Size  32
    Buffer_Size 64

[INPUT]
    Name              systemd
    Path              /run/log/journal/
    Read_From_Tail    On 
    Strip_Underscores On
    Mem_Buf_Limit     10MB
    Tag journal

[OUTPUT]
    name            loki
    match           *
    host            localhost
    port            3100
    labels          job=fluentbit

# Elasticsearch running on Azure
[OUTPUT]
    Name  es
    Match *
    Host  ..cloudapp.azure.com
    Port  9200
    Type  docker
    Logstash_Format On
    Time_Key        @fbTimestamp
    Trace_Output  Off
    Trace_Error   On

# Elasticsearch running on local machine
[OUTPUT]
    Name  es
    Match *
    Host  192.168.1.1
    Port  9200
    Type  docker
    Logstash_Format On
    Time_Key        @fbTimestamp
    Retry_Limit     1
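
One likely factor in the behavior described above: all three outputs consume the same chunks from the shared in-memory buffers, so while the offline Elasticsearch destination holds chunks for retry against Mem_Buf_Limit, the inputs can get paused and the Loki output also stops receiving new data. Below is a rough mitigation sketch, not a fix for the underlying issue, using the standard storage.* options; the paths and size limits are placeholders, and the keys omitted here stay as in the configuration above.

[SERVICE]
    Flush                     5
    # Buffer chunks on disk so one stalled destination does not exhaust memory
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 16M

[INPUT]
    Name         systemd
    Tag          journal
    # Keep ingesting even while one destination is stuck retrying
    storage.type filesystem

# Elasticsearch destination that may go offline
[OUTPUT]
    Name                     es
    Match                    *
    # Cap the on-disk queue for this destination and give up on a chunk
    # after a few attempts instead of holding it indefinitely
    storage.total_limit_size 64M
    Retry_Limit              5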

Expected behavior
Logs should keep flowing to the connected Loki output even when another output is offline.

Your Environment

  • Version used: Fluent Bit v1.6.9, grafana/loki:2.1.0
  • Operating System and version: Linux aarch64

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 24 (8 by maintainers)

Most upvoted comments

Hi, still happening with fluent-bit 1.9.5 and loki 2.5.0.

Config:

[OUTPUT]
    name                   loki
    match                  *
    labels                 namespace=$kubernetes['namespace_name'],pod=$kubernetes['pod_name'],container=$kubernetes['container_name'],app=$kubernetes['labels']['k8s-app'] 
    Host                   loki
    Port                   3100
    auto_kubernetes_labels Off
    net.connect_timeout    120
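
Not a fix, but when the loki output appears to stop it is worth confirming whether Fluent Bit is still retrying or has given up. A small sketch: enable the built-in HTTP server if it is not already on (port 2020 is the usual default; exact field names may differ slightly between versions).

[SERVICE]
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port   2020

With that enabled, http://127.0.0.1:2020/api/v1/metrics exposes per-output counters such as retries, retries_failed and errors, which makes it easier to tell a permanently stuck output from one that is quietly backing off.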

This issue was closed because it has been stalled for 5 days with no activity.

I believe this should be reopened and looked at thoroughly

This issue was closed because it has been stalled for 5 days with no activity.

Here is my fluent-bit conf:

[INPUT]
  Name             tail
  Multiline.parser cri
  Tag              processing.log.*
  Path             /var/log/containers/*.log
  DB               /var/log/fluentbit_processing_log.db
  DB.Sync          Normal
  Rotate_Wait      10
  Buffer_Max_Size  1MB
  Mem_Buf_Limit    50MB
  storage.type     filesystem
[...]
[OUTPUT]
  Name loki
  Match processing.log.*
  Labels log_type=container,pod=$pod_name,namespace=$namespace_name,node=$host,container=$container_name,container_image=$container_image
  Host loki-distributor.logging.svc.cluster.local
  Remove_keys container_hash,container_image,container_name,docker_id,host,namespace_name,pod_id,pod_name,pod_name
  Drop_single_key true
  Line_format key_value
  Retry_Limit     4
  Port 3100

The 500 errors happen somewhat randomly (probably linked to https://github.com/grafana/loki/issues/6227) and make the output stop permanently. If I delete the loki service for a few minutes, fluent-bit throws:

[ warn] [engine] failed to flush chunk '1-1653901354.863465632.flb', retry in 11 seconds: task_id=27, input=tail.0 > output=loki.0 (out_id=0)
[error] [output:loki:loki.0] loki-distributor.logging.svc.cluster.local:3100, HTTP status=504 
[error] [output:loki:loki.1] no upstream connections available

and the plugin does not recover. 400 errors also happen for different reasons, but the loki output plugin recovers afterwards.
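
One note on the config above, separate from the reconnection problem itself: with Retry_Limit 4, a chunk that fails four flushes in a row against a 500/504-returning distributor is dropped for good, which can also look like the output "stopping". A rough sketch of a more forgiving retry setup, assuming the scheduler.* service options available in recent releases; the values are illustrative, and the rest of the output stays as posted above.

[SERVICE]
  # Bound the exponential backoff so retries keep firing at a predictable rate
  scheduler.base 5
  scheduler.cap  60

[OUTPUT]
  Name        loki
  Match       processing.log.*
  # Retry failed chunks indefinitely instead of dropping them after 4 attempts
  Retry_Limit False

This does not explain why the plugin fails to re-establish connections after the 504s, but it keeps the retry budget from being exhausted while Loki is returning 5xx.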

Keeping this up because the issue is still present

Actually no, it's working; it has nothing to do with the exponential retry strategy values, but after a moment it reconnects successfully.

Same here

I second this; experiencing the same issue with Loki.

Still interested in this issue

still interested in this issue