fluent-bit: fluentbit simply stops when it reaches 2048 tasks

Bug Report

When the output http server is offline for a while, once fluentbit reaches 2048 tasks it simply stops, printing only the message re-schedule retry=XXXXXXXXX XXXX in the next 1 seconds. Even after the http output server comes back online, it does not start sending the logs again (neither from tail nor from storage.backlog).

Even with hot_reload enabled, sending the SIGHUP signal does not reload/restart the fluentbit process, because the service only shuts down once all remaining tasks are flushed, but those tasks are not doing anything; they are dead.

I’m using the latest v2.2.2 version and this is my config:

    service:
      flush: 5
      hot_reload: on
      http_server: on
      log_level: warn
      scheduler.cap: 300
      storage.path: /data/buffer/
      storage.max_chunks_up: 256
      storage.backlog.mem_limit: 256M
      storage.delete_irrecoverable_chunks: on

    pipeline:
      inputs:
        - name: tail
          db: /data/logs.db
          refresh_interval: 5
          read_from_head: true
          buffer_max_size: 512K
          buffer_chunk_size: 256K
          static_batch_size: 256M
          storage.type: filesystem
          multiline.parser: docker, cri
          tag: <namespace>.<workload>.<container>
          tag_regex: (?<workload>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container>.+)-([a-z0-9]{64})\.log$
          path: /var/log/containers/*_kelvin-admin_*.log,/var/log/containers/*_kelvin_*.log,/var/log/containers/*_app_*.log

      filters:
        - name: record_modifier
          match: "*"
          remove_key: _p

        - name: record_modifier
          match: "*"
          remove_key: stream

        - name: modify
          match: "*"
          add: cluster ${KELVIN_CLUSTER_NAME}

        - name: modify
          match: "*"
          add: node ${KUBERNETES_NODE_NAME}

        - name: lua
          match: "*"
          call: tag_fields
          code: |
            function tag_fields(tag, timestamp, record)
              tags = {}

              for t in string.gmatch(tag, "[^%.]+") do
                table.insert(tags, t)
              end

              record["namespace"] = tags[1]
              record["pod"] = tags[2]

              if record["namespace"] == "app" then
                record["workload"] = string.gsub(tags[2], "-0$", "")
              end

              record["container"] = tags[3]

              return 1, timestamp, record
            end

        - name: lua
          match: "*"
          call: log_fields
          code: |
            function log_fields(tag, timestamp, record)
              cjson = require("cjson")

              log = record["log"]
              status, parsed_log = pcall(cjson.decode, log)

              if status and type(parsed_log) == "table" then
                if parsed_log["level"] then
                  record["level"] = string.lower(parsed_log["level"])
                end
                if parsed_log["logger"] then
                  record["logger"] = parsed_log["logger"]
                end
              end
            
              return 1, timestamp, record
            end

      outputs:
        - name: http
          match: "*"
          host: localhost
          port: 443
          workers: 5
          tls: on
          tls.verify: off
          compress: gzip
          format: json_lines
          net.keepalive: off
          retry_limit: no_limits
          net.connect_timeout: 45
          log_response_payload: false
          storage.total_limit_size: 5G

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 15 (4 by maintainers)

Most upvoted comments

The latest version of pull request #8601 resolves all the issues we had observed earlier (since 1.6). High load sent via the tcp input plugin to the opensearch output plugin works without issues.

I debugged the issue and this is what I found:

  • Fluentbit receives chunks and assigns each arriving chunk a task, until the 2048-entry task map is filled.
  • Once all 2048 tasks are busy and a new chunk arrives, fluentbit fails at this line. Additionally, fluentbit doesn’t release the chunk from memory (when filesystem storage is in use), so the chunk permanently occupies one of the storage.max_chunks_up slots. <- BUG
  • As new chunks arrive, fluentbit keeps failing to assign them a task and fills up all the space in memory (storage.max_chunks_up is maxed out).
  • When the chunks that already have a task assigned try to do their job, they first have to be brought back into memory. However, since the new chunks have taken all the memory slots, they endlessly fail with the message re-schedule retry=XXXXXXXXX XXXX in the next x seconds. A simplified sketch of this state follows the list.
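
For illustration, here is a minimal Lua sketch of the behaviour described in the list above. It is not fluent-bit source code: MAX_TASKS, MAX_CHUNKS_UP, chunks_up and the function names are hypothetical stand-ins for the task map and the storage.max_chunks_up accounting.

    -- Hypothetical model of the reported failure mode; this is not
    -- fluent-bit source code, just the described state machine in miniature.
    local MAX_TASKS     = 2048  -- size of the task map
    local MAX_CHUNKS_UP = 256   -- storage.max_chunks_up

    local tasks_in_flight = 0
    local chunks_up       = 0   -- chunks currently loaded in memory

    -- A new chunk arrives and is loaded into memory.
    local function chunk_arrives()
      chunks_up = chunks_up + 1
    end

    -- Try to assign a flush task to the new chunk.
    local function assign_task()
      if tasks_in_flight >= MAX_TASKS then
        -- Reported bug: task creation fails but the chunk is never
        -- released, so it keeps one of the max_chunks_up slots forever.
        return false
      end
      tasks_in_flight = tasks_in_flight + 1
      return true
    end

    -- A retry of an existing task needs its chunk back in memory first.
    local function retry_task()
      if chunks_up >= MAX_CHUNKS_UP then
        -- All slots are held by orphaned chunks, so the retry can only be
        -- re-scheduled: the endless "re-schedule retry=..." message.
        return false
      end
      return true
    end

    -- Once the output is down long enough, every chunk_arrives() is
    -- followed by a failed assign_task(), chunks_up hits the cap, and
    -- retry_task() can never succeed again, even after the output recovers.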