fluent-bit: fluentbit simply stops when it reaches 2048 tasks
Bug Report
When the output http server is offline for a while, fluentbit simply stops after it reaches 2048 tasks, printing only the message re-schedule retry=XXXXXXXXX XXXX in the next 1 seconds. Even after the http output server comes back online, it does not resume sending the logs (neither from tail nor from storage.backlog).
Even with hot_reload enabled, sending the SIGHUP signal does not reload/restart the fluentbit process, because the service only shuts down once all remaining tasks are flushed, but the tasks are not doing anything; they are dead.
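For context on why the tasks pile up: with retry_limit: no_limits (as in the config below), every failed chunk is rescheduled indefinitely, so the task count can only grow while the destination is down. A minimal sketch of the alternative, using the documented retry_limit output option (the value 5 is an arbitrary illustration, not a recommendation; it discards a chunk after five failed retries rather than keeping it as a scheduled task forever):

pipeline:
  outputs:
    - name: http
      match: "*"
      host: localhost
      port: 443
      # illustrative only: bound retries so failed chunks are eventually
      # dropped instead of accumulating as rescheduled tasks
      retry_limit: 5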
I’m using the latest version, v2.2.2, and this is my config:
service:
  flush: 5
  hot_reload: on
  http_server: on
  log_level: warn
  scheduler.cap: 300
  storage.path: /data/buffer/
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 256M
  storage.delete_irrecoverable_chunks: on

pipeline:
  inputs:
    - name: tail
      db: /data/logs.db
      refresh_interval: 5
      read_from_head: true
      buffer_max_size: 512K
      buffer_chunk_size: 256K
      static_batch_size: 256M
      storage.type: filesystem
      multiline.parser: docker, cri
      tag: <namespace>.<workload>.<container>
      tag_regex: (?<workload>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container>.+)-([a-z0-9]{64})\.log$
      path: /var/log/containers/*_kelvin-admin_*.log,/var/log/containers/*_kelvin_*.log,/var/log/containers/*_app_*.log

  filters:
    - name: record_modifier
      match: "*"
      remove_key: _p
    - name: record_modifier
      match: "*"
      remove_key: stream
    - name: modify
      match: "*"
      add: cluster ${KELVIN_CLUSTER_NAME}
    - name: modify
      match: "*"
      add: node ${KUBERNETES_NODE_NAME}
    - name: lua
      match: "*"
      call: tag_fields
      code: |
        function tag_fields(tag, timestamp, record)
          -- split the tag (<namespace>.<workload>.<container>) on dots
          local tags = {}
          for t in string.gmatch(tag, "[^%.]+") do
            table.insert(tags, t)
          end
          record["namespace"] = tags[1]
          record["pod"] = tags[2]
          if record["namespace"] == "app" then
            record["workload"] = string.gsub(tags[2], "-0$", "")
          end
          record["container"] = tags[3]
          return 1, timestamp, record
        end
    - name: lua
      match: "*"
      call: log_fields
      code: |
        function log_fields(tag, timestamp, record)
          local cjson = require("cjson")
          local log = record["log"]
          -- if the log line is JSON, lift "level" and "logger" to top-level keys
          local status, parsed_log = pcall(cjson.decode, log)
          if status and type(parsed_log) == "table" then
            if parsed_log["level"] then
              record["level"] = string.lower(parsed_log["level"])
            end
            if parsed_log["logger"] then
              record["logger"] = parsed_log["logger"]
            end
          end
          return 1, timestamp, record
        end

  outputs:
    - name: http
      match: "*"
      host: localhost
      port: 443
      workers: 5
      tls: on
      tls.verify: off
      compress: gzip
      format: json_lines
      net.keepalive: off
      retry_limit: no_limits
      net.connect_timeout: 45
      log_response_payload: false
      storage.total_limit_size: 5G
About this issue
- State: closed
- Created 5 months ago
- Comments: 15 (4 by maintainers)
The latest version of pull request #8601 resolves all the issues we have observed earlier (since 1.6). High load sent via the tcp input plugin to the opensearch output plugin works without issues.
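For reference, a minimal sketch of the kind of pipeline that comment describes, a tcp input feeding the opensearch output; the listen address, port, host, and index below are illustrative assumptions, not the commenter's actual setup:

pipeline:
  inputs:
    - name: tcp
      # assumed listener settings for illustration
      listen: 0.0.0.0
      port: 5170
      format: json
  outputs:
    - name: opensearch
      match: "*"
      # hypothetical endpoint and index
      host: opensearch.example.internal
      port: 9200
      index: fluent-bit
      suppress_type_name: on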
I debugged the issue and this is what I found:
it endlessly prints re-schedule retry=XXXXXXXXX XXXX in the next x seconds.
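The wait printed in that message comes from the retry scheduler's exponential backoff, which is bounded in the service section; a minimal sketch, assuming the documented scheduler.base and scheduler.cap options (the config above already sets scheduler.cap: 300):

service:
  # the first retry waits around scheduler.base seconds; each further
  # retry backs off exponentially, never exceeding scheduler.cap seconds
  scheduler.base: 5
  scheduler.cap: 300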