fluent-bit: Fluent Bit simply stops when it reaches 2048 tasks
Bug Report
When the output HTTP server is offline for a while and Fluent Bit reaches 2048 tasks, it simply stops, printing only the message `re-schedule retry=XXXXXXXXX XXXX in the next 1 seconds`. Even after the HTTP output server comes back online, it does not start sending the logs again (neither from tail nor from storage.backlog).

Even with `hot_reload` enabled, sending the SIGHUP signal does not reload/restart the Fluent Bit process, because the service will only shut down once all remaining tasks are flushed, but the tasks are not doing anything; they are dead.
I’m using the latest version, v2.2.2, and this is my config:
```yaml
service:
  flush: 5
  hot_reload: on
  http_server: on
  log_level: warn
  scheduler.cap: 300
  storage.path: /data/buffer/
  storage.max_chunks_up: 256
  storage.backlog.mem_limit: 256M
  storage.delete_irrecoverable_chunks: on

pipeline:
  inputs:
    - name: tail
      db: /data/logs.db
      refresh_interval: 5
      read_from_head: true
      buffer_max_size: 512K
      buffer_chunk_size: 256K
      static_batch_size: 256M
      storage.type: filesystem
      multiline.parser: docker, cri
      tag: <namespace>.<workload>.<container>
      tag_regex: (?<workload>[a-z0-9](?:[-a-z0-9]*[a-z0-9])?(?:\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace>[^_]+)_(?<container>.+)-([a-z0-9]{64})\.log$
      path: /var/log/containers/*_kelvin-admin_*.log,/var/log/containers/*_kelvin_*.log,/var/log/containers/*_app_*.log

  filters:
    - name: record_modifier
      match: "*"
      remove_key: _p
    - name: record_modifier
      match: "*"
      remove_key: stream
    - name: modify
      match: "*"
      add: cluster ${KELVIN_CLUSTER_NAME}
    - name: modify
      match: "*"
      add: node ${KUBERNETES_NODE_NAME}
    - name: lua
      match: "*"
      call: tag_fields
      code: |
        function tag_fields(tag, timestamp, record)
          tags = {}
          for t in string.gmatch(tag, "[^%.]+") do
            table.insert(tags, t)
          end
          record["namespace"] = tags[1]
          record["pod"] = tags[2]
          if record["namespace"] == "app" then
            record["workload"] = string.gsub(tags[2], "-0$", "")
          end
          record["container"] = tags[3]
          return 1, timestamp, record
        end
    - name: lua
      match: "*"
      call: log_fields
      code: |
        function log_fields(tag, timestamp, record)
          cjson = require("cjson")
          log = record["log"]
          status, parsed_log = pcall(cjson.decode, log)
          if status and type(parsed_log) == "table" then
            if parsed_log["level"] then
              record["level"] = string.lower(parsed_log["level"])
            end
            if parsed_log["logger"] then
              record["logger"] = parsed_log["logger"]
            end
          end
          return 1, timestamp, record
        end

  outputs:
    - name: http
      match: "*"
      host: localhost
      port: 443
      workers: 5
      tls: on
      tls.verify: off
      compress: gzip
      format: json_lines
      net.keepalive: off
      retry_limit: no_limits
      net.connect_timeout: 45
      log_response_payload: false
      storage.total_limit_size: 5G
```
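For illustration, a minimal standalone harness for the `tag_fields` filter above (not part of the running config; the sample tag is made up) can be run with any Lua interpreter to check the tag-splitting logic in isolation:

```lua
-- Standalone check of the tag_fields filter from the config above.
-- The sample tag "app.web-0.nginx" is invented; inside Fluent Bit the tag
-- comes from tag_regex in the <namespace>.<workload>.<container> form.
function tag_fields(tag, timestamp, record)
  tags = {}
  for t in string.gmatch(tag, "[^%.]+") do
    table.insert(tags, t)
  end
  record["namespace"] = tags[1]
  record["pod"] = tags[2]
  if record["namespace"] == "app" then
    record["workload"] = string.gsub(tags[2], "-0$", "")
  end
  record["container"] = tags[3]
  return 1, timestamp, record
end

local _, _, out = tag_fields("app.web-0.nginx", 0, {})
print(out.namespace, out.pod, out.workload, out.container)
-- expected output: app    web-0    web    nginx
```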
About this issue
- State: closed
- Created 5 months ago
- Comments: 15 (4 by maintainers)
The latest version of pull request #8601 resolves all the issues we have observed earlier (since 1.6). High load via the tcp input plugin sent to the opensearch plugin works without issues.
I debugged the issue and this is what I found: Fluent Bit keeps printing `re-schedule retry=XXXXXXXXX XXXX in the next x seconds` endlessly.
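For context, here is a rough sketch of the capped exponential backoff that produces this message (illustrative only, not Fluent Bit's actual scheduler code; it assumes a `scheduler.base` of 5, the documented default, together with the `scheduler.cap` of 300 from the config above, and the real formula may differ):

```lua
-- Sketch only: approximate shape of the retry back-off behind the
-- "re-schedule retry ... in the next N seconds" messages.
local base, cap = 5, 300   -- assumed scheduler.base default and scheduler.cap from the config
math.randomseed(os.time())

for attempt = 1, 10 do
  -- wait a random number of seconds, bounded by base * 2^attempt and the cap
  local upper = math.min(base * 2 ^ attempt, cap)
  local wait  = math.random(base, upper)
  print(string.format("attempt %d: re-schedule retry in the next %d seconds", attempt, wait))
end
-- With retry_limit: no_limits this re-scheduling never stops while the endpoint
-- is down, and each pending retry keeps its task alive, which is how the
-- 2048-task limit is eventually reached.
```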