fluent-bit: Fluent-Bit fails to (re)start if too many buffers have been accumulated on the filesystem storage
Bug Report
Describe the bug When using filesystem storage and letting Fluent-Bit accumulate a lot of files (i.e. the output is down) until thousands of chunk files have been created, Fluent-Bit is unable to start or restart correctly and loops constantly with errors. This makes the problem worse, because the feedback loop of writing its own errors to the journal and reading them back causes both the td-agent-bit and journal services to consume a lot of CPU.
This seems related to the number of file handles Fluent-Bit has open:
lsof -p `pidof td-agent-bit` | wc -l
2041
Looking at the code, it seems the function cb_queue_chunks in plugins/in_storage_backlog/sb.c tries to load all of the chunks present on disk into memory. Some of the issues I found when Fluent-Bit loads buffers from the filesystem:
- “ctx->mem_limit” is set to FLB_STORAGE_BL_MEM_LIMIT when “storage.backlog.mem_limit” is not configured, which makes the default 100MB, while the documentation (https://docs.fluentbit.io/manual/administration/buffering-and-storage) says the default is 5MB.
- The inner loop in cb_queue_chunks never stops when “total” goes over the “ctx->mem_limit” threshold.
- When I modified the loop to break once “total >= ctx->mem_limit” is reached, cb_queue_chunks is still called every second, but “flb_input_chunk_total_size(in)” always returns “0” even though the previously loaded chunks have not been processed yet, so another batch of up to 100MB of buffers is loaded on every call. This can lead to a situation where either the RAM is exhausted or the open-file-handle limit is reached if too many buffers have accumulated.
- Calling “cio_chunk_down(chunk_instance->chunk);” before “sb_remove_chunk_from_segregated_backlogs(chunk_instance->chunk, ctx);” (Code) to close the handles makes it possible to get through all the buffers, but I noticed that a lot of timers were created by “_mk_event_timeout_create” until that function also started to error because too many had been created. Fluent-Bit did eventually manage to recover, with a lot of errors, but I did not inspect that part enough to understand what it does. (A rough sketch of both changes follows this list.)
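For reference, this is roughly what I experimented with. It is only a sketch, not a proposed patch: apart from cb_queue_chunks, “total”, “ctx->mem_limit”, “chunk_instance”, cio_chunk_down() and sb_remove_chunk_from_segregated_backlogs(), which are the names referenced above from sb.c, the surrounding structure is an assumption and may not match the actual source.
/* Experiment 1: make the inner loop of cb_queue_chunks() stop once the
 * accumulated size of queued chunks reaches the configured backlog budget,
 * instead of queueing every chunk found on disk. */
if (total >= ctx->mem_limit) {
    break;
}
/* Experiment 2: bring the chunk down (which closes its file descriptor)
 * before it is removed from the segregated backlogs, so a file handle is
 * not kept open for every queued chunk (the errno=24 "Too many open files"
 * errors shown below). */
cio_chunk_down(chunk_instance->chunk);
sb_remove_chunk_from_segregated_backlogs(chunk_instance->chunk, ctx);
With experiment 1 alone, the limit is defeated on the next invocation because “flb_input_chunk_total_size(in)” still reports 0, as described above; experiment 2 lets start-up work through all the buffers, but then runs into the “_mk_event_timeout_create” errors.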
To Reproduce Using a configuration that writes buffers to the filesystem, force Fluent-Bit to accumulate a lot of buffers by taking the output down or pointing it at an invalid address.
Example of the bash script used to write to the journal; it is started several times as a background process, each time with a different parameter, to simulate logs from different applications, which the INPUT splits into different buffers and so quickly increases the number of buffers created:
#!/bin/bash
# Continuously write a short line to the systemd journal, tagged with the
# first argument, so the systemd INPUT keeps producing new chunks.
while true
do
    echo "$1 - Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." | systemd-cat -t "$1" -p warning
done
Once a lot of buffers have accumulated (~2000 in my case), restart the td-agent-bit service; during start-up it will register/queue all the buffers until it fails to load any more and then keeps looping indefinitely:
[...]
[error] [storage] cannot open/create /var/log/fluent/buf//systemd.0/1131418-1647664248.725660271.flb
[error] [storage] [cio file] cannot open chunk: systemd.0/1131418-1647664248.725660271.flb
[error] [storage] cannot open/create /var/log/fluent/buf//systemd.0/1131418-1647664248.725660271.flb
[error] [storage] [cio file] cannot open chunk: systemd.0/1131418-1647664248.725660271.flb
[lib/chunkio/src/cio_file.c:432 errno=24] Too many open files
[lib/chunkio/src/cio_file.c:432 errno=24] Too many open files
[error] [storage] cannot open/create /var/log/fluent/buf//systemd.0/1131418-1647664248.725660271.flb
[error] [storage] [cio file] cannot open chunk: systemd.0/1131418-1647664248.725660271.flb
[error] [storage] cannot open/create /var/log/fluent/buf//systemd.0/1131418-1647664248.725660271.flb
[...]
Expected behavior Fluent-Bit should only load as many buffers as the configured storage.max_chunks_up or memory constraints allow, then load more when it becomes possible to do so.
Your Environment
- Version used: td-agent-bit 1.8.12
- Configuration:
[SERVICE]
# Flush records to destinations every 5s
Flush 5
# Run in foreground mode
Daemon Off
# Use 'info' verbosity for Fluent Bit logs
Log_Level info
# Standard parsers & plugins
Parsers_File parsers.conf
Plugins_File plugins.conf
# Enable built-in HTTP server for metrics
# Prometheus metrics: <host>:24231/api/v1/metrics/prometheus
HTTP_Server On
HTTP_Listen 192.168.128.2
HTTP_Port 24231
# Persistent storage path for buffering
storage.path /var/log/fluent/buf/
storage.max_chunks_up 128
[INPUT]
Name systemd
Tag system.journal.*
Path /var/log/journal
DB /var/log/fluent/journald-cursor.db
storage.type filesystem
mem_buf_limit 64M
# BEGIN Elasticsearch output
[OUTPUT] # elasticsearch destination 1
Name es
Match *
Retry_Limit False
Host 192.168.128.200
Port 9200
Index filebeat-7.2.0
tls Off
tls.verify On
# elasticsearch output HTTP_User placeholder
# elasticsearch output HTTP_Passwd placeholder
Type _doc
Generate_ID true
# END Elasticsearch output
- Environment name and version (e.g. Kubernetes? What version?): td-agent-bit service 1.8.12
- Operating System and version: AlmaLinux 8
Additional context This could happen when systems are updated or restarted, or if a crash occurs while a lot of buffers have accumulated.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 22 (10 by maintainers)
Hi @lecaros, This was tested with both 1.8 and 1.9.
The issue is not just the number of chunks but also the total size of the chunks that are already present on disk when Fluent-Bit starts with filesystem storage.
What I saw when debugging with gdb is that the function loading these chunks from disk does check whether it is still under a memory limit, but the calculation of the memory used so far always returned zero. So it loaded all of the chunks into memory, which becomes a problem when there are hundreds of megabytes to gigabytes of accumulated buffers.
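To illustrate, the check I stepped through behaves roughly like this (paraphrased from memory rather than quoted from the source; only flb_input_chunk_total_size() and ctx->mem_limit are actual names from the code, and “in” stands for the input instance):
/* Budget check as observed in gdb: the "memory used so far" side comes from
 * flb_input_chunk_total_size(in), which kept returning 0 even after chunks
 * had been queued, so the limit was never considered reached and every
 * chunk on disk ended up being brought into memory. */
if (flb_input_chunk_total_size(in) < ctx->mem_limit) {
    /* queue / load the next chunk from the filesystem backlog */
}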