fluent-bit with storage enabled completely freezes (due to sockets not being closed properly?)
Bug Report
Describe the bug
Under certain circumstances fluent-bit can completely freeze and stop processing logs. We have seen this behavior for quite some time now, on many pods, and it appears with a certain randomness.
We rely on the storage configuration, and my suspicion is that this bug only appears with storage enabled, probably in combination with higher load and a certain (mis?)configuration (see our config below). It is unfortunately a serious issue for us: storage buffering is supposed to make us more resilient against logging outages, but with this bug it is actually the other way round 😃
I could reproduce the issue by running a log generator that emits 200 messages of 1 kB each per second while denying all outgoing traffic from the fluent-bit pod. After a while, in roughly one out of three attempts, the pod froze and did not process any logs at all.
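For reference, a minimal sketch of that kind of reproduction setup (the generator loop, namespace and label selector below are illustrative assumptions, not literally what we ran):

# 1) hypothetical log generator: ~200 lines of ~1 kB each per second to stdout
PAYLOAD=$(head -c 1000 /dev/zero | tr '\0' 'x')
while true; do
  i=0
  while [ "$i" -lt 200 ]; do
    echo "$(date) $PAYLOAD"
    i=$((i + 1))
  done
  sleep 1
done

# 2) deny all egress from the fluent-bit pods (assumed namespace and labels)
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-fluent-bit-egress
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  policyTypes:
    - Egress
EOF

With egress blocked, fluent-bit keeps retrying its flushes, which seems to be when the connection handling goes wrong.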
Using the fluent-bit debug image, I made an interesting discovery: at some point all sockets end up in CLOSE_WAIT state. Here is my netstat output while it happened:
/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 fluent-bit-jfz6p:60382 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:59954 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:59962 dtag-laas.netic.dk:https ESTABLISHED
[...]
tcp 0 0 fluent-bit-jfz6p:60380 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60054 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60172 dtag-laas.netic.dk:https ESTABLISHED
tcp 10106 0 fluent-bit-jfz6p:59820 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60470 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60370 dtag-laas.netic.dk:https ESTABLISHED
tcp 641 0 fluent-bit-jfz6p:59802 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:59952 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60068 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:2020 10.2.131.169:40492 TIME_WAIT
tcp 0 0 fluent-bit-jfz6p:60638 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60292 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 1 fluent-bit-jfz6p:60614 dtag-laas.netic.dk:https SYN_SENT
tcp 0 0 fluent-bit-jfz6p:60362 dtag-laas.netic.dk:https ESTABLISHED
tcp 11355 0 fluent-bit-jfz6p:60716 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60308 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:59836 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:59882 dtag-laas.netic.dk:https ESTABLISHED
tcp 0 0 fluent-bit-jfz6p:60586 dtag-laas.netic.dk:https ESTABLISHED
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags Type State I-Node Path
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
199
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
7
/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 32 0 fluent-bit-jfz6p:60382 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 32 0 fluent-bit-jfz6p:59954 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 1 0 fluent-bit-jfz6p:59962 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 32 0 fluent-bit-jfz6p:60354 dtag-laas.netic.dk:https CLOSE_WAIT
[...]
tcp 32 0 fluent-bit-jfz6p:60068 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 0 0 fluent-bit-jfz6p:2020 10.2.131.169:40492 TIME_WAIT
tcp 1 0 fluent-bit-jfz6p:60292 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 0 0 fluent-bit-jfz6p:60614 dtag-laas.netic.dk:https ESTABLISHED
tcp 32 0 fluent-bit-jfz6p:59976 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 32 0 fluent-bit-jfz6p:60620 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 32 0 fluent-bit-jfz6p:60362 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 11387 0 fluent-bit-jfz6p:60716 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 1 0 fluent-bit-jfz6p:60298 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 32 0 fluent-bit-jfz6p:59882 dtag-laas.netic.dk:https CLOSE_WAIT
tcp 32 0 fluent-bit-jfz6p:60586 dtag-laas.netic.dk:https CLOSE_WAIT
Note that I stripped a few lines; there were actually around 200 sockets, all in the same state in each snapshot.
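The transition from ESTABLISHED to CLOSE_WAIT is easy to watch with a loop like the following (a sketch assuming the busybox netstat that ships in the debug image; the column layout may differ on other images):

# periodically count TCP sockets per state from inside the pod
while true; do
  date
  netstat -tn 2>/dev/null | awk 'NR>2 {print $6}' | sort | uniq -c
  sleep 10
done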
While that happens, our /flb-storage/ directory usually contains a few hundred chunk files that no longer change, since the pod is frozen.
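A quick way to see that the buffered chunks are stuck (assuming the busybox find in the debug image supports -mmin):

ls /flb-storage/tail.0/ | wc -l              # total number of buffered chunk files
find /flb-storage/tail.0/ -mmin -5 | wc -l   # files touched in the last 5 minutes; stays at 0 once frozen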
The logs of such an affected pod look like this:
[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632705.229524992.flb', retry in 8 seconds: task_id=37, input=tail.0 > output=es.2
[2021/01/14 13:58:38] [error] [upstream] connection #170 to dtag-laas.netic.dk:443 timed out after 10 seconds
[...]
[2021/01/14 13:58:38] [error] [upstream] connection #1271 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [error] [upstream] connection #896 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632704.928894552.flb', retry in 7 seconds: task_id=18, input=tail.0 > output=es.0
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[...]
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.103450536.flb', retry in 10 seconds: task_id=28, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.26765971.flb', retry in 9 seconds: task_id=24, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604215 watch_fd=97 name=/var/log/containers/admin-management-59b684bf76-7dbqk_staging_admin-management-025be4bb58a5958704f1f80e8ed72322c98402e2f756e71708e4d0c56e5b842c.log
[...]
[2021/01/14 13:58:40] [ warn] [engine] failed to flush chunk '1-1610632705.269473234.flb', retry in 9 seconds: task_id=39, input=tail.0 > output=es.0
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604213 watch_fd=98 name=/var/log/containers/skill-e2e-tests-86598c5668-2qxpg_staging_skill-e2e-tests-e1e165a7ed49643b0e645fe5ad7f111d0e0472d1d38ab9875cf43da7210782b4.log
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=208216295 watch_fd=99 name=/var/log/containers/skill-e2e-tests-birthday-5cd88db447-p6x86_prodshadow_skill-e2e-tests-birthday-99aa9d08d371222835ab92a28fcfe0ea3245459c30efa271db31c854bd8b622d.log
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632706.515332169.flb', retry in 10 seconds: task_id=108, input=tail.0 > output=es.2
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.205745559.flb', retry in 11 seconds: task_id=34, input=tail.0 > output=es.0
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.249240345.flb', retry in 9 seconds: task_id=38, input=tail.0 > output=es.0
… with no more logs ever following (= pod froze completely).
I assume this is a bug in fluent-bit: it does not close sockets properly and thus somehow blocks itself. Once I even saw 731 sockets in CLOSE_WAIT within a single pod.
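The leaked connections are also visible as open file descriptors of the fluent-bit process (PID 1 inside the container); a quick sanity check, assuming /proc is readable from the debug shell:

# count socket file descriptors held by the fluent-bit process
ls -l /proc/1/fd | grep -c socket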
Note that I saw both cases: sometimes a pod restart would help, but in most cases even restarting the pod did not. Only when I cleared the storage folder /flb-storage/ and then restarted the pod was there a high chance of recovery.
From my subjective judgement, the issue appears more often when the load is high and fluent-bit has to process many container log files. We discovered the issue for the first time when our backend (Elasticsearch) was offline for a few hours; about 30% of our pods froze.
Screenshots

As seen in the screenshot, there is a short spike of data (disregard the time before the spike; the pod was spawned at the spike) and then no output at all anymore; fluent-bit won't process any more data.
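One way to double-check such a freeze from inside the pod, assuming the built-in HTTP server on port 2020 still answers, is to poll the monitoring endpoints and watch whether the counters still move:

# metrics and storage status of the local fluent-bit instance
wget -qO- http://127.0.0.1:2020/api/v1/metrics
wget -qO- http://127.0.0.1:2020/api/v1/storage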
Your Environment
- Version used: tested with both v1.6.9 and v1.6.10 which both show the bug
- Configuration (ConfigMap in Kubernetes):
custom_parsers.conf: |
  [PARSER]
      Name kube-svh
      Format regex
      Regex ((?<tag>[^.]+)\.)?(?<namespace_name>[^.]+)\.(?<pod_name>[^.]+)\.(?<container_name>[^.]+)\.(?<docker_id>[^.]+)$
fluent-bit.conf: |
  [SERVICE]
      Flush 1
      Daemon Off
      Log_Level info
      Parsers_File parsers.conf
      Parsers_File custom_parsers.conf
      HTTP_Server On
      HTTP_Listen 0.0.0.0
      HTTP_Port 2020
      storage.path /flb-storage/
      storage.sync normal
      storage.checksum off
      storage.max_chunks_up 2048
      storage.backlog.mem_limit 64M
      storage.metrics on
  [INPUT]
      Name tail
      Path /var/log/containers/*.log
      Parser docker
      DB /var/log/flb_kube.db
      DB.Sync Normal
      DB.locking false
      Refresh_Interval 5
      Mem_Buf_Limit 64MB
      Skip_Long_Lines On
      Tag kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
      Tag_Regex (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})
      Buffer_Chunk_Size 32k
      Buffer_Max_Size 4M
      storage.type filesystem
  [FILTER]
      Name kubernetes
      Match kube.*
      Kube_Tag_Prefix kube.
      Kube_URL https://kubernetes.default.svc:443
      Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
      Regex_Parser kube-svh
      Merge_Log Off
      K8S-Logging.Parser On
      K8S-Logging.Exclude On
      Labels Off
      Annotations Off
  [FILTER]
      Name modify
      Match kube.*
      Remove_Wildcard attrs
  [OUTPUT]
      Name es
      Match kube.staging.*
      Host XYZ
      Port 443
      Path /api/v1/ingest/elastic-bulk
      Logstash_Format On
      Retry_Limit False
      Type _doc
      Time_Key @timestamp
      Replace_Dots On
      Logstash_Prefix staging-4
      HTTP_User XYZ
      HTTP_Passwd XYZ
      Buffer_Size False
      tls on
      tls.verify on
      tls.debug 1
      storage.total_limit_size 30G
  [OUTPUT]
      Name es
      Match kube.eventrouter.*
      Host XYZ
      Port 443
      Path /api/v1/ingest/elastic-bulk
      Logstash_Format On
      Retry_Limit False
      Type _doc
      Time_Key @timestamp
      Replace_Dots On
      Logstash_Prefix staging-4
      HTTP_User XYZ
      HTTP_Passwd XYZ
      Buffer_Size False
      tls on
      tls.verify on
      tls.debug 1
      storage.total_limit_size 30G
  [...]
Note: We have a few more output sections that basically follow the pattern shown here (that's why the config above ends with [...]).
Bonus Question: Looking at the configuration, do those values sound reasonable? We were not too sure about storage.max_chunks_up, storage.backlog.mem_limit, and their relation to the buffer sizes specified in the tail input (see the rough arithmetic after the environment list below).
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.18
- Server type and version: n/a
- Operating System and version: n/a
- Filters and plugins: (see configuration)
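The rough arithmetic behind that bonus question (assuming the roughly 2 MB default chunk size described in the buffering docs, and that storage.max_chunks_up caps how many chunks the filesystem-backed input keeps in memory; please correct me if this reading is wrong):

# storage.max_chunks_up 2048 x ~2 MB per chunk -> up to ~4 GB of chunks "up" in memory,
# which is far above the 64 MB Mem_Buf_Limit and the 64M storage.backlog.mem_limit we set
echo "$((2048 * 2)) MB"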
Thanks a lot for investigating! Looking forward to your feedback.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 8
- Comments: 19 (8 by maintainers)
AFAICT there were no comments by committers here. What does it mean for the state of a project when the maintainers don't take the time to respond to a very detailed bug report like this one?
Let me mention one more issue I saw when testing things out. On a few pods that were not affected, I saw the following log line being spammed over and over for hours:
The same connection #id appeared every few seconds, without ever ending. Only a pod restart could “resolve” it. Any idea what that issue might be related to?
I don't think our network itself is broken, as a few pods were always running fine, and only changes to or inside fluent-bit would make the affected ones work properly again (a combination of pod restarts and clearing the storage folder).