fluent-bit: fluent-bit with storage enabled completely freezes (due to sockets not being closed properly(?))

Bug Report

Describe the bug

Under certain circumstances fluent-bit can freeze completely and stop processing logs. We have seen this behavior for quite some time now, across many pods, and whether it appears or not seems somewhat random.

We rely on the storage configuration, and my suspicion is that this bug can only appear with storage enabled, probably in combination with higher load and a certain (mis?)configuration (see our config below). Unfortunately it is a serious issue for us: storage buffering is supposed to make us more resilient against a logging outage, but here it is actually the other way round 😃

I could reproduce the issue by running a log generator that produces 200 messages of 1 kB each per second while denying all outgoing traffic from the fluent-bit pod. After a while, roughly one out of three times, our pod froze and did not process any more logs at all.
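
The setup was roughly the following; treat this as an illustrative sketch rather than the exact scripts we used (the generator loop and the way egress is blocked are placeholders):

# Log generator: ~200 messages of ~1 kB each per second, written to stdout so the
# tail input picks them up via /var/log/containers/*.log
payload=$(head -c 1000 /dev/zero | tr '\0' 'x')   # ~1 kB dummy payload
while true; do
  for i in $(seq 1 200); do
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) msg=$i $payload"
  done
  sleep 1
done

# Simulated backend outage: deny all outgoing traffic from the fluent-bit pod,
# e.g. with a deny-all-egress NetworkPolicy, or (hypothetically) from the node
# inside the pod's network namespace:
iptables -I OUTPUT -p tcp --dport 443 -j DROP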

With the fluent-bit debug image at hand, I made the following interesting discovery: at some point all sockets end up in CLOSE_WAIT state. Here is my netstat debugging while the issue occurred:

/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp        0      0 fluent-bit-jfz6p:60382  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59954  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59962  dtag-laas.netic.dk:https ESTABLISHED 
[...]
tcp        0      0 fluent-bit-jfz6p:60380  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60054  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60172  dtag-laas.netic.dk:https ESTABLISHED 
tcp    10106      0 fluent-bit-jfz6p:59820  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60470  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60370  dtag-laas.netic.dk:https ESTABLISHED 
tcp      641      0 fluent-bit-jfz6p:59802  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59952  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60068  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:2020   10.2.131.169:40492      TIME_WAIT   
tcp        0      0 fluent-bit-jfz6p:60638  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60292  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      1 fluent-bit-jfz6p:60614  dtag-laas.netic.dk:https SYN_SENT    
tcp        0      0 fluent-bit-jfz6p:60362  dtag-laas.netic.dk:https ESTABLISHED 
tcp    11355      0 fluent-bit-jfz6p:60716  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60308  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59836  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:59882  dtag-laas.netic.dk:https ESTABLISHED 
tcp        0      0 fluent-bit-jfz6p:60586  dtag-laas.netic.dk:https ESTABLISHED 
Active UNIX domain sockets (w/o servers)
Proto RefCnt Flags       Type       State         I-Node Path
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
199
/flb-storage/tail.0 # netstat | grep ESTABLISHED | wc -l
7
/flb-storage/tail.0 # netstat
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       
tcp       32      0 fluent-bit-jfz6p:60382  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:59954  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        1      0 fluent-bit-jfz6p:59962  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60354  dtag-laas.netic.dk:https CLOSE_WAIT  
[...]
tcp       32      0 fluent-bit-jfz6p:60068  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        0      0 fluent-bit-jfz6p:2020   10.2.131.169:40492      TIME_WAIT   
tcp        1      0 fluent-bit-jfz6p:60292  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        0      0 fluent-bit-jfz6p:60614  dtag-laas.netic.dk:https ESTABLISHED 
tcp       32      0 fluent-bit-jfz6p:59976  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60620  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60362  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp    11387      0 fluent-bit-jfz6p:60716  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp        1      0 fluent-bit-jfz6p:60298  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:59882  dtag-laas.netic.dk:https CLOSE_WAIT  
tcp       32      0 fluent-bit-jfz6p:60586  dtag-laas.netic.dk:https CLOSE_WAIT  

Note that I stripped a few lines; there were actually around 200 sockets, all in the same state in each snapshot.

While this happens, our /flb-storage/ directory usually contains a few hundred files that no longer change, since the pod is frozen.
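
For completeness, the buffered chunks can be checked from the same debug shell; run the commands twice with some delay, and once the pod is frozen neither the file count nor the modification times change:

/flb-storage/tail.0 # ls | wc -l        # number of buffered chunk files
/flb-storage/tail.0 # ls -la | tail     # their modification times stop advancing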

The logs of such an affected pod look like this:

[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/01/14 13:58:38] [error] [src/flb_http_client.c:1120 errno=25] Inappropriate ioctl for device
[2021/01/14 13:58:38] [ warn] [output:es:es.0] http_do=-1 URI=/api/v1/ingest/elastic-bulk/_bulk
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632705.229524992.flb', retry in 8 seconds: task_id=37, input=tail.0 > output=es.2
[2021/01/14 13:58:38] [error] [upstream] connection #170 to dtag-laas.netic.dk:443 timed out after 10 seconds
[...]
[2021/01/14 13:58:38] [error] [upstream] connection #1271 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [error] [upstream] connection #896 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/14 13:58:38] [ warn] [engine] failed to flush chunk '1-1610632704.928894552.flb', retry in 7 seconds: task_id=18, input=tail.0 > output=es.0
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[...]
[2021/01/14 13:58:38] [error] [io] TCP connection failed: dtag-laas.netic.dk:443 (Connection timed out)
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.103450536.flb', retry in 10 seconds: task_id=28, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ warn] [engine] failed to flush chunk '1-1610632705.26765971.flb', retry in 9 seconds: task_id=24, input=tail.0 > output=es.0
[2021/01/14 13:58:39] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604215 watch_fd=97 name=/var/log/containers/admin-management-59b684bf76-7dbqk_staging_admin-management-025be4bb58a5958704f1f80e8ed72322c98402e2f756e71708e4d0c56e5b842c.log
[...]
[2021/01/14 13:58:40] [ warn] [engine] failed to flush chunk '1-1610632705.269473234.flb', retry in 9 seconds: task_id=39, input=tail.0 > output=es.0
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=149604213 watch_fd=98 name=/var/log/containers/skill-e2e-tests-86598c5668-2qxpg_staging_skill-e2e-tests-e1e165a7ed49643b0e645fe5ad7f111d0e0472d1d38ab9875cf43da7210782b4.log
[2021/01/14 13:58:40] [ info] [input:tail:tail.0] inotify_fs_add(): inode=208216295 watch_fd=99 name=/var/log/containers/skill-e2e-tests-birthday-5cd88db447-p6x86_prodshadow_skill-e2e-tests-birthday-99aa9d08d371222835ab92a28fcfe0ea3245459c30efa271db31c854bd8b622d.log
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632706.515332169.flb', retry in 10 seconds: task_id=108, input=tail.0 > output=es.2
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.205745559.flb', retry in 11 seconds: task_id=34, input=tail.0 > output=es.0
[2021/01/14 13:58:41] [ warn] [engine] failed to flush chunk '1-1610632705.249240345.flb', retry in 9 seconds: task_id=38, input=tail.0 > output=es.0

… with no further log lines ever following (the pod froze completely).

I assume this is a bug in fluent-bit: it does not close its sockets properly and thereby somehow blocks itself. Once I even saw 731 sockets in CLOSE_WAIT within a single pod.
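
(Counted with the same kind of one-liner as the ESTABLISHED count above:)

/flb-storage/tail.0 # netstat | grep CLOSE_WAIT | wc -l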

Note that I saw both outcomes: a few times a pod restart would help, but in most cases even restarting the pod would not. Only when I cleared the storage folder /flb-storage/ and then restarted the pod was there a high chance of recovery.
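
For anyone hitting the same thing, the recovery steps that worked most reliably for us look roughly like this (sketch only; the namespace is a placeholder, the pod name is the one from the netstat output above, and /flb-storage/ comes from our storage.path setting):

# clear the filesystem buffer inside the affected pod
kubectl exec -n logging fluent-bit-jfz6p -- sh -c 'rm -rf /flb-storage/*'
# then delete the pod so the DaemonSet recreates it with an empty buffer
kubectl delete pod -n logging fluent-bit-jfz6p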

In my subjective judgement, the issue appears more often when the load is high and fluent-bit has to process many container log files. We discovered this issue for the first time when our backend (Elasticsearch) was offline for a few hours; about 30% of our pods were frozen.

Screenshots

[screenshot: no-output]

As seen in the screenshot, there is a short spike of data (disregard the time before the spike; the pod was spawned at the spike), and then no output at all anymore: fluent-bit does not process any more data.

Your Environment

  • Version used: tested with both v1.6.9 and v1.6.10; both show the bug
  • Configuration (ConfigMap in Kubernetes):
  custom_parsers.conf: |
    [PARSER]
        Name kube-svh
        Format regex
        Regex ((?<tag>[^.]+)\.)?(?<namespace_name>[^.]+)\.(?<pod_name>[^.]+)\.(?<container_name>[^.]+)\.(?<docker_id>[^.]+)$
  fluent-bit.conf: |
    [SERVICE]
        Flush 1
        Daemon Off
        Log_Level info
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        storage.path /flb-storage/
        storage.sync normal
        storage.checksum off
        storage.max_chunks_up 2048
        storage.backlog.mem_limit 64M
        storage.metrics on

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        DB.Sync           Normal
        DB.locking        false
        Refresh_Interval  5
        Mem_Buf_Limit     64MB
        Skip_Long_Lines   On
        Tag               kube.<namespace_name>.<pod_name>.<container_name>.<docker_id>
        Tag_Regex         (?<pod_name>[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})
        Buffer_Chunk_Size 32k
        Buffer_Max_Size   4M
        storage.type      filesystem

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_Tag_Prefix     kube.
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Regex_Parser        kube-svh
        Merge_Log           Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              Off
        Annotations         Off

    [FILTER]
        Name modify
        Match kube.*
        Remove_Wildcard attrs

    [OUTPUT]
        Name  es
        Match kube.staging.*
        Host  XYZ
        Port  443
        Path  /api/v1/ingest/elastic-bulk
        Logstash_Format On
        Retry_Limit False
        Type  _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix staging-4

        HTTP_User XYZ
        HTTP_Passwd XYZ

        Buffer_Size False

        tls on
        tls.verify on
        tls.debug 1

        storage.total_limit_size 30G
    [OUTPUT]
        Name  es
        Match kube.eventrouter.*
        Host  XYZ
        Port  443
        Path  /api/v1/ingest/elastic-bulk
        Logstash_Format On
        Retry_Limit False
        Type  _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix staging-4

        HTTP_User XYZ
        HTTP_Passwd XYZ

        Buffer_Size False

        tls on
        tls.verify on
        tls.debug 1

        storage.total_limit_size 30G
    [...]

Note: we have a few more [OUTPUT] sections that follow the same pattern as shown here (that's why the excerpt ends with [...]).

Bonus question: looking at the configuration, do those values sound reasonable? We were not too sure about storage.max_chunks_up, storage.backlog.mem_limit, and how they relate to the buffer sizes specified in the tail input.
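
Our own rough reading of those values is the following; the ~2 MB figure is the usual fluent-bit chunk target rather than something we measured, so please correct us if this interpretation is off:

# storage.max_chunks_up     2048  -> up to ~2048 chunks held "up" in memory; at ~2 MB per chunk that is ~4 GB worst case
# storage.backlog.mem_limit 64M   -> roughly 32 backlog chunks from a previous run loaded from disk at a time
# Buffer_Chunk_Size / Buffer_Max_Size (tail) only shape how the input reads the log files,
#                                    while the on-disk cap per output is storage.total_limit_size 30G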

  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.18
  • Server type and version: n/a
  • Operating System and version: n/a
  • Filters and plugins: (see configuration)

Thanks a lot for investigating! Looking forward to your feedback.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 8
  • Comments: 19 (8 by maintainers)

Most upvoted comments

AFAICT there were no comments by committers here. What does it mean for the state of a project when the maintainers don't take the time to respond to a very detailed bug report like this one?

Let me mention one more issue I saw when testing things out. On a few pods that were not affected, I saw the following log line being spammed over and over for hours:

[2021/01/13 22:47:54] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/13 22:47:56] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds
[2021/01/13 22:47:57] [error] [upstream] connection #237 to dtag-laas.netic.dk:443 timed out after 10 seconds

The same connection #id, every few seconds, without ever ending. Only a pod restart could “resolve” it. Any idea what that issue might be related to?

I don't think our network itself is broken, as a few pods were always running fine, and only changes to or inside fluent-bit (a combination of pod restarts and clearing the storage folder) could make the affected ones work properly again.