fluent-bit: Fluent Bit stuck when lost connection with the internet

Bug Report

Describe the bug

I want to test that whether Fluent Bit will buffer logs in local file system in case that Fluent Bit fails to flush data to output destination. This could be caused by unstable internet connection, power down and so on… The first case I tested is no internet connection. Therefore I just created an offline environment by turning off Wifi on my local machine.

However, Fluent Bit struct when lost connection with the internet, and still not respond (Or after a long time… it will respond) after turning internet on.

To Reproduce I have a docker containing running on my local computer and generating logs to a specific file. The code for the log-generator is referenced from here.

Steps:

  1. Run the docker container
  2. Start Fluent Bit
  3. Once Fluent Bit starts to work, turn Wifi off for ~30s
  4. After 30s, turn Wifi on and Fluent Bit will stuck

And the logs of Fluent Bit:

Fluent Bit v1.5.0
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2020/07/17 14:29:21] [ info] Configuration:
[2020/07/17 14:29:21] [ info]  flush time     | 1.000000 seconds
[2020/07/17 14:29:21] [ info]  grace          | 5 seconds
[2020/07/17 14:29:21] [ info]  daemon         | 0
[2020/07/17 14:29:21] [ info] ___________
[2020/07/17 14:29:21] [ info]  inputs:
[2020/07/17 14:29:21] [ info]      tail
[2020/07/17 14:29:21] [ info] ___________
[2020/07/17 14:29:21] [ info]  filters:
[2020/07/17 14:29:21] [ info]      record_modifier.0
[2020/07/17 14:29:21] [ info] ___________
[2020/07/17 14:29:21] [ info]  outputs:
[2020/07/17 14:29:21] [ info]      stackdriver.0
[2020/07/17 14:29:21] [ info] ___________
[2020/07/17 14:29:21] [ info]  collectors:
[2020/07/17 14:29:21] [ info] [engine] started (pid=31137)
[2020/07/17 14:29:21] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/07/17 14:29:21] [debug] [storage] [cio scan] opening path /var/log/fluent-bit-buffers/
[2020/07/17 14:29:21] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/07/17 14:29:21] [debug] [storage] [cio scan] opening stream tail.0
[2020/07/17 14:29:21] [debug] [storage] tail.0:30392-1595010555.961076422.flb mapped OK
[2020/07/17 14:29:21] [debug] [storage] [cio stream] new stream registered: tail.0
[2020/07/17 14:29:21] [debug] [storage] [cio stream] new stream registered: storage_backlog.1
[2020/07/17 14:29:21] [ info] [storage] version=1.0.4, initializing...
[2020/07/17 14:29:21] [ info] [storage] root path '/var/log/fluent-bit-buffers/'
[2020/07/17 14:29:21] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/07/17 14:29:21] [ info] [storage] backlog input plugin: storage_backlog.1
[2020/07/17 14:29:21] [debug] [input:tail:tail.0] inotify watch fd=20
[2020/07/17 14:29:21] [debug] [input:tail:tail.0] scanning path /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log
[2020/07/17 14:29:21] [debug] [input:tail:tail.0] inode=3192531 appended as /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log
[2020/07/17 14:29:21] [debug] [input:tail:tail.0] scan_glob add(): /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log, inode 3192531
[2020/07/17 14:29:21] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 9.5M
[2020/07/17 14:29:21] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/30392-1595010555.961076422.flb
[2020/07/17 14:29:21] [debug] [output:stackdriver:stackdriver.0] JWT signature:
eyJhbGciOiAiUlMyNTYiLCAidHlwIjogIkpXVCJ9.eyJpc3MiOiAiamVmZmx1b28tZ2tlLWRldi1sb2dnaW5nQGplZmZsdW9vLWdrZS1kZXYuaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLCAic2NvcGUiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vYXV0aC9sb2dnaW5nLndyaXRlIiwgImF1ZCI6ICJodHRwczovL3d3dy5nb29nbGVhcGlzLmNvbS9vYXV0aDIvdjQvdG9rZW4iLCAiZXhwIjogMTU5NTAxMzU2MSwgImlhdCI6IDE1OTUwMTA1NjF9.LsoMQodIitU4_yHvF59_nqZ-8AVyUbyq6XU2Z8db245UVb8OMuCrqAn8kajiMWtYgrVlFBqrm8FAl-MFRtyL4A3BSxI9n77vaeyqRHdc6MNK2JbzTmXHv_pF3Mo8IUXMOK_o6yTJgoJaIom6XHAKuWzdP1Ke5sii7IsEShI2nSD8PIE4HGXJOXwDyOFwaTgGEh877QsfEhhPhoSDy45aE0P_Di-EuFRoaBd98mpwCWL55MoEh5gD3MickW5DlbZEW9p36qfjLszHd4Vg-IRmNzAyv1nuXmNlUrdtlHcQFy4Q1txW0wddRCzEw_Rgn5ArDNReK3-57EZfWWuUKcHAeA
[2020/07/17 14:29:21] [ info] [oauth2] HTTP Status=200
[2020/07/17 14:29:21] [debug] [oauth2] payload:
{"access_token":"ya29.c.Ko8B1AeKv0M_TT3RoumJIRw_KYseXvznb97eGhsP9nkFPevcYVjSRgUkDR1Dt8F6VjtXhCuIc4p1IxE-zemeHlfaUX_SE9_ixP95fZrYs8RPowELCPYhEDXPMnaAK7a9gNzAPiFGOC1PNXwFmD-I6iLF0WwJbgv62ewzcMC4wjRpkGODpUl1JMsCggt46QUGc38","expires_in":3599,"token_type":"Bearer"}
[2020/07/17 14:29:21] [ info] [oauth2] access token from 'www.googleapis.com:443' retrieved
[2020/07/17 14:29:21] [debug] [router] match rule tail.0:stackdriver.0
[2020/07/17 14:29:21] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2020/07/17 14:29:21] [ info] [sp] stream processor started
[2020/07/17 14:29:21] [debug] [storage] tail.0:31137-1595010561.575441987.flb mapped OK
[2020/07/17 14:29:21] [debug] [storage] [cio file] alloc_size from 4096 to 233472
[2020/07/17 14:29:21] [debug] [storage] [cio file] alloc_size from 233472 to 331776
[2020/07/17 14:29:21] [debug] [input:tail:tail.0] inode=3192531 file=/var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log promote to TAIL_EVENT
[2020/07/17 14:29:21] [ info] inotify_fs_add(): inode=3192531 watch_fd=1 name=/var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log
[2020/07/17 14:29:22] [debug] [task] created task=0x5cafad1150e0 id=0 OK
[2020/07/17 14:29:22] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:29:22] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:29:22] [debug] [storage] tail.0:30392-1595010555.961076422.flb mapped OK
[2020/07/17 14:29:22] [ info] [input:storage_backlog:storage_backlog.1] queueing tail.0:30392-1595010555.961076422.flb
[2020/07/17 14:29:22] [debug] [task] destroy task=0x5cafad1150e0 (task_id=0)
[2020/07/17 14:29:22] [debug] [storage] [cio file] synced at: tail.0/31137-1595010561.575441987.flb
[2020/07/17 14:29:23] [debug] [task] created task=0x5cafad17fa40 id=0 OK
[2020/07/17 14:29:23] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:29:23] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:29:23] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:29:23] [ info] [engine] flush backlog chunk '30392-1595010555.961076422.flb' succeeded: task_id=0, input=storage_backlog.1 > output=stackdriver.0
[2020/07/17 14:29:23] [debug] [task] destroy task=0x5cafad17fa40 (task_id=0)
[2020/07/17 14:29:23] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:29:23] [debug] [storage] tail.0:31137-1595010563.326386283.flb mapped OK
[2020/07/17 14:29:24] [debug] [task] created task=0x5cafad180a60 id=0 OK
[2020/07/17 14:29:24] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:29:24] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:29:24] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:29:24] [debug] [task] destroy task=0x5cafad180a60 (task_id=0)
[2020/07/17 14:29:24] [debug] [storage] [cio file] synced at: tail.0/31137-1595010563.326386283.flb
[2020/07/17 14:29:25] [error] [src/flb_scheduler.c:48 errno=0] Success
[2020/07/17 14:29:25] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:29:25] [debug] [storage] tail.0:31137-1595010565.652132014.flb mapped OK
[2020/07/17 14:29:26] [debug] [task] created task=0x5cafad17fc60 id=0 OK
[2020/07/17 14:29:26] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:29:26] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:29:26] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:29:26] [debug] [task] destroy task=0x5cafad17fc60 (task_id=0)
[2020/07/17 14:29:26] [debug] [storage] [cio file] synced at: tail.0/31137-1595010565.652132014.flb
[2020/07/17 14:29:28] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:29:28] [debug] [storage] tail.0:31137-1595010568.43277302.flb mapped OK
[2020/07/17 14:29:28] [debug] [task] created task=0x5cafad17fc60 id=0 OK
[2020/07/17 14:29:28] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:29:28] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:29:28] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:29:28] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:29:28] [debug] [storage] tail.0:31137-1595010568.577112829.flb mapped OK
[2020/07/17 14:29:28] [debug] [task] destroy task=0x5cafad17fc60 (task_id=0)
[2020/07/17 14:29:28] [debug] [storage] [cio file] synced at: tail.0/31137-1595010568.43277302.flb
[2020/07/17 14:29:29] [debug] [task] created task=0x5cafad17fc60 id=0 OK
[2020/07/17 14:29:29] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)**
[2020/07/17 14:34:15] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/07/17 14:34:15] [error] [http_client] broken connection to logging.googleapis.com:443 ?
[2020/07/17 14:34:15] [ warn] [output:stackdriver:stackdriver.0] http_do=-1
[2020/07/17 14:34:15] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:34:15] [debug] [input:tail:tail.0] scanning path /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log
[2020/07/17 14:34:15] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log, inode 3192531
[2020/07/17 14:34:15] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:34:15] [debug] [storage] tail.0:31137-1595010855.384208524.flb mapped OK
[2020/07/17 14:34:15] [debug] [storage] [cio file] alloc_size from 4096 to 36864
[2020/07/17 14:34:15] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been disconnected by the remote service
[2020/07/17 14:34:15] [debug] [retry] new retry created for task_id=0 attemps=1
[2020/07/17 14:34:15] [ warn] [engine] failed to flush chunk '31137-1595010568.577112829.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=stackdriver.0
[2020/07/17 14:34:16] [debug] [task] created task=0x5cafad17f800 id=1 OK
[2020/07/17 14:34:16] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:34:16] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:34:16] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:34:16] [debug] [storage] tail.0:31137-1595010856.429208524.flb mapped OK
[2020/07/17 14:34:16] [debug] [task] destroy task=0x5cafad17f800 (task_id=1)
[2020/07/17 14:34:16] [debug] [storage] [cio file] synced at: tail.0/31137-1595010855.384208524.flb
[2020/07/17 14:34:17] [debug] [task] created task=0x5cafad17f800 id=1 OK
[2020/07/17 14:34:17] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:34:17] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:34:17] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:34:17] [debug] [task] destroy task=0x5cafad17f800 (task_id=1)
[2020/07/17 14:34:17] [debug] [storage] [cio file] synced at: tail.0/31137-1595010856.429208524.flb
[2020/07/17 14:34:18] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:34:18] [debug] [storage] tail.0:31137-1595010858.396208524.flb mapped OK
[2020/07/17 14:34:19] [debug] [task] created task=0x5cafad17f800 id=1 OK
[2020/07/17 14:34:19] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:34:19] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:34:19] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:34:19] [debug] [task] destroy task=0x5cafad17f800 (task_id=1)
[2020/07/17 14:34:19] [debug] [storage] [cio file] synced at: tail.0/31137-1595010858.396208524.flb
[2020/07/17 14:34:20] [debug] [input:tail:tail.0] inode=3192531 events: IN_MODIFY 
[2020/07/17 14:34:20] [debug] [storage] tail.0:31137-1595010860.324208524.flb mapped OK
[2020/07/17 14:34:21] [debug] [task] created task=0x5cafad17f4a0 id=1 OK
[2020/07/17 14:34:21] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)
[2020/07/17 14:34:21] [debug] [output:stackdriver:stackdriver.0] HTTP Status=200
[2020/07/17 14:34:21] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 is now available
[2020/07/17 14:34:21] [debug] [input:tail:tail.0] scanning path /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log
[2020/07/17 14:34:21] [debug] [input:tail:tail.0] scan_blog add(): dismissed: /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log, inode 3192531
[2020/07/17 14:34:21] [debug] [task] destroy task=0x5cafad17f4a0 (task_id=1)
[2020/07/17 14:34:21] [debug] [storage] [cio file] synced at: tail.0/31137-1595010860.324208524.flb
[2020/07/17 14:34:22] [error] [src/flb_scheduler.c:48 errno=0] Success
^C[engine] caught signal (SIGINT)
[2020/07/17 14:34:23] [ info] [input] pausing tail.0
[2020/07/17 14:34:23] [ info] [input] pausing storage_backlog.1
[2020/07/17 14:34:23] [debug] [input:tail:tail.0] inode=3192531 removing file name /var/lib/docker/containers/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf/edbace40bae66c75883ea9a41bb1cb1f561a75262abc19d7f28c7c8034e4f7cf-json.log
[2020/07/17 14:34:23] [ info] inotify_fs_remove(): inode=3192531 watch_fd=1
[2020/07/17 14:34:23] [debug] [task] destroy task=0x5cafad17fc60 (task_id=0)
[2020/07/17 14:34:23] [debug] [storage] [cio file] synced at: tail.0/31137-1595010568.577112829.flb
[2020/07/17 14:34:23] [debug] [retry] task retry=0x5cafad04ea90, invalidated from the scheduler

After turning off Wifi, the last log generated by Fluent Bit is

[2020/07/17 14:29:29] [debug] [upstream] KA connection #41 to logging.googleapis.com:443 has been assigned (recycled)

and then after turning on Wifi Fluent Bit will stuck on this state for unknown time… I have tried this many times and sometimes Fluent Bit will response in ~17 minutes and sometimes ~5 mins. The next few lines of logs since it stuck are:

[2020/07/17 14:34:15] [error] [tls] SSL error: NET - Connection was reset by peer
[2020/07/17 14:34:15] [error] [http_client] broken connection to logging.googleapis.com:443 ?
  • Steps to reproduce the problem:

Expected behavior

I expect Fluent Bit to respond when the internet connection is on and no logs are lost during the offline period.

Screenshots

Your Environment

  • Version used: Fluent Bit v1.5
  • Configuration: The configuration file I was using for testing is:
[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    debug
    Parsers_File myparser.conf
    storage.path               /var/log/fluent-bit-buffers/
    storage.sync               normal
    storage.checksum           off
    storage.backlog.mem_limit  10M
    HTTP_Server  On
    HTTP_Listen  0.0.0.0
    HTTP_PORT    2020

[INPUT]
    Name            tail
    Tag             k8s_container.test1.test2.offline
    Parser          JSON
    Path       PATH_TO_THE_LOG_FILE
    Buffer_Chunk_Size  512KB
    Buffer_Max_Size    5M
    Rotate_Wait        30
    Mem_Buf_Limit      5M
    Skip_Long_Lines    On
    Refresh_Interval   10
    storage.type       filesystem
    Ignore_Older       4h

[OUTPUT]
    Name stackdriver
    Match k8s_container.*
    Resource k8s_container
    k8s_cluster_name test_cluster_name
    k8s_cluster_location  test_cluster_location
    google_service_credentials /home/jeffluoo/fluentbit/gke-config.json
    Retry_Limit  False
    tls          On
    tls.verify   Off
    net.connect_timeout  30
    net.keepalive_idle_timeout  30

[FILTER]
    Name record_modifier
    Match k8s_container.*
    Record logging.googleapis.com/local_resource_id k8s_container.test1.test2.testfast
  • Environment name and version (e.g. Kubernetes? What version?):
  • Server type and version:
  • Operating System and version:
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Filters and plugins: Tail + Record_Modifier + Stackdriver

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

Hi @edsiper Yes I think this ticket can be closed as the issue was resolved.