fluent-bit: input forward: plugin doesn't close the TCP port when it catches SIGTERM
Bug Report
Describe the bug
The forward input plugin doesn't close its TCP port when it catches SIGTERM. This can cause some logs to be lost on shutdown.
To Reproduce
- Run a fluent-bit container with Docker:
$ docker run -it --rm --name flb -p 24224:24224 fluent/fluent-bit:1.5.6 /fluent-bit/bin/fluent-bit -v -i forward -o stdout -f 1
- Run an `nc` (port scan) loop in a container with the Docker fluentd log driver and forward its logs to the fluent-bit container:
$ docker run -it --rm --log-driver=fluentd --log-opt fluentd-address=127.0.0.1:24224 --link flb busybox sh -c 'while true; do date; nc -vz flb 24224; sleep 1; done'
- Stop the fluent-bit container:
$ date -u; time docker stop -t 10 flb; date -u
The outputs of the commands are as follows:
$ docker run -it --rm --name flb -p 24224:24224 fluent/fluent-bit:1.5.6 /fluent-bit/bin/fluent-bit -v -i forward -o stdout -f 1
Fluent Bit v1.5.6
* Copyright (C) 2019-2020 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2020/09/29 00:09:55] [ info] Configuration:
[2020/09/29 00:09:55] [ info] flush time | 1.000000 seconds
[2020/09/29 00:09:55] [ info] grace | 5 seconds
[2020/09/29 00:09:55] [ info] daemon | 0
[2020/09/29 00:09:55] [ info] ___________
[2020/09/29 00:09:55] [ info] inputs:
[2020/09/29 00:09:55] [ info] forward
[2020/09/29 00:09:55] [ info] ___________
[2020/09/29 00:09:55] [ info] filters:
[2020/09/29 00:09:55] [ info] ___________
[2020/09/29 00:09:55] [ info] outputs:
[2020/09/29 00:09:55] [ info] stdout.0
[2020/09/29 00:09:55] [ info] ___________
[2020/09/29 00:09:55] [ info] collectors:
[2020/09/29 00:09:55] [ info] [engine] started (pid=1)
[2020/09/29 00:09:55] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2020/09/29 00:09:55] [debug] [storage] [cio stream] new stream registered: forward.0
[2020/09/29 00:09:55] [ info] [storage] version=1.0.5, initializing...
[2020/09/29 00:09:55] [ info] [storage] in-memory
[2020/09/29 00:09:55] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2020/09/29 00:09:55] [debug] [in_fw] Listen='0.0.0.0' TCP_Port=24224
[2020/09/29 00:09:55] [ info] [input:forward:forward.0] listening on 0.0.0.0:24224
[2020/09/29 00:09:55] [debug] [router] default match rule forward.0:stdout.0
[2020/09/29 00:09:55] [ info] [sp] stream processor started
[0] e9ed8c030c0a: [1601338201.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"Tue Sep 29 00:10:01 UTC 2020
[1] e9ed8c030c0a: [1601338201.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"flb (172.17.0.2:24224) open
[2020/09/29 00:10:02] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:02] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
[0] e9ed8c030c0a: [1601338202.000000000, {"container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"Tue Sep 2", "container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b3e973"}]
[1] e9ed8c030c0a: [1601338202.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"flb (172.17.0.2:24224) open
[2020/09/29 00:10:03] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:03] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
", "container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b3e973", "container_name"=>"/laughing_feistel", "source"=>"stdout"}]
", "container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b3e973", "container_name"=>"/laughing_feistel", "source"=>"stdout"}]
[2020/09/29 00:10:04] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:04] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
[engine] caught signal (SIGTERM)
", "container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b3e973", "container_name"=>"/laughing_feistel", "source"=>"stdout"}]
[1] e9ed8c030c0a: [1601338204.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"flb (172.17.0.2:24224) open
[2020/09/29 00:10:04] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:04] [ warn] [engine] service will stop in 5 seconds
[2020/09/29 00:10:04] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
[0] e9ed8c030c0a: [1601338205.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"Tue Sep 29 00:10:05 UTC 2020
", "container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b3e973", "container_name"=>"/laughing_feistel"}]
[2020/09/29 00:10:06] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:06] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
[0] e9ed8c030c0a: [1601338206.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"Tue Sep 29 00:10:06 UTC 2020
[1] e9ed8c030c0a: [1601338206.000000000, {"container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"flb (172.", "container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b3e973"}]
[2020/09/29 00:10:07] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:07] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
[0] e9ed8c030c0a: [1601338207.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"Tue Sep 29 00:10:07 UTC 2020
[1] e9ed8c030c0a: [1601338207.000000000, {"container_id"=>"e9ed8c030c0af0df30f15bad9a5e62c49e78dfc581738c6ccf480457e0b"}]73", "container_name"=>"/laughing_feistel", "source"=>"stdout", "log"=>"flb (172.17.0.2:24224) open
[2020/09/29 00:10:08] [debug] [task] created task=0x7f4257a2f3a0 id=0 OK
[2020/09/29 00:10:08] [debug] [task] destroy task=0x7f4257a2f3a0 (task_id=0)
[2020/09/29 00:10:09] [ info] [engine] service stopped
$ docker run -it --rm --log-driver=fluentd --log-opt fluentd-address=127.0.0.1:24224 --link flb busybox sh -c 'while true; do date; nc -vz flb 24224; sleep 1; done'
Tue Sep 29 00:10:01 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:02 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:03 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:04 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:05 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:06 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:07 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:08 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:10:09 UTC 2020
$ date -u; time docker stop -t 10 flb; date -u
Tue Sep 29 00:10:04 UTC 2020
flb
docker stop -t 10 flb 0.04s user 0.02s system 1% cpu 4.409 total
Tue Sep 29 00:10:09 UTC 2020
According to the above output, Docker tried to stop the fluent-bit container at 00:10:04. The engine caught SIGTERM at 00:10:04, but some logs were still being ingested: the timestamp of the last log entry is 1601338207 (= 2020/09/29 00:10:07), and the output of the nc command shows the TCP port was still open. The engine completed shutdown at 00:10:09, after the grace period (5 seconds).
I tested this multiple times and found the behavior to be inconsistent: the input was sometimes paused, but the TCP port was still open.
[0] cfedccd0310b: [1601337750.000000000, {"container_id"=>"cfedccd0310b0b4b747d4bd6b8362d8d009ff26b24cdd59d3b58faa3ea3"}]42", "container_name"=>"/vigilant_aryabhata", "source"=>"stdout", "log"=>"Tue Sep 29 00:02:30 UTC 2020
", "container_id"=>"cfedccd0310b0b4b747d4bd6b8362d8d009ff26b24cdd59d3b58faa3ea300b42", "container_name"=>"/vigilant_aryabhata", "source"=>"stdout"}]
[2020/09/29 00:02:30] [debug] [task] created task=0x7f8ee0e2f3a0 id=0 OK
[2020/09/29 00:02:30] [debug] [task] destroy task=0x7f8ee0e2f3a0 (task_id=0)
[engine] caught signal (SIGTERM)
[2020/09/29 00:02:30] [ warn] [engine] service will stop in 5 seconds
[2020/09/29 00:02:31] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:31] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:32] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:32] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:33] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:33] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:34] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:34] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:35] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:35] [debug] [input chunk] forward.0 is paused, cannot append records
[2020/09/29 00:02:35] [ info] [engine] service stopped
Tue Sep 29 00:02:28 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:29 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:30 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:31 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:32 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:33 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:34 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:35 UTC 2020
flb (172.17.0.2:24224) open
Tue Sep 29 00:02:36 UTC 2020
nc: flb (172.17.0.2:24224): Connection timed out
Tue Sep 29 00:04:46 UTC 2020
$ date -u; time docker stop -t 10 flb; date -u
Tue Sep 29 00:02:30 UTC 2020
flb
docker stop -t 10 flb 0.04s user 0.02s system 1% cpu 5.190 total
Tue Sep 29 00:02:35 UTC 2020
Expected behavior
The forward input plugin should close its TCP port when it catches SIGTERM so that no logs are lost on shutdown.
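For illustration, here is a minimal, self-contained C sketch of the shutdown order this report asks for. It is not fluent-bit code; the port 24224 and the 5-second grace period are simply taken from the reproduction above. The idea is to close the listening socket as soon as SIGTERM arrives, and use the grace period only to flush data that was already accepted.

```c
/* Illustrative sketch only (not fluent-bit internals): close the listening
 * socket on SIGTERM so new connections are refused immediately, then spend
 * the grace period draining data that was already accepted. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static volatile sig_atomic_t stopping = 0;

static void on_sigterm(int sig)
{
    (void) sig;
    stopping = 1;
}

int main(void)
{
    int listen_fd;
    struct sockaddr_in addr;

    signal(SIGTERM, on_sigterm);

    listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(24224);   /* same port as the forward input above */

    if (bind(listen_fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(listen_fd, 128) < 0) {
        perror("bind/listen");
        return 1;
    }

    while (!stopping) {
        /* accept clients and read records here (omitted in this sketch) */
        sleep(1);
    }

    /* Close the listening socket as soon as SIGTERM arrives: from this point
     * on, senders see "connection refused" instead of a silently open port,
     * so they can fail over or buffer locally. */
    close(listen_fd);

    /* Grace period: flush/forward whatever was already ingested. */
    sleep(5);
    return 0;
}
```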
Your Environment
- Version used: v1.5.6
- Configuration: passed as command-line arguments, as described in the reproduction steps above
- Environment name and version (e.g. Kubernetes? What version?): AWS ECS Fargate (platform version: 1.4.0)
- Server type and version: AWS
- Operating System and version: Docker (fluent/fluent-bit:1.5.6)
- Filters and plugins: input forward
Additional context
I'm running fluent-bit containers on AWS ECS Fargate behind an AWS NLB and forwarding other containers' logs with the Docker fluentd log driver. When I need to update my fluent-bit configuration, I use a Blue-Green deployment strategy so that no logs are lost during the deployment. However, AWS NLB doesn't reset existing TCP connections even after the NLB's deregistration delay has passed. The deregistration delay blocks new connections, but it doesn't reset existing ones. If my understanding is correct, the grace period doesn't guarantee that all buffers are flushed. This means that if fluent-bit ingests new log entries during shutdown, they will be lost.
I don't think this only affects fluent-bit behind an AWS NLB; I can reproduce it with the simple Docker container setup described above. Consistently blocking input on shutdown may be a minimal solution, but it would be even better if fluent-bit closed its TCP port on shutdown, because that lets senders quickly detect that the receiver is no longer available.
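To make that last point concrete, here is a small C sketch of the sender-side view, again illustrative only (the address 127.0.0.1:24224 is just the forward port from the reproduction): once the receiver closes its listening socket, connect() fails immediately with ECONNREFUSED, so the sender can fail over or buffer instead of writing into an endpoint that is about to discard the data.

```c
/* Illustrative sender-side check: distinguish a port that is still open
 * (risky to send to during shutdown) from one that was properly closed. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(24224);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
        /* Port still open: the receiver may accept the connection even while
         * shutting down, and records sent now can be silently lost. */
        printf("port open, sending is risky during shutdown\n");
    }
    else if (errno == ECONNREFUSED) {
        /* Listening socket closed: the sender knows immediately that the
         * receiver is gone and can retry elsewhere or buffer locally. */
        printf("connection refused, fail over now\n");
    }
    else {
        perror("connect");
    }

    close(fd);
    return 0;
}
```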
Related issue: https://github.com/fluent/fluent-bit/issues/1496
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 31 (19 by maintainers)
Commits related to this issue
- bin: spawn pipeline using library mode (#1496 #2610) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to fluent/fluent-bit by edsiper 4 years ago
- in_forward: partial workaround to close connections on SIGTERM (#2610) This patch adds some extra handling for service termination, so when the engine paused ingestion, the server socket is close to ... — committed to fluent/fluent-bit by edsiper 4 years ago
- bin: spawn pipeline using library mode (#1496 #2610) Signed-off-by: Eduardo Silva <eduardo@treasure-data.com> — committed to xmcqueen/fluent-bit by edsiper 4 years ago
- in_forward: partial workaround to close connections on SIGTERM (#2610) This patch adds some extra handling for service termination, so when the engine paused ingestion, the server socket is close to ... — committed to sirwio/fluent-bit by edsiper 4 years ago
- Geoip (#1) * ra: fix typo of comment Signed-off-by: Takahiro YAMASHITA <nokute78@gmail.com> * build: add an option for OSS-Fuzz builds (#2502) This will make things a lot easier from the OSS... — committed to Kingyo/fluent-bit by Kingyo 4 years ago
- in_forward: fix handling of TCP connections on SIGTERM (#2610) When Fluent Bit receives a SIGTERM, internally it was only taking care of the server socket leaving the active connections in an open st... — committed to fluent/fluent-bit by edsiper 3 years ago
- in_forward: fix handling of TCP connections on SIGTERM (#2610) When Fluent Bit receives a SIGTERM, internally it was only taking care of the server socket leaving the active connections in an open st... — committed to fluent/fluent-bit by edsiper 3 years ago
- in_forward: fix handling of TCP connections on SIGTERM (#2610) When Fluent Bit receives a SIGTERM, internally it was only taking care of the server socket leaving the active connections in an open st... — committed to fluent/fluent-bit by edsiper 3 years ago
- in_forward: fix handling of TCP connections on SIGTERM (#2610) When Fluent Bit receives a SIGTERM, internally it was only taking care of the server socket leaving the active connections in an open st... — committed to 0Delta/fluent-bit by edsiper 3 years ago
@zhonghui12 @hossain-rayhan @DrewZhang13 @PettitWesley
@edsiper Ping: we are still seeing reports that Fluent Bit stops accepting records on the forward input when it receives a SIGTERM, which causes log loss. Customers prefer Fluent Bit to continue accepting records until SIGKILL. If that behavior is not your preferred default, then we request that this alternate behavior be made available as an opt-in.
https://github.com/aws/aws-for-fluent-bit/issues/201