fluent-bit: FluentBit v1.7.9 to v1.8.0 jump introduces hangs and crashes for high throughput logs

Bug Report

Describe the bug We are doing a performance test to send logs to Amazon Kinesis Firehose using the core kinesis_firehose plugin. We are sending logs at a rate of 3Mb/sec for 300 seconds where each record is ~1Kb. FluentBit versions v1.8.0 and higher seem to get stuck/crash. However older versions work fine and can send all the logs to Kinesis Firehose without any issue. We started our testing from v1.7.0 and gradually moved to v1.9.0. The crash started showing up when we move from v1.7.9 to v1.8.0. Following are the version summaries:

  • v1.7.0: works fine.
  • v1.7.5: works fine.
  • v1.7.9: works fine.
  • v1.8.0: crashes.
  • v1.8.5: crashes.
  • v1.9.0: crashes (tested on latest code base).

To Reproduce We are testing on an Amazon Linux 2 machine. Fluent Bit is present on that EC2, and sends logs to a Kinesis Firehose delivery stream in the same AWS account.

Fluent Bit installation: We cloned a specific branch from GitHub and build it following the README guideline. Then we ran it with our config file. We used the following commands:

  1. git clone https://github.com/fluent/fluent-bit.git
  2. cd fluent-bit/build
  3. cmake …
  4. make
  5. bin/fluent-bit -c /data/fluent-bit.conf

fluent-bit.conf:

[SERVICE]
    Flush           5
    Daemon          off
    Log_Level       trace
[INPUT]
    Name              tail
    Path              /data/flb-test/*.log
    refresh_interval  2
    rotate_wait       5
    db                /data/flb-test/fluentbit-logs.db
    db.sync           normal
    db.locking        true
    buffer_chunk_size 128KB
    buffer_max_size   50MB
    skip_long_lines   on
    mem_buf_limit     199800KB
[OUTPUT]
    name              kinesis_firehose
    match             *
    region            us-west-2
    delivery_stream   fluent-bit-perf-test
    workers           5
    Retry_Limit       20

Fluent Bit tail input read logs from a local file /data/flb-test/input.log. We have a data generator script, which writes 3000 log records per second in the input file. It runs for continuous 300 seconds. Each log record is ~1Kb. So, a total of (1 * 3000 * 300)=900000 log records and the throughput is 3Mb/sec.

generate.sh:

#!/bin/bash

LARGE_TEXT="_Katherine Stone sphillips@shields.net 760-16-6504 2006-05-09T07:04:56 1974-01-19T07:04:25 Griffinborough Clinical biochemist wuPmcdIIIlPKQddacCTDMHedOKrxhgUOTyVDUjmExZqqwRmGKSwakHiKDMTlQzSvmnNmSgsJkJmtpHDBkICOrKNiNGJYftCIgNuQopxZZMXxGGXLUyCNyuWhzCKUCuuXKhxotmyulQExiufWfjQdiDwbRDUctByhAZcJPrlbGlInbpYcwCAQeJJBOZOEnKlsAOqYtNAueXfAeXFzEtssxZUTVIFTjjlspeJiBggwYuAtwlXzSScLcQNkkFUtCpGZhdPVrpiyNlmdcKkqpIjQIVjRmnnKBvzOSSPvXHLzhOeRzApvmaJtJIkYYLMhftbLioTbnWpXGIzkMzWRVimRrCJkRaqtttLcOiaOekPQgYdrByRPIZMMwfTKdftfZPHKIHJeYryUljCVolxZFdYDihpRHHFJlEwvfViRouHYZPcUihkbnVSkQLGlGzLPHLodovHJjrVqifkdNssyspCGpGHFNSeLguqpWIxMWhJMLkLDizCtqAOzDccveDvowyLrRlEECVGjqrkFTIIIOntAEXgnheqVqLnJaFBBTWcdKdlhzeixnfRZgmorTXKxeHaDDGZAWPhIpiRArxXQkArJdBjCtGNuOoDdNBMgGTLKbbEnsXWUiuEZXELXpbOfJuBSnuaAtUOlTafuSgmoPgMUjBYeaIOplfSzqRNKLnZBzZrzoAmxnfvSosZbDefkeabOAvtrXNFHYdORBJyuzjtpURVmTRzMKjwJhImztWkGcFYEqRpEjfeWfLelRaUNTiLPINZhEfcZagnLPcZrjPAATOVLOiSpwooU F Engineer 3 105000 500 ";
UNIQUE_ID_COUNTER=1000000;
i=0;

while [ $i -lt 300 ]; # number of iteration
do
    j=0;
    while [ $j -lt 3000 ];
    do
        LOG_LINE="";
        LOG_LINE="$UNIQUE_ID_COUNTER$LARGE_TEXT";
        echo $LOG_LINE;
        let UNIQUE_ID_COUNTER=$UNIQUE_ID_COUNTER+1;
        let j=j+1;
    done
    let i=i+1;
    sleep 1;
done;

Then we run the following command to write data to our input.log file.

  • data/generate.sh > /data/flb-test/input.log

Expected behavior FluentBit should sent all data to Kinesis Firehose without any error/crash.

Your Environment

  • Version used: [v1.8.0, v1.8.5, v1.9.0]
  • Environment name and version (e.g. Kubernetes? What version?): Amazon EC2
  • Operating System and version: Amazon Linux 2
  • Filters and plugins: tail input plugin, kinesis_firehsoe output plugin.

Additional context Similar open issues: #3917

Error logs Error before crash from v1.8.5.

[2021/08/31 23:45:47] [debug] [retry] new retry created for task_id=96 attempts=1
[2021/08/31 23:45:47] [ warn] [engine] failed to flush chunk ‘31033-1630453495.461571278.flb’, retry in 9 seconds: task_id=96, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/08/31 23:45:47] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/08/31 23:45:47] [debug] [rayhan_debug] calling flb_tls_net_write_async
[2021/08/31 23:45:47] [debug] [rayhan_debug] flb_io_net_write error: body_len > 0
[2021/08/31 23:45:47] [error] [src/flb_http_client.c:1176 errno=32] Broken pipe
[2021/08/31 23:45:47] [debug] [rayhan_debug] response 
[2021/08/31 23:45:47] [debug] [aws_client] firehose.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2021/08/31 23:45:47] [trace] [upstream] destroy connection #-1 to firehose.us-west-2.amazonaws.com:443
[2021/08/31 23:45:47] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/08/31 23:45:47] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/08/31 23:45:47] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/08/31 23:45:47] [ warn] [net] getaddrinfo(host=‘firehose.us-west-2.amazonaws.com’, err=12): Timeout while contacting DNS servers
[2021/08/31 23:45:47] [debug] [upstream] connection #-1 failed to firehose.us-west-2.amazonaws.com:443
[2021/08/31 23:45:47] [trace] [upstream] destroy connection #-1 to firehose.us-west-2.amazonaws.com:443
[2021/08/31 23:45:47] [engine] caught signal (SIGSEGV)
#0 0x46c275      in ???() at ???:0
#1 0x46c2ac      in ???() at ???:0
#2 0x46ce63      in ???() at ???:0
#3 0x46cec5      in ???() at ???:0
#4 0x46d19b      in ???() at ???:0
#5 0x46d697      in ???() at ???:0
#6 0x55cec5      in ???() at ???:0
#7 0x55cad8      in ???() at ???:0
#8 0x51c4dc      in ???() at ???:0
#9 0x51aea9      in ???() at ???:0
#10 0x51b205      in ???() at ???:0
#11 0x51b572      in ???() at ???:0
#12 0x519bb1      in ???() at ???:0
#13 0x45873b      in ???() at ???:0
#14 0x7abfc6      in ???() at ???:0
Segmentation fault

Output with Valgrind for similar version v1.8.5.

[2021/09/01 00:17:49] [trace] [net] connection #88 in process to firehose.us-west-2.amazonaws.com:443
[2021/09/01 00:17:47] [trace] [task 0x5fb8cd0] created (id=64)
[2021/09/01 00:17:50] [trace] [net] connection #94 in process to firehose.us-west-2.amazonaws.com:443
[2021/09/01 00:17:47] [debug] [task] created task=0x5fb8cd0 id=64 OK
[2021/09/01 00:17:47] [debug] [output:kinesis_firehose:kinesis_firehose.0] task_id=64 assigned to thread #4
==31590== Thread 8 flb-out-kinesis_:
==31590== Jump to the invalid address stated on the next line
==31590==    at 0x0: ???
==31590==    by 0x46EA12: step_callback (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==31590==    by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==31590==    by 0x5AA50BE: clone (in /usr/lib64/libc-2.26.so)
==31590==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==31590== 
[2021/09/01 00:17:48] [debug] [input:tail:tail.0] scanning path /data/flb-test/*.log
[2021/09/01 00:17:53] [engine] caught signal (SIGSEGV)

Error log from v1.8.0.

Host: firehose.us-west-2.amazonaws.com
Content-Length: 652057
User-Agent: aws-fluent-bit-plugin
Content-Type: application/x-amz-json-1.1
X-Amz-Target: Firehose_20150804.PutRecordBatch
x-amz-date: 20210901T194003Z
Authorization: AWS4-HMAC-SHA256 Credential=AKIA3WKOYFPNQIOLLNUV/20210901/us-west-2/firehose/aws4_request, SignedHeaders=content-length;content-type;host;user-agent;x-amz-date;x-amz-target, Signature=19f897258d01353a4ece92d932366b9addb58a9edb52aeae56590fd01221c7dc


[2021/09/01 19:40:03] [trace] [io coro=0x7f42c8d7e9e0] [net_write] trying 503 bytes
[2021/09/01 19:40:03] [trace] [io coro=0x7f42c8d7e9e0] [net_write] ret=0 total=503/503
[2021/09/01 19:40:03] [trace] [io coro=0x7f42c8d7e9e0] [net_write] trying 652057 bytes
[2021/09/01 19:40:03] [trace] [engine] resuming coroutine=0x7f42c8973ef0
[2021/09/01 19:40:03] [trace] [engine] resuming coroutine=0x7f42c84d23c0
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
#0  0x466ac1            in  ???() at ???:0
#1  0x466af8            in  ???() at ???:0
#2  0x4675ff            in  ???() at ???:0
#3  0x467661            in  ???() at ???:0
#4  0x467937            in  ???() at ???:0
#5  0x467e33            in  ???() at ???:0
#6  0x54d5ee            in  ???() at ???:0
#7  0x54d201            in  ???() at ???:0
#8  0x512fed            in  ???() at ???:0
#9  0x5119ba            in  ???() at ???:0
#10 0x511d16            in  ???() at ???:0
#11 0x512083            in  ???() at ???:0
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY 
#12 0x510642            in  ???() at ???:0
#13 0x454288            in  ???() at ???:0
#14 0x785b26            in  ???() at ???:0

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (16 by maintainers)

Commits related to this issue

Most upvoted comments

Still seeing a lot of broken pipe and connection errors with the latest (v1.8.9) version.

[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:36] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:36] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:37] [ warn] [engine] failed to flush chunk '7061-1636497561.773063216.flb', retry in 7 seconds: task_id=30, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:38] [ warn] [engine] failed to flush chunk '7061-1636497561.809548140.flb', retry in 11 seconds: task_id=35, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:38] [ warn] [engine] failed to flush chunk '7061-1636497562.855397523.flb', retry in 11 seconds: task_id=39, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:38] [error] [upstream] connection #119 to firehose.us-west-2.amazonaws.com:443 timed out after 10 seconds
[2021/11/09 22:39:38] [error] [aws_client] connection initialization error
[2021/11/09 22:39:38] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:38] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:38] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:39] [ warn] [engine] failed to flush chunk '7061-1636497563.955488763.flb', retry in 6 seconds: task_id=54, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:40] [ warn] [engine] failed to flush chunk '7061-1636497564.990213728.flb', retry in 9 seconds: task_id=59, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:41] [ warn] [engine] failed to flush chunk '7061-1636497565.24677815.flb', retry in 10 seconds: task_id=64, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/11/09 22:39:42] [error] [upstream] connection #168 to firehose.us-west-2.amazonaws.com:443 timed out after 10 seconds
[2021/11/09 22:39:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/11/09 22:39:43] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/11/09 22:39:43] [error] [src/flb_http_client.c:1172 errno=32] Broken pipe