fluent-bit: FluentBit v1.7.9 to v1.8.0 jump introduces hangs and crashes for high throughput logs
Bug Report
Describe the bug
We are doing a performance test to send logs to Amazon Kinesis Firehose using the core kinesis_firehose
plugin. We are sending logs at a rate of 3Mb/sec for 300 seconds where each record is ~1Kb. FluentBit versions v1.8.0 and higher seem to get stuck/crash. However older versions work fine and can send all the logs to Kinesis Firehose without any issue. We started our testing from v1.7.0 and gradually moved to v1.9.0. The crash started showing up when we move from v1.7.9 to v1.8.0. Following are the version summaries:
v1.7.0:
works fine.v1.7.5:
works fine.v1.7.9:
works fine.v1.8.0:
crashes.v1.8.5:
crashes.v1.9.0:
crashes (tested on latest code base).
To Reproduce We are testing on an Amazon Linux 2 machine. Fluent Bit is present on that EC2, and sends logs to a Kinesis Firehose delivery stream in the same AWS account.
Fluent Bit installation: We cloned a specific branch from GitHub and build it following the README guideline. Then we ran it with our config file. We used the following commands:
- git clone https://github.com/fluent/fluent-bit.git
- cd fluent-bit/build
- cmake …
- make
- bin/fluent-bit -c /data/fluent-bit.conf
fluent-bit.conf:
[SERVICE]
Flush 5
Daemon off
Log_Level trace
[INPUT]
Name tail
Path /data/flb-test/*.log
refresh_interval 2
rotate_wait 5
db /data/flb-test/fluentbit-logs.db
db.sync normal
db.locking true
buffer_chunk_size 128KB
buffer_max_size 50MB
skip_long_lines on
mem_buf_limit 199800KB
[OUTPUT]
name kinesis_firehose
match *
region us-west-2
delivery_stream fluent-bit-perf-test
workers 5
Retry_Limit 20
Fluent Bit tail input read logs from a local file /data/flb-test/input.log
. We have a data generator script, which writes 3000 log records per second in the input file. It runs for continuous 300 seconds. Each log record is ~1Kb. So, a total of (1 * 3000 * 300)=900000 log records and the throughput is 3Mb/sec.
generate.sh:
#!/bin/bash
LARGE_TEXT="_Katherine Stone sphillips@shields.net 760-16-6504 2006-05-09T07:04:56 1974-01-19T07:04:25 Griffinborough Clinical biochemist wuPmcdIIIlPKQddacCTDMHedOKrxhgUOTyVDUjmExZqqwRmGKSwakHiKDMTlQzSvmnNmSgsJkJmtpHDBkICOrKNiNGJYftCIgNuQopxZZMXxGGXLUyCNyuWhzCKUCuuXKhxotmyulQExiufWfjQdiDwbRDUctByhAZcJPrlbGlInbpYcwCAQeJJBOZOEnKlsAOqYtNAueXfAeXFzEtssxZUTVIFTjjlspeJiBggwYuAtwlXzSScLcQNkkFUtCpGZhdPVrpiyNlmdcKkqpIjQIVjRmnnKBvzOSSPvXHLzhOeRzApvmaJtJIkYYLMhftbLioTbnWpXGIzkMzWRVimRrCJkRaqtttLcOiaOekPQgYdrByRPIZMMwfTKdftfZPHKIHJeYryUljCVolxZFdYDihpRHHFJlEwvfViRouHYZPcUihkbnVSkQLGlGzLPHLodovHJjrVqifkdNssyspCGpGHFNSeLguqpWIxMWhJMLkLDizCtqAOzDccveDvowyLrRlEECVGjqrkFTIIIOntAEXgnheqVqLnJaFBBTWcdKdlhzeixnfRZgmorTXKxeHaDDGZAWPhIpiRArxXQkArJdBjCtGNuOoDdNBMgGTLKbbEnsXWUiuEZXELXpbOfJuBSnuaAtUOlTafuSgmoPgMUjBYeaIOplfSzqRNKLnZBzZrzoAmxnfvSosZbDefkeabOAvtrXNFHYdORBJyuzjtpURVmTRzMKjwJhImztWkGcFYEqRpEjfeWfLelRaUNTiLPINZhEfcZagnLPcZrjPAATOVLOiSpwooU F Engineer 3 105000 500 ";
UNIQUE_ID_COUNTER=1000000;
i=0;
while [ $i -lt 300 ]; # number of iteration
do
j=0;
while [ $j -lt 3000 ];
do
LOG_LINE="";
LOG_LINE="$UNIQUE_ID_COUNTER$LARGE_TEXT";
echo $LOG_LINE;
let UNIQUE_ID_COUNTER=$UNIQUE_ID_COUNTER+1;
let j=j+1;
done
let i=i+1;
sleep 1;
done;
Then we run the following command to write data to our input.log
file.
data/generate.sh > /data/flb-test/input.log
Expected behavior FluentBit should sent all data to Kinesis Firehose without any error/crash.
Your Environment
- Version used: [v1.8.0, v1.8.5, v1.9.0]
- Environment name and version (e.g. Kubernetes? What version?): Amazon EC2
- Operating System and version: Amazon Linux 2
- Filters and plugins: tail input plugin, kinesis_firehsoe output plugin.
Additional context Similar open issues: #3917
Error logs Error before crash from v1.8.5.
[2021/08/31 23:45:47] [debug] [retry] new retry created for task_id=96 attempts=1
[2021/08/31 23:45:47] [ warn] [engine] failed to flush chunk ‘31033-1630453495.461571278.flb’, retry in 9 seconds: task_id=96, input=tail.0 > output=kinesis_firehose.0 (out_id=0)
[2021/08/31 23:45:47] [error] [tls] SSL error: NET - Connection was reset by peer
[2021/08/31 23:45:47] [debug] [rayhan_debug] calling flb_tls_net_write_async
[2021/08/31 23:45:47] [debug] [rayhan_debug] flb_io_net_write error: body_len > 0
[2021/08/31 23:45:47] [error] [src/flb_http_client.c:1176 errno=32] Broken pipe
[2021/08/31 23:45:47] [debug] [rayhan_debug] response
[2021/08/31 23:45:47] [debug] [aws_client] firehose.us-west-2.amazonaws.com: http_do=-1, HTTP Status: 0
[2021/08/31 23:45:47] [trace] [upstream] destroy connection #-1 to firehose.us-west-2.amazonaws.com:443
[2021/08/31 23:45:47] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records to fluent-bit-perf-test
[2021/08/31 23:45:47] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send log records
[2021/08/31 23:45:47] [error] [output:kinesis_firehose:kinesis_firehose.0] Failed to send records
[2021/08/31 23:45:47] [ warn] [net] getaddrinfo(host=‘firehose.us-west-2.amazonaws.com’, err=12): Timeout while contacting DNS servers
[2021/08/31 23:45:47] [debug] [upstream] connection #-1 failed to firehose.us-west-2.amazonaws.com:443
[2021/08/31 23:45:47] [trace] [upstream] destroy connection #-1 to firehose.us-west-2.amazonaws.com:443
[2021/08/31 23:45:47] [engine] caught signal (SIGSEGV)
#0 0x46c275 in ???() at ???:0
#1 0x46c2ac in ???() at ???:0
#2 0x46ce63 in ???() at ???:0
#3 0x46cec5 in ???() at ???:0
#4 0x46d19b in ???() at ???:0
#5 0x46d697 in ???() at ???:0
#6 0x55cec5 in ???() at ???:0
#7 0x55cad8 in ???() at ???:0
#8 0x51c4dc in ???() at ???:0
#9 0x51aea9 in ???() at ???:0
#10 0x51b205 in ???() at ???:0
#11 0x51b572 in ???() at ???:0
#12 0x519bb1 in ???() at ???:0
#13 0x45873b in ???() at ???:0
#14 0x7abfc6 in ???() at ???:0
Segmentation fault
Output with Valgrind for similar version v1.8.5.
[2021/09/01 00:17:49] [trace] [net] connection #88 in process to firehose.us-west-2.amazonaws.com:443
[2021/09/01 00:17:47] [trace] [task 0x5fb8cd0] created (id=64)
[2021/09/01 00:17:50] [trace] [net] connection #94 in process to firehose.us-west-2.amazonaws.com:443
[2021/09/01 00:17:47] [debug] [task] created task=0x5fb8cd0 id=64 OK
[2021/09/01 00:17:47] [debug] [output:kinesis_firehose:kinesis_firehose.0] task_id=64 assigned to thread #4
==31590== Thread 8 flb-out-kinesis_:
==31590== Jump to the invalid address stated on the next line
==31590== at 0x0: ???
==31590== by 0x46EA12: step_callback (in /home/ec2-user/fluent-bit/build/bin/fluent-bit)
==31590== by 0x4E3F40A: start_thread (in /usr/lib64/libpthread-2.26.so)
==31590== by 0x5AA50BE: clone (in /usr/lib64/libc-2.26.so)
==31590== Address 0x0 is not stack'd, malloc'd or (recently) free'd
==31590==
[2021/09/01 00:17:48] [debug] [input:tail:tail.0] scanning path /data/flb-test/*.log
[2021/09/01 00:17:53] [engine] caught signal (SIGSEGV)
Error log from v1.8.0.
Host: firehose.us-west-2.amazonaws.com
Content-Length: 652057
User-Agent: aws-fluent-bit-plugin
Content-Type: application/x-amz-json-1.1
X-Amz-Target: Firehose_20150804.PutRecordBatch
x-amz-date: 20210901T194003Z
Authorization: AWS4-HMAC-SHA256 Credential=AKIA3WKOYFPNQIOLLNUV/20210901/us-west-2/firehose/aws4_request, SignedHeaders=content-length;content-type;host;user-agent;x-amz-date;x-amz-target, Signature=19f897258d01353a4ece92d932366b9addb58a9edb52aeae56590fd01221c7dc
[2021/09/01 19:40:03] [trace] [io coro=0x7f42c8d7e9e0] [net_write] trying 503 bytes
[2021/09/01 19:40:03] [trace] [io coro=0x7f42c8d7e9e0] [net_write] ret=0 total=503/503
[2021/09/01 19:40:03] [trace] [io coro=0x7f42c8d7e9e0] [net_write] trying 652057 bytes
[2021/09/01 19:40:03] [trace] [engine] resuming coroutine=0x7f42c8973ef0
[2021/09/01 19:40:03] [trace] [engine] resuming coroutine=0x7f42c84d23c0
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
#0 0x466ac1 in ???() at ???:0
#1 0x466af8 in ???() at ???:0
#2 0x4675ff in ???() at ???:0
#3 0x467661 in ???() at ???:0
#4 0x467937 in ???() at ???:0
#5 0x467e33 in ???() at ???:0
#6 0x54d5ee in ???() at ???:0
#7 0x54d201 in ???() at ???:0
#8 0x512fed in ???() at ???:0
#9 0x5119ba in ???() at ???:0
#10 0x511d16 in ???() at ???:0
#11 0x512083 in ???() at ???:0
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
[2021/09/01 19:40:03] [debug] [input:tail:tail.0] inode=103 events: IN_MODIFY
#12 0x510642 in ???() at ???:0
#13 0x454288 in ???() at ???:0
#14 0x785b26 in ???() at ???:0
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 18 (16 by maintainers)
Commits related to this issue
- net: Fix handling of upstream connection timeout events. #4040 Signed-off-by: Ramya <89954993+krispraws@users.noreply.github.com> — committed to krispraws/fluent-bit by krispraws 3 years ago
- net: Fix handling of upstream connection timeout events. #4040 Signed-off-by: Ramya <89954993+krispraws@users.noreply.github.com> — committed to krispraws/fluent-bit by krispraws 3 years ago
Still seeing a lot of
broken pipe
andconnection
errors with the latest (v1.8.9) version.