fluent-bit: Unrecoverable error "caught signal (SIGSEGV)" in the forward output
Bug Report
Describe the bug: I’m seeing this issue with the forward output plugin, and restarting Fluent Bit won’t fix it. To mitigate, I have to temporarily change the output to null and then revert it (see the sketch below). I was using v1.7.9 and updated the image to v1.8.3 on the fly, and still saw the issue.
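For reference, the temporary mitigation is simply swapping the forward output for the built-in null output, then reverting once the connection to the aggregator is healthy again. A minimal sketch, assuming the same Match pattern as the forward output in the configuration further below:

[OUTPUT]
    # hypothetical temporary stand-in that discards records while the forward output is wedged
    Name   null
    Match  kubernetes.*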
To Reproduce
- Example log message if applicable:
[2021/08/10 18:36:01] [error] [upstream] connection #-1 to fluentd.pipeline:24224 timed out after 10 seconds
[2021/08/10 18:36:01] [error] [upstream] connection #-1 to fluentd.pipeline:24224 timed out after 10 seconds
[2021/08/10 18:36:01] [engine] caught signal (SIGSEGV)
#0 0x55dd33597564 in mk_event_add() at lib/monkey/mk_core/mk_event.c:96
#1 0x55dd330b6f22 in net_connect_async() at src/flb_network.c:369
#2 0x55dd330b7bf2 in flb_net_tcp_connect() at src/flb_network.c:832
#3 0x55dd330dd254 in flb_io_net_connect() at src/flb_io.c:89
#4 0x55dd330c2eb1 in create_conn() at src/flb_upstream.c:497
#5 0x55dd330c337b in flb_upstream_conn_get() at src/flb_upstream.c:640
#6 0x55dd3313e726 in cb_forward_flush() at plugins/out_forward/forward.c:1183
#7 0x55dd330ad0de in output_pre_cb_flush() at include/fluent-bit/flb_output.h:490
#8 0x55dd335999a6 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9 0x7fcce18671f5 in ???() at ???:0
- Steps to reproduce the problem: Not sure how to reproduce this, but I have seen it a few times.
Expected behavior Fluent Bit should be able to recover gracefully.
Screenshots
Your Environment
- Version used: v1.7.9/v1.8.3
- Configuration:
[SERVICE]
    Flush 1
    Log_Level info
    Parsers_File /fluent-bit/etc/parsers.conf
    Parsers_File /forwarder/etc/parsers_custom.conf
    Plugins_File /fluent-bit/etc/plugins.conf
    HTTP_Server On
    storage.path /var/log/flb-storage/
    storage.max_chunks_up 128
    storage.backlog.mem_limit 256M
    storage.metrics on
[INPUT]
    Name tail
    Tag kubernetes.*
    Path /var/log/containers/*.log
    Parser cri
    DB /var/log/flb-tail.db
    DB.sync normal
    Refresh_Interval 15
    Read_from_Head On
    Buffer_Chunk_Size 128K
    Buffer_Max_Size 128K
    Skip_Long_Lines On
    Mem_Buf_Limit 256M
    storage.type filesystem
[FILTER]
    Name kubernetes
    Match kubernetes.var.log.containers.*
    Kube_Tag_Prefix kubernetes.var.log.containers.
    Annotations Off
    K8S-Logging.Exclude On
[OUTPUT]
    Name forward
    Match kubernetes.*
    Host aggregator
    Port 24224
    Retry_Limit False
    Require_ack_response True
    storage.total_limit_size 16G
    net.keepalive on
    net.keepalive_max_recycle 300
- Environment name and version (e.g. Kubernetes? What version?): 1.19.x
- Server type and version:
- Operating System and version:
- Filters and plugins: tail, kubernetes, forward
Additional context
From @edsiper: the fluent-bit team is triaging a similar issue (Slack thread).
If it does repro, would you be able to capture the chunk file?
On Mon, Feb 28, 2022 at 6:36 PM, panaji wrote:
@senior88oqz, I tried with 1.7.9 and 1.8.3 (current latest) and both have the same issue, so I think anything in between would be the same.