fluent-bit: Unrecoverable error "caught signal (SIGSEGV)" in the forward output

Bug Report

Describe the bug

I’m seeing this issue with the forward output plugin, and restarting Fluent Bit does not fix it. To mitigate it, I have to temporarily change the output to null and then revert it. I was using v1.7.9 and updated the image to v1.8.3 on the fly, and still saw the issue.
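
For reference, a minimal sketch of that temporary mitigation. It assumes the same Match pattern as the forward output in the configuration further down; the exact section used at the time is not recorded in this report. null is Fluent Bit's built-in output plugin that discards the records it receives.

```text
[OUTPUT]
    # Assumption: same Match as the forward output this temporarily replaces
    Name   null
    Match  kubernetes.*
```

Records matched while this output is active are dropped, so this is only a stopgap until the forward output is restored.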

To Reproduce

Example log message if applicable:

[2021/08/10 18:36:01] [error] [upstream] connection #-1 to fluentd.pipeline:24224 timed out after 10 seconds
[2021/08/10 18:36:01] [error] [upstream] connection #-1 to fluentd.pipeline:24224 timed out after 10 seconds
[2021/08/10 18:36:01] [engine] caught signal (SIGSEGV)
#0  0x55dd33597564      in  mk_event_add() at lib/monkey/mk_core/mk_event.c:96
#1  0x55dd330b6f22      in  net_connect_async() at src/flb_network.c:369
#2  0x55dd330b7bf2      in  flb_net_tcp_connect() at src/flb_network.c:832
#3  0x55dd330dd254      in  flb_io_net_connect() at src/flb_io.c:89
#4  0x55dd330c2eb1      in  create_conn() at src/flb_upstream.c:497
#5  0x55dd330c337b      in  flb_upstream_conn_get() at src/flb_upstream.c:640
#6  0x55dd3313e726      in  cb_forward_flush() at plugins/out_forward/forward.c:1183
#7  0x55dd330ad0de      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:490
#8  0x55dd335999a6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#9  0x7fcce18671f5      in  ???() at ???:0

Steps to reproduce the problem: Not sure how to reproduce this, but I have seen it a few times.

Expected behavior

Fluent Bit should be able to recover gracefully.

Screenshots

Your Environment

Version used: v1.7.9/v1.8.3
Configuration:

[SERVICE]
    Flush                     1
    Log_Level                 info
    Parsers_File              /fluent-bit/etc/parsers.conf
    Parsers_File              /forwarder/etc/parsers_custom.conf
    Plugins_File              /fluent-bit/etc/plugins.conf
    HTTP_Server               On
    storage.path              /var/log/flb-storage/
    storage.max_chunks_up     128
    storage.backlog.mem_limit 256M
    storage.metrics           on
[INPUT]
    Name              tail
    Tag               kubernetes.*
    Path              /var/log/containers/*.log
    Parser            cri
    DB                /var/log/flb-tail.db
    DB.sync           normal
    Refresh_Interval  15
    Read_from_Head    On
    Buffer_Chunk_Size 128K
    Buffer_Max_Size   128K
    Skip_Long_Lines   On
    Mem_Buf_Limit     256M
    storage.type      filesystem
[FILTER]
    Name                kubernetes
    Match               kubernetes.var.log.containers.*
    Kube_Tag_Prefix     kubernetes.var.log.containers.
    Annotations         Off
    K8S-Logging.Exclude On
[OUTPUT]
    Name                       forward
    Match                      kubernetes.*
    Host                       aggregator
    Port                       24224
    Retry_Limit                False
    Require_ack_response       True
    storage.total_limit_size   16G
    net.keepalive              on
    net.keepalive_max_recycle  300

Environment name and version (e.g. Kubernetes? What version?): Kubernetes 1.19.x
Server type and version:
Operating System and version:
Filters and plugins: tail, kubernetes, forward

Additional context

According to @edsiper, the Fluent Bit team is triaging a similar issue (Slack thread).

About this issue

Original URL: https://github.com/fluent/fluent-bit/issues/3940
State: closed
Created 3 years ago
Comments: 18 (8 by maintainers)

Most upvoted comments

If it does reproduce, would you be able to capture the chunk file?

On Mon, Feb 28 2022 at 6:36 PM, panaji wrote:

I still see this in 1.8.7 … we only recently deployed 1.8.12, but it has not been running long enough to know if the issue still exists.


agup006 on Mar 1, 2022

@senior88oqz, I tried with 1.7.9 and 1.8.3 (current latest) and both have the same issue … so I think anything in between would be the same.

panaji on Aug 13, 2021