fluent-bit: Trace information is scarce. Unable to troubleshoot output issues.

Bug Report

Describe the bug

My Fluent Bit (td-agent-bit) instance fails to flush chunks: [engine] failed to flush chunk '3743-1581410162.822679017.flb', retry in 617 seconds: task_id=56, input=systemd.1 > output=es.0. This is the only log entry that shows up. Trace logging is enabled, but there is no log entry that helps me investigate further.

To Reproduce

  • Example log message if applicable:
[2020/02/11 10:21:23] [trace] [thread 0x7f3b1c59bf50] created (custom data at 0x7f3b1c59bf78, size=64
[2020/02/11 10:21:23] [trace] [thread 0x7f3b1c59bee0] created (custom data at 0x7f3b1c59bf08, size=64
[2020/02/11 10:21:28] [trace] [engine] [task event] task_id=56 thread_id=3 return=RETRY
[2020/02/11 10:21:28] [debug] [retry] re-using retry for task_id=56 attemps=4
[2020/02/11 10:21:28] [trace] [thread] destroy thread=0x7f3b4cc7bad0 data=0x7f3b4cc7baf8
[2020/02/11 10:21:28] [ warn] [engine] failed to flush chunk '3743-1581410162.822679017.flb', retry in 617 seconds: task_id=56, input=systemd.1 > output=es.0
[2020/02/11 10:21:28] [trace] [task 0x7f3b1fdfe8a0] created (id=485)
[2020/02/11 10:21:28] [debug] [task] created task=0x7f3b1fdfe8a0 id=485 OK
[2020/02/11 10:21:28] [trace] [thread 0x7f3b1c59be70] created (custom data at 0x7f3b1c59be98, size=64
[2020/02/11 10:21:33] [trace] [thread 0x7f3b1c59bd90] created (custom data at 0x7f3b1c59bdb8, size=64

Expected behavior

An actual message containing information about what goes wrong.

Your Environment

  • Version used: 1.3.7
  • Configuration:
[SERVICE]
    Flush        5
    Daemon       Off
    Log_Level    trace
    Parsers_File parsers.conf
    Plugins_File plugins.conf
    HTTP_Server  Off
    HTTP_Listen  0.0.0.0
    HTTP_Port    2020

[INPUT]
    Name tail
    Tag nginx
    Interval_Sec 1
    Path    /var/log/nginx/*

[INPUT]
    Name systemd
    Tag journald
    Interval_Sec 1

[OUTPUT]
    Name es
    Match *
    Host es1.example.com
    Port 9200
    Index fluentbit
    Retry_Limit False
    Type  _doc
  • Operating System and version: Ubuntu 18.04 fully up to date, td-agent-bit latest version
  • Filters and plugins: Default package, no additional plugins.

Additional context

I am trying to have Fluent Bit process and ship logs to my (IPv6-only) Elasticsearch cluster. The firewall is not a problem here: Filebeat on the same VM is able to connect to my Elasticsearch ingest node over the same port. Tracking connections on es1.example.com shows that there are no incoming connections from my VM on port 9200.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 24 (1 by maintainers)

Most upvoted comments

In my case, this was happening because I had some fields in my logs with the same name but of different types, and Elasticsearch was rejecting them. E.g.

{"ids": [123, 456]}
{"ids": [{"foo": 123, "bar": 456}]}

edit: Adding Trace_Error On in my Elasticsearch output helped me determine this.

[OUTPUT]
    Name es
    Match *
    Host es1.example.com
    Trace_Error On

I’m seeing the same issue with Fluent Bit v1.4.2.

For us there is no LB between Fluent Bit and ES. The debug/trace output even prints HTTP 200 OK, but I still get the warning message, and I also don’t see the logs in ES.

@krucelee in my case the debug log showed that the Elasticsearch output was not able to send big POST requests. When we fixed that, everything worked like a charm… So my issue might or might not be related to this one, but the problems visible in the log look exactly the same… hope that helps…

In my case the problem became visible after enabling the debug log:

 [debug] [out_es] HTTP Status=413 URI=/_bulk
 [debug] [retry] new retry created for task_id=2 attemps=1
 [ warn] [engine] failed to flush chunk '95175-1583245223.91940987.flb', retry in 8 seconds: task_id=2, input=tail.1 > output=es.1

So we needed to change the allowed body size for those Elasticsearch POSTs, and now it’s working fine…

Hi, I had the same issue. I updated to 1.5.7, but no change: after 30 minutes, no data shows up in ES. But if I remove the kubernetes filter it’s OK. Any ideas? ty

Same issue here. Enabling debug shows “[error] [io] TCP connection failed: 10.104.11.198:9200 (Connection timed out)”, but ES is reachable with curl.

@Pointer666 Elasticsearch itself accepts pretty big documents, so no. We had a load balancer in the path that accepted only 1 MB of data (maxBodySize).
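
For reference, here is a minimal sketch of the kind of change described above, assuming the load balancer in front of Elasticsearch is an nginx reverse proxy (the actual product is not named in this thread); the backend address and the 50m value are placeholders, and the equivalent setting on other proxies or ingress controllers will differ:

# Hypothetical nginx reverse proxy / load balancer in front of Elasticsearch.
# nginx's default client_max_body_size is 1m, so larger _bulk requests get
# rejected with HTTP 413, matching the debug output shown above.
upstream elasticsearch {
    server 10.0.0.10:9200;          # placeholder backend node
}

server {
    listen 9200;

    location / {
        client_max_body_size 50m;   # raise the allowed request body size
        proxy_pass http://elasticsearch;
    }
}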