fluent-bit: invalid stream_id x, could not append content to multiline context

Bug Report

Describe the bug Since upgrading to 1.8.8 using the fluent/fluent-bit Helm chart, I see these errors in the logs of essentially every daemonset pod.

To Reproduce I’m not actually sure what’s causing this. Is there a way to make FluentBit show me some information about what’s causing this exactly? I would also be ok with a way to disable multiline completely while still parsing structured logs from our services. (We log single JSON lines for the most part, and multiline isn’t necessary.)
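
For reference, something like this is what I have in mind for a tail input with multiline disabled — a sketch only, assuming the single-line cri parser from the stock parsers.conf shipped with the image is loaded (docker for dockerd-based nodes):

[INPUT]
    Name             tail
    Path             /var/log/containers/*.log
    Parser           cri
    Tag              kube.*
    Mem_Buf_Limit    50MB
    Skip_Long_Lines  Off
    Skip_Empty_Lines On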

Expected behavior FluentBit doesn’t log tons of errors.

Screenshots: screenshot attached (Screenshot 2021-10-14 at 18 32 30)

Your Environment

  • Version used: 1.8.8
  • Configuration:
serviceMonitor:
  enabled: true

dashboards:
  enabled: true
  labelKey: grafana_dashboard
  annotations: {}

config:
  inputs: |
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        multiline.parser docker, cri
        Tag              kube.*
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  Off
        Skip_Empty_Lines On
    [INPUT]
        Name systemd
        Tag host.*
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Read_From_Tail On
  outputs: |
    [OUTPUT]
        Name                es
        Match               kube.*
        Host                x
        Port                443
        Logstash_Format     On
        Retry_Limit         False
        Time_Key            @timestamp
        Replace_Dots        On
        Logstash_Prefix     kubernetes_cluster
        Logstash_DateFormat %y.%m.%d.%H
        tls                 on
        tls.verify          on
        tls.debug           1
        Trace_Error         On
        Suppress_Type_Name  On
  • Environment name and version: Kubernetes v1.21.4

Additional context I’m trying to determine whether something on our end is emitting logs incorrectly or whether this is a problem in FluentBit. Thanks in advance!

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 16

Most upvoted comments

We are also seeing this issue on our production server after starting to use multiline.parser cri. We run fluent-bit 1.8.8 as a DaemonSet on Kubernetes 1.21.4, installed through the official Helm chart (0.19.1).

The errors always seem to start appearing after a log rotation. This is what a typical log looks like:

fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inode=2304023 handle rotation(): /var/log/containers/xyz-v2-67866f8657-7vhnw_prd_xyz-v2-e07b9f25a6fe6cfc23664406a102cf5f83994f1345a8a2d48c2e8976643a0cbf.log => /var/log/pods/prd_xyz-v2-67866f8657-7vhnw_df972d75-2a31-4979-9203-5ccbf3b1a3c1/xyz-v2/0.log.20211014-183627
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2304023 watch_fd=38
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2304023 watch_fd=42 name=/var/log/pods/prd_xyz-v2-67866f8657-7vhnw_df972d75-2a31-4979-9203-5ccbf3b1a3c1/xyz-v2/0.log.20211014-183627
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2304022 watch_fd=43 name=/var/log/containers/xyz-v2-67866f8657-7vhnw_prd_xyz-v2-e07b9f25a6fe6cfc23664406a102cf5f83994f1345a8a2d48c2e8976643a0cbf.log
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:42] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2304023 watch_fd=42
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:42] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
...

I set Rotate_Wait to 15 seconds, which is why the inotify_fs_remove() entry lags the log rotation entries by 15 seconds. Immediately after the inotify_fs_remove(), the errors start appearing.
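
For reference, the relevant tail settings (a minimal sketch; the other options from our config are omitted):

[INPUT]
    Name         tail
    Path         /var/log/containers/*.log
    Rotate_Wait  15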

The same pattern for another pod:

fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inode=2825607 handle rotation(): /var/log/containers/xyz-mail-7b84b65fc8-9fjc9_prd_xyz-mail-3b451c8d00cea53af69a49adef3eae1c20021672d386f845383514ffcf5e4063.log => /var/log/pods/prd_xyz-mail-7b84b65fc8-9fjc9_70036b92-7094-46e4-89d1-10a45406d146/xyz-mail/0.log.20211014-183744
fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2825607 watch_fd=16
fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2825607 watch_fd=22 name=/var/log/pods/prd_xyz-mail-7b84b65fc8-9fjc9_70036b92-7094-46e4-89d1-10a45406d146/xyz-mail/0.log.20211014-183744
fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2825637 watch_fd=23 name=/var/log/containers/xyz-mail-7b84b65fc8-9fjc9_prd_xyz-mail-3b451c8d00cea53af69a49adef3eae1c20021672d386f845383514ffcf5e4063.log
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:03] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2825607 watch_fd=22
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
...

Once the errors start, no more logs are processed for the pod whose log file was rotated until the fluent-bit daemon is restarted.

I found the root cause: the stream_id is calculated from the filename, so in_tail generates the same stream_id again after rotation. When in_tail later tries to delete the old stream_id (which equals the new one), it ends up destroying the new stream's instance.
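
To illustrate the collision, a minimal standalone sketch (not the actual Fluent Bit hashing code, and the stream name format used here is hypothetical): an id derived only from the file name comes out identical before and after rotation.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified 64-bit FNV-1a hash. Fluent Bit uses its own hashing, but the
 * point is the same: the id depends only on the bytes of the name. */
static uint64_t name_hash(const char *name)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    size_t i;

    for (i = 0; i < strlen(name); i++) {
        h ^= (unsigned char) name[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void)
{
    /* hypothetical stream name derived from the tail path/tag; the rotated
     * file re-appears under the same path, so the name is identical */
    const char *before_rotation = "tail.0_kube.var.log.containers.x.log";
    const char *after_rotation  = "tail.0_kube.var.log.containers.x.log";

    printf("old stream_id = %llu\n", (unsigned long long) name_hash(before_rotation));
    printf("new stream_id = %llu\n", (unsigned long long) name_hash(after_rotation));
    /* destroying the "old" id therefore also removes the stream the
     * re-created file is using, triggering "invalid stream_id" */
    return 0;
}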

I added the diff below to print the old/new stream_id.

diff --git a/plugins/in_tail/tail_file.c b/plugins/in_tail/tail_file.c
index 958bd58d..59433dea 100644
--- a/plugins/in_tail/tail_file.c
+++ b/plugins/in_tail/tail_file.c
@@ -894,6 +894,7 @@ int flb_tail_file_append(char *path, struct stat *st, int mode,
             goto error;
         }
         file->ml_stream_id = stream_id;
+        flb_error("new stream_id= %lu", stream_id);
     }
 
     /* Local buffer */
diff --git a/src/multiline/flb_ml_stream.c b/src/multiline/flb_ml_stream.c
index 9220e7b9..1e6b442a 100644
--- a/src/multiline/flb_ml_stream.c
+++ b/src/multiline/flb_ml_stream.c
@@ -289,7 +289,7 @@ void flb_ml_stream_id_destroy_all(struct flb_ml *ml, uint64_t stream_id)
     struct flb_ml_group *group;
     struct flb_ml_stream *mst;
     struct flb_ml_parser_ins *parser_i;
-
+    flb_error("%s stream_id=%lu", __FUNCTION__, stream_id);
     /* groups */
     mk_list_foreach(head, &ml->groups) {
         group = mk_list_entry(head, struct flb_ml_group, _head);

The resulting log shows that the old and new stream_ids are the same:

Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/17 10:48:07] [ info] [engine] started (pid=19267)
[2021/10/17 10:48:07] [ info] [storage] version=1.1.4, initializing...
[2021/10/17 10:48:07] [ info] [storage] root path '/tmp/fluent-bit-data/'
[2021/10/17 10:48:07] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/17 10:48:07] [ info] [storage] backlog input plugin: storage_backlog.1
[2021/10/17 10:48:07] [ info] [cmetrics] version=0.2.2
[2021/10/17 10:48:07] [ info] [input:tail:tail.0] multiline core started
[2021/10/17 10:48:07] [error] new stream_id= 11907403896869813768
[2021/10/17 10:48:07] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 488.3M
[2021/10/17 10:48:07] [ info] [sp] stream processor started
[2021/10/17 10:48:07] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1703983 watch_fd=1 name=/tmp/logs/x.log
[0] kube.tmp.logs.x.log: [1634435253.383217136, {"time"=>"2021-10-17T01:47:33.383217136Z", "stream"=>"stdout", "_p"=>"F", "log"=>"testing..."}]
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inode=1703983 handle rotation(): /tmp/logs/x.log => /tmp/logs/x.log.1
[2021/10/17 10:48:12] [error] new stream_id= 11907403896869813768
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1703983 watch_fd=1
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1703983 watch_fd=2 name=/tmp/logs/x.log.1
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1703973 watch_fd=3 name=/tmp/logs/x.log
[2021/10/17 10:48:21] [error] flb_ml_stream_id_destroy_all stream_id=11907403896869813768
[2021/10/17 10:48:21] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1703983 watch_fd=2
^C[2021/10/17 10:48:28] [engine] caught signal (SIGINT)
[2021/10/17 10:48:28] [ info] [input] pausing tail.0
[2021/10/17 10:48:28] [ info] [input] pausing storage_backlog.1
[2021/10/17 10:48:28] [ warn] [engine] service will stop in 5 seconds
[2021/10/17 10:48:32] [ info] [engine] service stopped
[2021/10/17 10:48:32] [error] flb_ml_stream_id_destroy_all stream_id=11907403896869813768
[2021/10/17 10:48:32] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1703973 watch_fd=3
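
As a hedged sketch of one possible direction (purely illustrative, not necessarily the actual patch): make the stream name unique per file, e.g. by mixing the inode into it before hashing, so the re-created file ends up with a different stream_id.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not Fluent Bit API: build a per-file stream name from
 * path + inode so a rotated file (same path, new inode) hashes differently. */
static void build_stream_name(char *out, size_t size,
                              const char *path, uint64_t inode)
{
    snprintf(out, size, "%s:%llu", path, (unsigned long long) inode);
}

int main(void)
{
    char old_name[256];
    char new_name[256];

    /* inode numbers taken from the inotify_fs_add() lines in the log above */
    build_stream_name(old_name, sizeof(old_name), "/tmp/logs/x.log", 1703983);
    build_stream_name(new_name, sizeof(new_name), "/tmp/logs/x.log", 1703973);

    /* different names => different 64-bit ids => destroying the old stream
     * no longer touches the new one */
    printf("old: %s\nnew: %s\n", old_name, new_name);
    return 0;
}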

Confirming this looks to be fixed, thanks!

I am able to reproduce the issue (both with the 1.8.8 build and on master) using the following config.

Configuration

~/fluent-bit.conf:

[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    info
    HTTP_Monitor Off
    HTTP_Port    2020
    storage.path              /tmp/fluent-bit-data/
    storage.sync              normal
    storage.checksum          off
    storage.max_chunks_up     128
    storage.backlog.mem_limit 512M
    storage.metrics           on

[INPUT]
    Name              tail
    Path              /tmp/logs/*.log
    multiline.parser  cri
    Tag               kube.*
    Refresh_Interval  1
    Mem_Buf_Limit     50MB
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   2MB
    Skip_Empty_Lines  On
    Skip_Long_Lines   On
    Rotate_Wait       5
    DB                /tmp/tail-containers-state.db
    DB.Sync           Normal
    storage.type      filesystem

[OUTPUT]
    Name stdout
    Match *

~/fluent-bit.logrotate:

/tmp/logs/*.log {
	monthly
	rotate 12
	compress
	delaycompress
	missingok
	notifempty
	create 644 root root
}

The logrotate config needs to create a new file (new inode) on rotation to match kubelet's log rotation behaviour, so copytruncate must not be used.
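
For reference, a manual rotation that mimics the same move-and-create behaviour without logrotate (a sketch, using the paths from the repro below):

# move the current file aside; the old inode keeps its data
mv /tmp/logs/x.log /tmp/logs/x.log.1
# create a fresh file, which gets a new inode
touch /tmp/logs/x.log
# subsequent writes go to the new inode
echo "$(date -u +'%Y-%m-%dT%H:%M:%S.%NZ') stdout F testing..." >> /tmp/logs/x.log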

Steps to reproduce the error:

  1. Create the initial log file:
$ echo "$(date -u +'%Y-%m-%dT%H:%M:%S.%NZ') stdout F testing..." >> /tmp/logs/x.log
  2. Start fluent-bit:
$ bin/fluent-bit -c ~/fluent-bit.conf
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/15 13:29:55] [ info] [engine] started (pid=127012)
[2021/10/15 13:29:55] [ info] [storage] version=1.1.4, initializing...
[2021/10/15 13:29:55] [ info] [storage] root path '/tmp/fluent-bit-data/'
[2021/10/15 13:29:55] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/15 13:29:55] [ info] [storage] backlog input plugin: storage_backlog.1
[2021/10/15 13:29:55] [ info] [cmetrics] version=0.2.2
[2021/10/15 13:29:55] [ info] [input:tail:tail.0] multiline core started
[2021/10/15 13:29:55] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 488.3M
[2021/10/15 13:29:55] [ info] [sp] stream processor started
[2021/10/15 13:29:55] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521755 watch_fd=1 name=/tmp/logs/x.log
  3. Force a log rotation:
$ logrotate --force ~/fluent-bit.logrotate
  4. fluent-bit will output:
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inode=521755 handle rotation(): /tmp/logs/x.log => /tmp/logs/x.log.1
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=1
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521755 watch_fd=2 name=/tmp/logs/x.log.1
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521756 watch_fd=3 name=/tmp/logs/x.log
  5. After Rotate_Wait seconds it shows:
[2021/10/15 13:31:40] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=2
  6. Try to add another entry to the log:
$ echo "$(date -u +'%Y-%m-%dT%H:%M:%S.%NZ') stdout F testing..." >> /tmp/logs/x.log
  7. The fluent-bit logs will now show the error:
[2021/10/15 13:32:53] [error] [multiline] invalid stream_id 11907403896869813768, could not append content to multiline context

Notes

I added some log statements to the flb_ml_stream_get method mentioned by @RalfWenzel and to the flb_ml_stream_id_destroy_all method. This is what I found:

  • Adding an entry to the log file shows this output in fluent-bit:
streams empty? false
11907403896869813768: (null)
[0] kube.tmp.logs.x.log: [1634306811.731850320, {"time"=>"2021-10-15T14:06:51.731850320Z", "stream"=>"stdout", "_p"=>"F", "log"=>"testing..."}]
  • Forcing a log rotation shows:
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inode=521755 handle rotation(): /tmp/logs/x.log => /tmp/logs/x.log.1
streams empty? false
11907403896869813768: (null)
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=1
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521755 watch_fd=2 name=/tmp/logs/x.log.1
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521756 watch_fd=3 name=/tmp/logs/x.log
  • After Rotate_Wait seconds it shows:
Destroying stream 11907403896869813768: (null)
[2021/10/15 14:07:08] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=2
  • Trying to add another entry to the log file after the rotation shows:
streams empty? true
[2021/10/15 14:07:33] [error] [multiline] invalid stream_id 11907403896869813768, could not append content to multiline context

The changes to the two methods are as follows (only the fprintf statements were added):

struct flb_ml_stream *flb_ml_stream_get(struct flb_ml_parser_ins *parser,
                                        uint64_t stream_id)
{
    struct mk_list *head;
    struct flb_ml_stream *mst = NULL;

    /* note: mk_list_is_empty() returns 0 when the list is empty */
    fprintf(stdout, "streams empty? %s\n", ((mk_list_is_empty(&parser->streams) == 0) ? "true" : "false"));
    mk_list_foreach(head, &parser->streams) {
        mst = mk_list_entry(head, struct flb_ml_stream, _head);
        fprintf(stdout, "%lu: %s\n", mst->id, mst->name);
        if (mst->id == stream_id) {
            return mst;
        }
    }

    return NULL;
}

void flb_ml_stream_id_destroy_all(struct flb_ml *ml, uint64_t stream_id)
{
    struct mk_list *tmp;
    struct mk_list *head;
    struct mk_list *head_group;
    struct mk_list *head_stream;
    struct flb_ml_group *group;
    struct flb_ml_stream *mst;
    struct flb_ml_parser_ins *parser_i;

    /* groups */
    mk_list_foreach(head, &ml->groups) {
        group = mk_list_entry(head, struct flb_ml_group, _head);

        /* parser instances */
        mk_list_foreach(head_group, &group->parsers) {
            parser_i = mk_list_entry(head_group, struct flb_ml_parser_ins, _head);

            /* streams */
            mk_list_foreach_safe(head_stream, tmp, &parser_i->streams) {
                mst = mk_list_entry(head_stream, struct flb_ml_stream, _head);
                if (mst->id != stream_id) {
                    continue;
                }
                fprintf(stdout, "Destroying stream %lu: %s\n", mst->id, mst->name);

                /* flush any pending data */
                flb_ml_flush_parser_instance(ml, parser_i, stream_id);

                /* destroy internal groups of the stream */
                flb_ml_stream_destroy(mst);
            }
        }
    }
}