fluent-bit: in_tail plugin does not pick up rotated logs under heavy load

Bug Report

Describe the bug

tail_fs_event receives IN_Q_OVERFLOW inotify events from time to time and therefore misses IN_MOVE_SELF events.
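For illustration, here is a minimal, generic sketch (not fluent-bit source) of how IN_Q_OVERFLOW shows up when draining an inotify file descriptor. Once the kernel queue is full, pending events such as the IN_MOVE_SELF generated by a rename are silently dropped and replaced by a single overflow event:

/* Minimal illustration, not fluent-bit code: reading inotify events
 * and detecting queue overflow. When the kernel event queue is full,
 * lost events (e.g. IN_MOVE_SELF from a logrotate rename) are replaced
 * by a single event with IN_Q_OVERFLOW set and wd == -1. */
#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

static void drain_events(int fd)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len = read(fd, buf, sizeof(buf));

    for (char *p = buf; len > 0 && p < buf + len; ) {
        struct inotify_event *ev = (struct inotify_event *) p;

        if (ev->mask & IN_Q_OVERFLOW) {
            /* Events were lost; the only safe recovery is to rescan
             * every watched file, because a missed rename will never
             * be delivered again. */
            fprintf(stderr, "inotify queue overflow\n");
        }
        else if (ev->mask & IN_MOVE_SELF) {
            /* The watched file itself was renamed, e.g. by logrotate. */
        }
        else if (ev->mask & IN_MODIFY) {
            /* The watched file was written to. */
        }
        p += sizeof(struct inotify_event) + ev->len;
    }
}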

To Reproduce

Tail a large number of files matched by a pattern while they are being written to heavily. The setup I have reads around 30 nginx access log files via a single pattern.

Example config:

[SERVICE]
    Flush           1
    Daemon          off
    Log_Level       error
    Parsers_File vertis_parsers.conf
    HTTP_Server  On
    HTTP_Listen  127.0.0.1
    HTTP_Port    2020
[INPUT]
    Name tail
    Tag tail_nginx
    Path /var/log/nginx/*/access.log
    DB /var/lib/fluent-bit/nginx.sqlite
    Parser nginx_vertis
    Buffer_Chunk_Size 256kb
    Buffer_Max_Size 256kb
    Mem_Buf_Limit 64mb
[FILTER]
    Name record_modifier
    Match *
    Record _level INFO
    Record _service nginx
    Record _container_name ${HOSTNAME}
[OUTPUT]
    Name http
    Match *
    Port 10223
    URI /fluent-bit
    Format json
    Retry_Limit False

Example fluent-bit log with traces (I had to add some extra trace output to track down the problem):

[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log read=9428 lines=8
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log event
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log read=1449 lines=2
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=2277 lines=8
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d1/access.log read=1037 lines=1
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=761 lines=3
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d4/access.log read=9603 lines=12
[2019/02/10 08:55:30] [trace] inotify_mask: 4000
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d3/access.log read=894 lines=3
[2019/02/10 08:55:30] [trace] inotify_mask: 2
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log event
[2019/02/10 08:55:30] [debug] [in_tail] file=/var/log/nginx/d2/access.log read=438 lines=1
[2019/02/10 08:55:30] [trace] inotify_mask: 2
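For reference, the masks above appear to be printed in hex: 0x2 is IN_MODIFY (the watched file was written to), while the lone 0x4000 is IN_Q_OVERFLOW, meaning the kernel dropped queued events because the inotify event queue was full. Any IN_MOVE_SELF (0x800) generated by logrotate's rename in that window is lost for good.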

sysctl params related to inotify:

fs.inotify.max_queued_events = 16384
fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 524288
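Raising fs.inotify.max_queued_events (for example with sysctl fs.inotify.max_queued_events=65536) should make overflows less frequent, but it only enlarges the queue: as long as every event is handled synchronously with reading and processing the file, a heavy enough write rate can still fill it up.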

Expected behavior

The tail plugin should notice the rotation and reload the files after logrotate runs.

Your Environment

  • Version used: 1.0.3
  • Operating System and version: Ubuntu 16.04
  • Filters and plugins: in_tail

Additional context

We are trying to collect access logs on several nginx machines. Every morning after log rotation the tail plugin fails to pick up most of the rename events (more often than not, all of them). It looks like this happens because files are tailed synchronously with receiving inotify events (reading, filling the buffer, processing the content), while the remaining events pile up in the kernel queue until it overflows.
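One possible recovery strategy, sketched below in generic C (this is not fluent-bit's implementation, just an illustration of the idea), is to treat IN_Q_OVERFLOW as a trigger to re-stat every tracked path and compare inodes: if the inode behind a path no longer matches the file being read, the file was rotated while the rename event was lost.

/* Hypothetical recovery sketch, not fluent-bit code: after an
 * IN_Q_OVERFLOW, re-stat each tracked path. If the inode behind the
 * path differs from the inode of the open file, the file was rotated
 * and the IN_MOVE_SELF event was missed. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

struct tracked_file {
    const char *path;   /* path originally matched by the tail pattern */
    ino_t       inode;  /* inode of the file descriptor we keep reading */
};

static int was_rotated(const struct tracked_file *tf)
{
    struct stat st;

    if (stat(tf->path, &st) != 0) {
        return 1;                    /* path is gone: rotated or removed */
    }
    return st.st_ino != tf->inode;   /* new inode: a fresh file replaced it */
}

static void handle_overflow(struct tracked_file *files, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (was_rotated(&files[i])) {
            printf("%s was rotated while events were lost\n", files[i].path);
            /* finish draining the old fd, then reopen the new file here */
        }
    }
}

This is roughly what a periodic refresh of the path pattern achieves; the difference is that it would run immediately once events are known to have been lost, instead of waiting for the next interval.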

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 12
  • Comments: 16 (7 by maintainers)

Most upvoted comments

We had the same issue running fluent-bit on EKS, with logs coming in at a rate of 50k req/s. After updating fluent-bit to v1.6.8, bumping up the resources, and using a combination of memory and filesystem buffering for messages, we were able to avoid log loss. We also modified the log rotation on EKS to rotate files only after they reach 10 GB. Here is the fluent-bit config we have:

  fluent-bit-input.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  1
        Skip_Long_Lines   On
        Buffer_Chunk_Size 10MB
        Buffer_Max_Size 10MB
        Rotate_wait 60
        storage.type filesystem
        DB /var/log/flb_kube.db
        Read_from_Head On
        Mem_Buf_Limit 500MB
  fluent-bit-output.conf: |2

    [OUTPUT]
        Name kafka
        Match *
        Brokers bg-kafka-private-cp-kafka.default.svc.cluster.local:9092
        Topics kubernetes_cluster
        Timestamp_Format  iso8601
        storage.total_limit_size  10G
        rdkafka.batch.num.messages 10000
        rdkafka.batch.size 10000000
        rdkafka.request.required.acks 1
        rdkafka.linger.ms 2000
        rdkafka.compression.codec lz4
  fluent-bit-service.conf: |
    [SERVICE]
        Flush        1
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf
        Parsers_File parsers_custom.conf
        storage.path /var/log/flb-storage/
        storage.backlog.mem_limit 500MB
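
The settings doing most of the work here are presumably Rotate_wait 60, which keeps a rotated file monitored for another 60 seconds so pending data can still be read, and storage.type filesystem together with storage.path and storage.backlog.mem_limit, which let buffered chunks spill to disk instead of being dropped when memory limits are hit.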

@l2dy, no, it did not, because there is an inotify version of this code, and that is what fluent-bit runs: https://github.com/fluent/fluent-bit/blob/f6bb10a07a630243b12f3d5db16e996f30b10786/plugins/in_tail/tail_fs_inotify.c#L159-L160