fluent-bit: Log spam (leading to $$$) in 2.1.0 on AWS when the aws filter is used to acquire the instance-id, fails, and the result is used in the cloudwatch plugin's log_stream_template

Bug Report

We had a cloud-spend cost explosion due to a prolific WARN level log combined with what appears to be a recent (unreleased) code-change.

To be clear: we failed to make our fluent-bit deployment safe. This ticket documents what we did, where we were taken by surprise, and where we would appreciate a guard-rail.

We failed to safely configure fluent-bit (this is entirely user-error) and

  • filled up our root filesystem as fluent-bit logged one warning per log-event it processed into its own log file (which we had failed to set up logrotate for)

  • burned through a significant chunk of cloud spend by surprise during this event (which we could view as an unscheduled load test).

    We normally ship ~4MB/day of the fluent-bit log file to CloudWatch Logs (spread across different AWS CloudWatch LogStreams, ID’d by instance-id to shard).

    During this episode, we shipped and stored ~4TB, so about ~2.5TB/day.

    So, ~1,000,000x the normal volume (without noticing quickly - yes, we did say “oops” internally 🙀)

To Reproduce

  • Example log message if applicable:

    Note: this is a sampling, and due to retries the sequence below may not be totally representative. The record accessor one was the spammiest, and (we think, but have not confirmed!) on every log-event seen by fluent-bit.

    [2023/02/14 hh:mm:ss] [error] [aws_client] failed to add header to request
    [2023/02/14 hh:mm:ss] [ warn] [imds] failed to retrieve metadata
    [2023/02/14 hh:mm:ss] [error] [filter:aws:identify_instance] Failed to get instance ID
    [2023/02/14 hh:mm:ss] [error] [filter:aws:identify_instance] Could not retrieve ec2 metadata from IMDS
    [2023/02/14 hh:mm:ss] [ warn] [record accessor] translation failed, root key=ec2_instance_id
    
  • Steps to reproduce the problem:

    1. Configuration and environment as below.

    2. Start fluent-bit.

    3. tail -f /var/log/fluent-bit/fluent-bit.log

      • observe fluent-bit log the record accessor warning prolifically. <-- this volume was the surprise for us

Expected behavior

  1. fluent-bit is able to retrieve the instance-id from AWS ec2 metadata when IMDSv2 is in use with session-token required

    (please note - we use the AWS config’s credential_process to acquire IAM permissions to write logs and create log streams)

  2. … but after a few consecutive failures, it either

    1. fails hard (crashes)
    2. circuit-breaks and suppresses that warning-log-message (or disables the cloud-based outputs for that data, since those burn $)
    3. exponentially backs off retrying the AWS metadata fetch
    4. rate-limits the warning being logged

    … because the context is: “I’m in AWS (the aws filter is present), the variable that the aws filter sets is used in this AWS log-stream-template, and that variable is necessary to shard log storage to handle incoming load [otherwise we would not have chosen to use a variable to shard the log-stream names]” … plus … “I don’t want to leak $ storing the same message over and over”
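Option 3 above could be sketched roughly as follows. This is Python pseudocode of the *idea*, not fluent-bit's actual C internals; `fetch` here is a hypothetical stand-in for the IMDS instance-id lookup, and the `OSError` failure mode is an assumption for illustration:

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `fetch` with exponential backoff; give up after max_attempts.

    Returns the fetched value on success, or None once attempts are
    exhausted, so the caller can fall back (e.g. to the static
    log_stream_name) instead of warning on every single record.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                return None  # circuit is open: stop retrying, stop warning
            sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return None
```

fluent-bit itself is written in C, so this is only the shape of the behaviour we'd like, not a proposed patch.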

In retrospect, it is clear we could have avoided this dangerous configuration, via multiple means:

  • add logrotate coverage for fluent-bit’s own log file (avoid filling our root filesystem)
  • log to a separate filesystem (avoid filling the root filesystem)
  • add a throttle filter to fluent-bit’s own log-file tail input - we can reasonably establish that “normal” per instance should not exceed x log entries per y time, configure the throttle to accept 2-3x that, and alert on “rate-limit has been reached for a sustained period”
  • alert on “sudden + sustained high rate of logs ingested to CloudWatch”


Your Environment

AWS environment

ec2 metadata v2 (ref), session-token authentication set to required.

service.conf

# https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/classic-mode/configuration-file#config_section

# Gotcha: use spaces, not tabs.
[SERVICE]
    Daemon      off
    log_file     /var/log/fluent-bit/fluent-bit.log
    # log_level can be debug/info/warn/error/off
    # log_level here is the system-wide setting; can be supplied to a plugin to affect just that plugin.
    log_level   info

    # Parsers File
    # ============
    # specify an optional 'Parsers' configuration file
    parsers_file fluent-bit.d/parsers.conf
    # Plugins File
    # ============
    # specify an optional 'Plugins' configuration file to load external plugins.
    plugins_file fluent-bit.d/plugins.conf

    # HTTP Server
    # ===========
    # Enable/Disable the built-in HTTP Server for metrics
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

    # https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
    Health_Check on
    # Storage
    # =======
    # Fluent Bit can use memory and filesystem buffering based mechanisms
    #
    # - https://docs.fluentbit.io/manual/administration/buffering-and-storage
    #
    # storage metrics
    # ---------------
    # publish storage pipeline metrics in '/api/v1/storage'. The metrics are
    # exported only if the 'http_server' option is enabled.
    #
    storage.metrics on

    # storage.path
    # ------------
    # absolute file system path to store filesystem data buffers (chunks).
    #
    storage.path /var/fluent-bit/storage

    # storage.sync
    # ------------
    # configure the synchronization mode used to store the data into the
    # filesystem. It can take the values normal or full.
    #
    storage.sync normal

    # storage.checksum
    # ----------------
    # enable the data integrity check when writing and reading data from the
    # filesystem. The storage layer uses the CRC32 algorithm.
    #
    storage.checksum off

    # storage.backlog.mem_limit
    # -------------------------
    # if storage.path is set, Fluent Bit will look for data chunks that were
    # not delivered and are still in the storage layer, these are called
    # backlog data. This option configure a hint of maximum value of memory
    # to use when processing these records.
    #
    storage.backlog.mem_limit 5M

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail
    name       tail
    alias      fluent_bit_app
    tag        fluent_bit_app.tail
    parser     fluent_bit_app
    path       /var/log/fluent-bit/fluent-bit.log
    path_key   log_path

    DB         /var/fluent-bit/fluent-bit.db
    DB.Sync    Full
    DB.Locking true

# https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata
[FILTER]
    Name              aws
    alias             identify_instance
    Match             *
    imds_version      v2

    account_id        false
    ami_id            false
    az                false
    ec2_instance_id   true
    ec2_instance_type false
    hostname          false
    private_ip        false
    vpc_id            false

# https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch
[OUTPUT]
    Name                cloudwatch_logs
    alias               fluent_bit_app
    Match               fluent_bit_app.*
    region              ${FLB_AWS_REGION}
    log_group_name      ${FLB_AWS_FLUENT_BIT_APP_LOG_GROUP_NAME}
    log_stream_name     fluent_bit_app
    log_stream_template fluent_bit_app_$ec2_instance_id
    # log to a group that terraform manages
    auto_create_group   Off
    log_format          json

fluent-bit.d/parsers.conf

# https://docs.fluentbit.io/manual/pipeline/parsers/configuring-parser
[PARSER]
    name        fluent_bit_app
    format      regex
    regex       ^\[(?<ts>[^\]]+)\] \[ *(?<level>[^\]]+)\] \[(?<caller>[^\]]+)\] ?(?<msg>.*)$
    time_key    ts
    # 2023/01/19 12:49:17
    # https://docs.fluentbit.io/manual/pipeline/parsers/configuring-parser#configuration-parameters
    time_format %Y/%m/%d %H:%M:%S
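As a sanity check, the parser regex above can be exercised against one of the sample log lines. Note that Python's named-group syntax is `(?P<name>…)` where fluent-bit's Onigmo engine accepts `(?<name>…)`; otherwise the pattern below is verbatim from parsers.conf:

```python
import re
from datetime import datetime

# The parser regex from parsers.conf, with Python-style named groups.
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ *(?P<level>[^\]]+)\] \[(?P<caller>[^\]]+)\] ?(?P<msg>.*)$"
)

line = ("[2023/02/14 10:11:12] [ warn] [record accessor] "
        "translation failed, root key=ec2_instance_id")
fields = LOG_RE.match(line).groupdict()
# time_format %Y/%m/%d %H:%M:%S from the parser parses the ts capture:
parsed_ts = datetime.strptime(fields["ts"], "%Y/%m/%d %H:%M:%S")
```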
  • Version used: (unreleased) 2.1.0@HEAD - we bisected and think the change occurred inside https://github.com/fluent/fluent-bit/compare/75d8c15f991ba80256c191ea4b3e53bdfb9b9f91...c085a4bdf7951cc522b952dc4bbde2cb85f8a806

    (we build from source, and have a daily sync from fluent-bit master@HEAD; we do this to afford us the chance to disable features that we don’t need inside the binary that we build, to reduce our vulnerability surface)

  • Configuration: a debug build of c085a4bdf7951cc522b952dc4bbde2cb85f8a806

  • Environment name and version (e.g. Kubernetes? What version?): naked ec2 instances

  • Server type and version: linux x64/arm64

  • Operating System and version: ec2-provided debian 10 and 11

  • Filters and plugins: see config file above for filters. No plugins

Additional context

  • We burned through low thousands of $ on AWS CloudWatch Logs publishing and storage.

    There are certainly ways we could have detected this sooner, and we’re implementing those with high priority.

  • We filled up our root filesystem with fluent-bit’s own log-file, causing instability to our server’s apps.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

This is an AWS problem so the AWS repo issue matters more and we can keep that one open @petemounce

See comment here on case where FLB collects its own logs as a k8s daemonset pod: https://github.com/aws/aws-for-fluent-bit/issues/599#issuecomment-1811488846

One solution to this is to use the grep filter to filter out these logs. Obviously, this has the drawback that you will then not get these error messages:

[record accessor] translation failed, root key=xxx
[FILTER]
    Name    grep
    Match   app-firelens*
    Exclude log translation