fluent-bit: Log spam (leading to $$$) in 2.1.0 on AWS when the aws filter is used to acquire the instance-id, fails, and the result is used in the cloudwatch plugin's log_stream_template

Bug Report

We had a cloud-spend cost explosion due to a prolific WARN level log combined with what appears to be a recent (unreleased) code-change.

To be clear: we failed to make our fluent-bit deployment safe. This ticket documents what we did, where we were taken by surprise, and where we would appreciate a guard-rail.

We failed to safely configure fluent-bit (this is entirely user-error) and

  • filled up our root filesystem as fluent-bit logged one warning per log-event it processed into its own log file (which we had failed to set up logrotate for)

  • burned through a significant chunk of cloud spend by surprise during this event (which we could view as an unscheduled load test).

    We normally ship ~4MB/day of the fluent-bit log file to CloudWatch Logs (spread across different AWS CloudWatch LogStreams, ID’d by instance-id to shard).

    During this episode, we shipped and stored ~4TB, so about ~2.5TB/day.

    So, ~1,000,000x the normal volume (without noticing quickly - yes, we did say “oops” internally 🙀)

To Reproduce

  • Example log message if applicable:

    Note: this is a sampling, and due to retries the sequence below may not be totally representative. The record accessor one was the spammiest, and (we think, but have not confirmed!) on every log-event seen by fluent-bit.

    [2023/02/14 hh:mm:ss] [error] [aws_client] failed to add header to request
    [2023/02/14 hh:mm:ss] [ warn] [imds] failed to retrieve metadata
    [2023/02/14 hh:mm:ss] [error] [filter:aws:identify_instance] Failed to get instance ID
    [2023/02/14 hh:mm:ss] [error] [filter:aws:identify_instance] Could not retrieve ec2 metadata from IMDS
    [2023/02/14 hh:mm:ss] [ warn] [record accessor] translation failed, root key=ec2_instance_id
    
  • Steps to reproduce the problem:

    1. Configuration and environment as below.

    2. Start fluent-bit.

    3. tail -f /var/log/fluent-bit/fluent-bit.log

      • observe fluent-bit log the record accessor warning prolifically. <-- this volume was the surprise for us

Expected behavior

  1. fluent-bit is able to retrieve the instance-id from AWS ec2 metadata when IMDSv2 is in use with session-token required

    (please note - we use the AWS config’s credential_process to acquire IAM permissions to write logs and create log streams)

  2. … but after a few consecutive failures, it either

    1. fails hard (crashes)
    2. circuit-breaks and suppresses that warning-log-message (or disables the cloud-based outputs for that data, since those burn $)
    3. exponentially backs off retrying the AWS metadata fetch
    4. rate-limits the warning being logged

    … because the context is: “I’m in AWS (the aws filter is present), the variable that the aws filter sets is used in this AWS log-stream-template, and that variable is necessary to shard log storage to handle incoming load [otherwise we would not have chosen to use a variable to shard the log-stream names]” … plus … “I don’t want to leak $ storing the same message over and over”
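Option 3 above could be sketched roughly as follows. This is Python pseudocode of the *idea*, not fluent-bit's actual C internals; `fetch` here is a hypothetical stand-in for the IMDS instance-id lookup, and the `OSError` failure mode is an assumption for illustration:

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry `fetch` with exponential backoff; give up after max_attempts.

    Returns the fetched value on success, or None once attempts are
    exhausted, so the caller can fall back (e.g. to the static
    log_stream_name) instead of warning on every single record.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                return None  # circuit is open: stop retrying, stop warning
            sleep(delay)
            delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    return None
```

fluent-bit itself is written in C, so this is only the shape of the behaviour we'd like, not a proposed patch.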

In retrospect, it is clear we could have avoided this dangerous configuration, via multiple means:

  • add logrotate coverage for fluent-bit’s own log file (avoid filling our root filesystem)
  • log to a separate filesystem (avoid filling the root filesystem)
  • add a throttle filter to fluent-bit’s own log-file tail input - we can reasonably establish that “normal” per instance should not exceed x log entries per y time, configure the throttle to accept 2-3x that, and alert on “rate-limit has been reached for a sustained period”
  • alert on “sudden + sustained high rate of logs ingested to CloudWatch”


Your Environment

AWS environment

ec2 metadata v2 (ref), session-token authentication set to required.

service.conf

# https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/classic-mode/configuration-file#config_section

# Gotcha: use spaces, not tabs.
[SERVICE]
    Daemon      off
    log_file     /var/log/fluent-bit/fluent-bit.log
    # log_level can be debug/info/warn/error/off
    # log_level here is the system-wide setting; can be supplied to a plugin to affect just that plugin.
    log_level   info

    # Parsers File
    # ============
    # specify an optional 'Parsers' configuration file
    parsers_file fluent-bit.d/parsers.conf
    # Plugins File
    # ============
    # specify an optional 'Plugins' configuration file to load external plugins.
    plugins_file fluent-bit.d/plugins.conf

    # HTTP Server
    # ===========
    # Enable/Disable the built-in HTTP Server for metrics
    http_server  On
    http_listen  0.0.0.0
    http_port    2020

    # https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
    Health_Check on
    # Storage
    # =======
    # Fluent Bit can use memory and filesystem buffering based mechanisms
    #
    # - https://docs.fluentbit.io/manual/administration/buffering-and-storage
    #
    # storage metrics
    # ---------------
    # publish storage pipeline metrics in '/api/v1/storage'. The metrics are
    # exported only if the 'http_server' option is enabled.
    #
    storage.metrics on

    # storage.path
    # ------------
    # absolute file system path to store filesystem data buffers (chunks).
    #
    storage.path /var/fluent-bit/storage

    # storage.sync
    # ------------
    # configure the synchronization mode used to store the data into the
    # filesystem. It can take the values normal or full.
    #
    storage.sync normal

    # storage.checksum
    # ----------------
    # enable the data integrity check when writing and reading data from the
    # filesystem. The storage layer uses the CRC32 algorithm.
    #
    storage.checksum off

    # storage.backlog.mem_limit
    # -------------------------
    # if storage.path is set, Fluent Bit will look for data chunks that were
    # not delivered and are still in the storage layer, these are called
    # backlog data. This option configure a hint of maximum value of memory
    # to use when processing these records.
    #
    storage.backlog.mem_limit 5M

[INPUT]
    # https://docs.fluentbit.io/manual/pipeline/inputs/tail
    name       tail
    alias      fluent_bit_app
    tag        fluent_bit_app.tail
    parser     fluent_bit_app
    path       /var/log/fluent-bit/fluent-bit.log
    path_key   log_path

    DB         /var/fluent-bit/fluent-bit.db
    DB.Sync    Full
    DB.Locking true

# https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata
[FILTER]
    Name              aws
    alias             identify_instance
    Match             *
    imds_version      v2

    account_id        false
    ami_id            false
    az                false
    ec2_instance_id   true
    ec2_instance_type false
    hostname          false
    private_ip        false
    vpc_id            false

# https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch
[OUTPUT]
    Name                cloudwatch_logs
    alias               fluent_bit_app
    Match               fluent_bit_app.*
    region              ${FLB_AWS_REGION}
    log_group_name      ${FLB_AWS_FLUENT_BIT_APP_LOG_GROUP_NAME}
    log_stream_name     fluent_bit_app
    log_stream_template fluent_bit_app_$ec2_instance_id
    # log to a group that terraform manages
    auto_create_group   Off
    log_format          json

fluent-bit.d/parsers.conf

# https://docs.fluentbit.io/manual/pipeline/parsers/configuring-parser
[PARSER]
    name        fluent_bit_app
    format      regex
    regex       ^\[(?<ts>[^\]]+)\] \[ *(?<level>[^\]]+)\] \[(?<caller>[^\]]+)\] ?(?<msg>.*)$
    time_key    ts
    # 2023/01/19 12:49:17
    # https://docs.fluentbit.io/manual/pipeline/parsers/configuring-parser#configuration-parameters
    time_format %Y/%m/%d %H:%M:%S
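As a sanity check, the parser regex above can be exercised against one of the sample log lines. Note that Python's named-group syntax is `(?P<name>…)` where fluent-bit's Onigmo engine accepts `(?<name>…)`; otherwise the pattern below is verbatim from parsers.conf:

```python
import re
from datetime import datetime

# The parser regex from parsers.conf, with Python-style named groups.
LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[ *(?P<level>[^\]]+)\] \[(?P<caller>[^\]]+)\] ?(?P<msg>.*)$"
)

line = ("[2023/02/14 10:11:12] [ warn] [record accessor] "
        "translation failed, root key=ec2_instance_id")
fields = LOG_RE.match(line).groupdict()
# time_format %Y/%m/%d %H:%M:%S from the parser parses the ts capture:
parsed_ts = datetime.strptime(fields["ts"], "%Y/%m/%d %H:%M:%S")
```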
  • Version used: (unreleased) 2.1.0@HEAD - we bisected and think the change occurred inside https://github.com/fluent/fluent-bit/compare/75d8c15f991ba80256c191ea4b3e53bdfb9b9f91...c085a4bdf7951cc522b952dc4bbde2cb85f8a806

    (we build from source, and have a daily sync from fluent-bit master@HEAD; we do this to afford us the chance to disable features that we don’t need inside the binary that we build, to reduce our vulnerability surface)

  • Configuration: a debug build of c085a4bdf7951cc522b952dc4bbde2cb85f8a806

  • Environment name and version (e.g. Kubernetes? What version?): naked ec2 instances

  • Server type and version: linux x64/arm64

  • Operating System and version: ec2-provided debian 10 and 11

  • Filters and plugins: see config file above for filters. No plugins

Additional context

  • We burned through low thousands of $ on AWS CloudWatch Logs publishing and storage.

    There are certainly ways we could have detected this sooner, and we’re implementing those with high priority.

  • We filled up our root filesystem with fluent-bit’s own log-file, causing instability to our server’s apps.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

This is an AWS problem so the AWS repo issue matters more and we can keep that one open @petemounce

See comment here on case where FLB collects its own logs as a k8s daemonset pod: https://github.com/aws/aws-for-fluent-bit/issues/599#issuecomment-1811488846

One solution to this is to use the grep filter to filter out these logs. Obviously, this has the drawback that you will then not get these error messages:

[record accessor] translation failed, root key=xxx
[FILTER]
    Name    grep
    Match   app-firelens*
    Exclude log translation