fluent-bit: Log spam (leading to $$$) in 2.1.0 on AWS when the aws filter is used to acquire instance-id, fails, and is used within the cloudwatch plugin's log_stream_template
Bug Report
We had a cloud-spend explosion due to a prolific WARN-level log message combined with what appears to be a recent (unreleased) code change.
To be clear: we failed to make our fluent-bit setup safe. This ticket documents what we did, where we got taken by surprise, and where we would appreciate a guard-rail.
We failed to safely configure fluent-bit (this is entirely user-error) and:
- filled up our root filesystem, as fluent-bit logged a warning per log-event processed to its own log file (which we had failed to set up logrotate on)
- burned through a significant chunk of cloud spend by surprise during this event (which we could view as an unscheduled load test).
We normally ship ~4MB/day of the fluent-bit log file to CloudWatch Logs (spread across different AWS CloudWatch LogStreams, ID'd by instance-id to shard). During this episode, we shipped and stored ~4TB in total, about ~2.5TB/day - roughly a ~1,000,000x increase (without noticing quickly - yes, we did say "oops" internally 🙀).
To Reproduce
- Example log message if applicable:

Note: this is a sampling, and due to retries the sequence below may not be totally representative. The "record accessor" one was the spammiest, and (we think, but have not confirmed!) occurred on every log-event seen by fluent-bit.

[2023/02/14 hh:mm:ss] [error] [aws_client] failed to add header to request
[2023/02/14 hh:mm:ss] [ warn] [imds] failed to retrieve metadata
[2023/02/14 hh:mm:ss] [error] [filter:aws:identify_instance] Failed to get instance ID
[2023/02/14 hh:mm:ss] [error] [filter:aws:identify_instance] Could not retrieve ec2 metadata from IMDS
[2023/02/14 hh:mm:ss] [ warn] [record accessor] translation failed, root key=ec2_instance_id
- Steps to reproduce the problem:
  - Configuration and environment as below.
  - Start fluent-bit.
  - tail -f /var/log/fluent-bit/fluent-bit.log
  - Observe fluent-bit log the "record accessor" warning prolifically. <-- this prolificness was the surprise for us
-
Expected behavior
- fluent-bit is able to retrieve the instance-id from AWS ec2 metadata when IMDSv2 is in use with session-token required (please note - we use the AWS config's credential_process to acquire IAM permissions to write logs and create log streams)
- … but if it fails after a few contiguous failures, it either:
  - fails hard (crash)
  - circuit-breaks and turns off that warning-log-message (or turns off the cloud-based outputs for that data, since those burn $)
  - exponentially backs off trying to retrieve that AWS metadata
  - rate-limits the warning being logged
… because the context is “I’m in AWS (because the aws filter is present), and this variable that the aws filter sets is used in this AWS log-stream-template, and that’s necessary to shard log storage to handle incoming load [or we would not have chosen to use a variable to shard the log-stream names]” … + … “I don’t want to leak $ storing the same message over and over”
In retrospect, it is clear we could have avoided this dangerous configuration, via multiple means:
- add log-rotate to cover fluent-bit’s own log file (avoid filling our root filesystem)
- log to a separate filesystem (avoid filling root filesystem)
- add a "throttle" filter to fluent-bit's own log-file tail input - we can reasonably see that "normal" per instance should not exceed x log entries per y time, configure the throttle to accept 2-3x that, and alert on "rate-limit has been reached for a sustained period" (a sketch of such a filter is shown after this list)
- alert on "sudden+sustained high rate of logs ingested to CloudWatch"
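
A minimal sketch of that throttle idea, assuming the tail input above (tag fluent_bit_app.tail); the Rate/Window/Interval values are placeholder assumptions, to be tuned to roughly 2-3x the observed normal per-instance rate:

[FILTER]
    # https://docs.fluentbit.io/manual/pipeline/filters/throttle
    Name         throttle
    alias        limit_own_log
    Match        fluent_bit_app.*
    # hypothetical numbers: allow ~100 records per 1-minute interval,
    # averaged over a 5-interval window
    Rate         100
    Window       5
    Interval     1m
    # emit status messages about the current rate and limits, so they can be alerted on
    Print_Status true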
Screenshots
Your Environment
AWS environment: ec2 metadata v2 (ref), session-token authentication set to required.
service.conf
# https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/classic-mode/configuration-file#config_section
# Gotcha: use spaces, not tabs.
[SERVICE]
Daemon off
log_file /var/log/fluent-bit/fluent-bit.log
# log_level can be debug/info/warn/error/off
# log_level here is the system-wide setting; can be supplied to a plugin to affect just that plugin.
log_level info
# Parsers File
# ============
# specify an optional 'Parsers' configuration file
parsers_file fluent-bit.d/parsers.conf
# Plugins File
# ============
# specify an optional 'Plugins' configuration file to load external plugins.
plugins_file fluent-bit.d/plugins.conf
# HTTP Server
# ===========
# Enable/Disable the built-in HTTP Server for metrics
http_server On
http_listen 0.0.0.0
http_port 2020
# https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
Health_Check on
# Storage
# =======
# Fluent Bit can use memory and filesystem buffering based mechanisms
#
# - https://docs.fluentbit.io/manual/administration/buffering-and-storage
#
# storage metrics
# ---------------
# publish storage pipeline metrics in '/api/v1/storage'. The metrics are
# exported only if the 'http_server' option is enabled.
#
storage.metrics on
# storage.path
# ------------
# absolute file system path to store filesystem data buffers (chunks).
#
storage.path /var/fluent-bit/storage
# storage.sync
# ------------
# configure the synchronization mode used to store the data into the
# filesystem. It can take the values normal or full.
#
storage.sync normal
# storage.checksum
# ----------------
# enable the data integrity check when writing and reading data from the
# filesystem. The storage layer uses the CRC32 algorithm.
#
storage.checksum off
# storage.backlog.mem_limit
# -------------------------
# if storage.path is set, Fluent Bit will look for data chunks that were
# not delivered and are still in the storage layer, these are called
# backlog data. This option configure a hint of maximum value of memory
# to use when processing these records.
#
storage.backlog.mem_limit 5M
[INPUT]
# https://docs.fluentbit.io/manual/pipeline/inputs/tail
name tail
alias fluent_bit_app
tag fluent_bit_app.tail
parser fluent_bit_app
path /var/log/fluent-bit/fluent-bit.log
path_key log_path
DB /var/fluent-bit/fluent-bit.db
DB.Sync Full
DB.Locking true
# https://docs.fluentbit.io/manual/pipeline/filters/aws-metadata
[FILTER]
Name aws
alias identify_instance
Match *
imds_version v2
account_id false
ami_id false
az false
ec2_instance_id true
ec2_instance_type false
hostname false
private_ip false
vpc_id false
# https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch
[OUTPUT]
Name cloudwatch_logs
alias fluent_bit_app
Match fluent_bit_app.*
region ${FLB_AWS_REGION}
log_group_name ${FLB_AWS_FLUENT_BIT_APP_LOG_GROUP_NAME}
log_stream_name fluent_bit_app
log_stream_template fluent_bit_app_$ec2_instance_id
# log to a group that terraform manages
auto_create_group Off
log_format json
fluent-bit.d/parsers.conf
# https://docs.fluentbit.io/manual/pipeline/parsers/configuring-parser
[PARSER]
name fluent_bit_app
format regex
regex ^\[(?<ts>[^\]]+)\] \[ *(?<level>[^\]]+)\] \[(?<caller>[^\]]+)\] ?(?<msg>.*)$
time_key ts
# 2023/01/19 12:49:17
# https://docs.fluentbit.io/manual/pipeline/parsers/configuring-parser#configuration-parameters
time_format %Y/%m/%d %H:%M:%S
- Version used: (unreleased) 2.1.0@HEAD - we bisected and think the change occurred inside https://github.com/fluent/fluent-bit/compare/75d8c15f991ba80256c191ea4b3e53bdfb9b9f91...c085a4bdf7951cc522b952dc4bbde2cb85f8a806 (we build from source, and have a daily sync from fluent-bit master@HEAD; we do this to afford us the chance to disable features that we don't need inside the binary that we build, to reduce our vulnerability surface)
- Configuration: a debug build of c085a4bdf7951cc522b952dc4bbde2cb85f8a806
- Environment name and version (e.g. Kubernetes? What version?): naked ec2 instances
- Server type and version: linux x64/arm64
- Operating System and version: ec2-provided debian 10 and 11
- Filters and plugins: see config file above for filters. No plugins.
Additional context
- We burned through low-thousands of $ on AWS CloudWatch Logs publishing and storage. There are totally ways we could have detected this sooner, and we're implementing those with high priority.
- We filled up our root filesystem with fluent-bit's own log-file, causing instability to our server's apps.
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (8 by maintainers)
This is an AWS problem so the AWS repo issue matters more and we can keep that one open @petemounce
See comment here on case where FLB collects its own logs as a k8s daemonset pod: https://github.com/aws/aws-for-fluent-bit/issues/599#issuecomment-1811488846
One solution to this is to use the grep filter to filter out these logs. Obviously, this has the drawback that you will then not get these error messages:
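
A minimal sketch of what such a grep filter could look like, assuming the tail input and parser shown above (so the warning text lands in the msg field); the exact exclude pattern is an assumption, not the maintainers' snippet:

[FILTER]
    # https://docs.fluentbit.io/manual/pipeline/filters/grep
    Name    grep
    Match   fluent_bit_app.*
    # drop records whose msg matches "translation" (the spammy
    # "translation failed, root key=ec2_instance_id" warnings);
    # broaden or narrow this regex as needed
    Exclude msg translation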