fluentd: Slow memory leak of Fluentd v0.14 compared to v0.12
Fluentd version: 0.14.25
Environment: running inside a debian:stretch-20180312 based container.
Dockerfile: here
We noticed a slow memory leak that built up over a month or so.
The same setup running Fluentd 0.12.41 had stable memory usage over the same period of time.
We are still investigating and trying to narrow down the versions, but wanted to create a ticket to track this.
Config:
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/k8s-gcp-containers.log.pos
  tag reform.*
  read_from_head true
  format multi_format
  <pattern>
    format json
    time_key time
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </pattern>
  <pattern>
    format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
    time_format %Y-%m-%dT%H:%M:%S.%N%:z
  </pattern>
</source>
<filter reform.**>
  @type parser
  format /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<log>.*)/
  reserve_data true
  suppress_parse_error_log true
  emit_invalid_record_to_error false
  key_name log
</filter>
<match reform.**>
  @type record_reformer
  enable_ruby true
  <record>
    # Extract local_resource_id from tag for 'k8s_container' monitored
    # resource. The format is:
    # 'k8s_container.<namespace_name>.<pod_name>.<container_name>'.
    "logging.googleapis.com/local_resource_id" ${"k8s_container.#{tag_suffix[4].rpartition('.')[0].split('_')[1]}.#{tag_suffix[4].rpartition('.')[0].split('_')[0]}.#{tag_suffix[4].rpartition('.')[0].split('_')[2].rpartition('-')[0]}"}
    # Rename the field 'log' to a more generic field 'message'. This way the
    # fluent-plugin-google-cloud knows to flatten the field as textPayload
    # instead of jsonPayload after extracting 'time', 'severity' and
    # 'stream' from the record.
    message ${record['log']}
  </record>
  tag ${if record['stream'] == 'stderr' then 'stderr' else 'stdout' end}
  remove_keys stream,log
</match>
<match fluent.**>
  @type null
</match>
# This section is exclusive for k8s_container logs. These logs come with
# 'stderr'/'stdout' tags.
# We use a separate output stanza for 'k8s_node' logs with a smaller buffer
# because node logs are less important than user's container logs.
<match {stderr,stdout}>
  @type google_cloud
  # Try to detect JSON formatted log entries.
  detect_json true
  # Collect metrics in Prometheus registry about plugin activity.
  enable_monitoring true
  monitoring_type prometheus
  # Allow log entries from multiple containers to be sent in the same request.
  split_logs_by_tag false
  # Set the buffer type to file to improve reliability and reduce memory consumption.
  buffer_type file
  buffer_path /var/log/k8s-fluentd-buffers/kubernetes.containers.buffer
  # Set queue_full action to block because we want to pause gracefully
  # in case of excessive load instead of throwing an exception.
  buffer_queue_full_action block
  # Set the chunk limit conservatively to avoid exceeding the recommended
  # chunk size of 5MB per write request.
  buffer_chunk_limit 1M
  # Cap the combined memory usage of this buffer and the one below to
  # 1MiB/chunk * (6 + 2) chunks = 8 MiB.
  buffer_queue_limit 6
  # Never wait more than 5 seconds before flushing logs in the non-error case.
  flush_interval 5s
  # Never wait longer than 30 seconds between retries.
  max_retry_wait 30
  # Disable the limit on the number of retries (retry forever).
  disable_retry_limit
  # Use multiple threads for processing.
  num_threads 2
  use_grpc false
  # Use Metadata Agent to get monitored resource.
  enable_metadata_agent true
</match>
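While investigating the growth, it can help to scrape the metrics that enable_monitoring/monitoring_type prometheus register. Below is a minimal sketch of how those metrics could be exposed over HTTP, assuming the fluent-plugin-prometheus gem is installed; the bind address, port, and path shown are illustrative defaults, not part of the setup above.
<source>
  @type prometheus
  # Expose the Prometheus registry (including the google_cloud plugin metrics
  # enabled above) on an HTTP endpoint for scraping.
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>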
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 48 (15 by maintainers)
Commits related to this issue
- in_tail: Fix rotation related resource leak. fix #1941 Signed-off-by: Masahiro Nakagawa <repeatedly@gmail.com> — committed to fluent/fluentd by repeatedly 6 years ago
- Merge pull request #2105 from fluent/fix-in_tail-resource-leak in_tail: Fix rotation related resource leak. fix #1941 — committed to fluent/fluentd by repeatedly 6 years ago
Released v1.2.5. Thanks for testing.
I released v1.2.5.rc1 for testing. You can install this version with the --pre option of gem install (for example, gem install fluentd --pre).
Just a thought: could log rotation contribute to the issue? Thinking about the difference between the two setups (k8s vs. no k8s), this is the first thing that crossed my mind.
Current GKE log rotation happens when the log file exceeds 10 MB. At a load of 100 KB/s, the log file is therefore rotated roughly every 10 * 1024 / 100 ≈ 102 seconds.
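If rotation does turn out to be the trigger, the relevant knob on the tailing side is in_tail's rotate_wait, which controls how long a rotated file keeps being read before its watcher is released. A minimal sketch with that parameter spelled out explicitly; the value shown is just the documented default, not a suggested change, and the parser settings from the config above are omitted for brevity.
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/k8s-gcp-containers.log.pos
  tag reform.*
  # Keep reading a rotated file for this long before releasing its watcher.
  rotate_wait 5s
  read_from_head true
  # (parser configuration omitted for brevity)
  format json
</source>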
I am experiencing the same problem; memory usage keeps growing.
Environment: Amazon Linux 2
Fluentd version: starting fluentd-1.2.2 pid=1 ruby="2.4.4"