fluentd: buffer space has too many data errors on k8s cluster
- fluentd version 1.3
- Your configuration
<match service>
  @type file
  path /fluentd/log/service.%Y-%m-%d.%H.log
  append true
  format single_value
  message_key log
  add_newline false
  <buffer>
    @type file
    path /fluentd/buffer/ps
    timekey 1h
    timekey_use_utc true
    timekey_wait 1m
    # chunk_limit_records 16777216
    # chunk_limit_size 256Mb
    flush_mode interval
    flush_interval 30s
  </buffer>
  compress gzip
  symlink_path /fluentd/log/service-latest.log
</match>
- Problem description: We run a DaemonSet of fluentd on our Kubernetes cluster. We frequently see errors such as:
fluentd-nrdqd fluentd 2019-05-12 13:40:30 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.3.0/lib/fluent/plugin/buffer.rb:269:in `write'" tag="kubernetes.var.log.containers.service-94596bc68-5hdj5_default_service-f5539e3f3956a646dcee49bcd57aeb1036a82baf0f210e1ad31f64d7cdd6471f.log"
The fluentd server itself runs on a dedicated system outside of the Kubernetes cluster. We do see a few warnings on it from time to time:
#0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=24.349883934482932 slow_flush_log_threshold=20.0 plugin_id="object:3f9090f98e80"
We’ve tried adjusting every chunk_limit and flush setting to get rid of this error, but it doesn’t seem to go away. Is there an obvious error in our configuration that we’re missing?
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 16 (1 by maintainers)
BufferOverflowError happens when the output speed is slower than the incoming traffic, so there are several approaches:
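As an illustration only (the values below are assumptions, not recommendations from this thread), the knobs that usually matter are the buffer's total capacity, how aggressively and with how many threads it flushes, and what should happen once the buffer is full:

<buffer>
  @type file
  path /fluentd/buffer/ps
  # Example values - tune them to the actual log volume and available disk.
  total_limit_size 4GB                # BufferOverflowError is raised once this total is exceeded
  chunk_limit_size 32MB               # smaller chunks are handed to the output sooner
  flush_mode interval
  flush_interval 10s
  flush_thread_count 4                # parallel flushes help when the destination is slow
  overflow_action drop_oldest_chunk   # or `block`; the default `throw_exception` raises the error above
</buffer>

If the destination simply cannot keep up with the incoming traffic, no buffer setting fixes that by itself; the buffer only decides how much data is held and what is sacrificed when it fills.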
I have the same issue; surprisingly, restarting fluentd works for a while.
I would appreciate guidance as well. It’s not clear which buffer is overflowing, or how to set the sizes (buffer/chunk/queue limits) properly. In my case, fluentbit forwards to a fluentd which forwards to another fluentd, and the buffer overflow errors show up mostly in the last fluentd in the chain:
[328] kube.var.log.containers.fluentd-79cc4cffbd-d9cdg_sre_fluentd-dccc4f286753b75a53c464446af44ffcbeba5ad3a21c9a947a11e94f4c6892b2.log: [1560431258.193260514, {"log"=>"2019-06-13 13:07:38 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:269:in `write'" tag="raw.kube.app.obelix"
[330] kube.var.log.containers.fluentd-79cc4cffbd-d9cdg_sre_fluentd-dccc4f286753b75a53c464446af44ffcbeba5ad3a21c9a947a11e94f4c6892b2.log: [1560431258.193283014, {"log"=>"2019-06-13 13:07:38 +0000 [warn]: #0 emit transaction failed: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/usr/lib/ruby/gems/2.5.0/gems/fluentd-1.2.6/lib/fluent/plugin/buffer.rb:269:in `write'" tag="kube.var.log.containers.obelix-j6h2n_ves-system_obelix-74bc7f7ecbcb9981c5f39eab9d85b855c5145f299d71d68ad4bef8f223653327.log"
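One option sometimes suggested for a relay like this (a sketch with an assumed tag pattern, placeholder host/port, and assumed sizes, not the poster's actual configuration) is to give the intermediate fluentd's forward output an explicit cap and overflow_action block, so that instead of raising BufferOverflowError it applies backpressure toward fluentbit:

<match kube.**>
  @type forward
  <server>
    # placeholder address of the next fluentd in the chain
    host fluentd-aggregator.example.local
    port 24224
  </server>
  <buffer>
    @type file
    path /fluentd/buffer/forward
    total_limit_size 4GB     # assumed cap; size it to the node's available disk
    flush_interval 5s
    flush_thread_count 4
    overflow_action block    # stall emits instead of failing them with BufferOverflowError
  </buffer>
</match>

Blocking keeps the data, but it can back the delay up all the way to the log sources, so it is a trade-off rather than a fix.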
This solution worked for me.
Is there any solution for this? If you continuously lose data, this can’t be used in production.
Hoping to get some guidance on our setup. I am using Elasticsearch for the logs. Initially the fluentd pod was throwing the following error:
Worker 0 finished unexpectedly with signal SIGKILL
which was resolved after increasing the memory limit to 2Gi. Then we started getting a different fluentd error:
[_cluster-elasticsearch_cluster-elasticsearch_elasticsearch] failed to write data into buffer by buffer overflow action=:throw_exception
We attempted to resolve it by tweaking the buffer settings, which now look like this:
buffer:
  timekey: 1m
  timekey_wait: 30s
  timekey_use_utc: true
  chunk_limit_size: 16MB
  flush_mode: interval
  flush_interval: 5s
  flush_thread_count: 8
But I can still see that the buffer size on fluentd is 5.3G (not increasing for the last two days), and every so often I see the following error:
[_cluster-elasticsearch_cluster-elasticsearch_elasticsearch] failed to write data into buffer by buffer overflow action=:throw_exception
The buffer size seems to suggest that there are still logs waiting to be pushed to Elasticsearch, and that fluentd is struggling to cope with the logs coming from fluentbit. Note that I do see some recent logs in Elasticsearch, but not all of them. Appreciate any suggestions.
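For reference, here is a hedged sketch of what the rendered fluentd match for that output could look like with an explicit total cap and a non-default overflow_action; the host, buffer path, and sizes are placeholders rather than values from the setup above, and drop_oldest_chunk trades old events for new ones instead of raising the buffer-overflow error:

<match **>
  @type elasticsearch
  # placeholder connection details
  host elasticsearch.example.local
  port 9200
  logstash_format true
  <buffer>
    @type file
    path /fluentd/buffer/elasticsearch
    chunk_limit_size 16MB
    total_limit_size 4GB                # "failed to write data into buffer by buffer overflow" fires once this fills
    flush_mode interval
    flush_interval 5s
    flush_thread_count 8
    overflow_action drop_oldest_chunk   # the default throw_exception produces the error quoted above
  </buffer>
</match>

A buffer directory that sits at 5.3G without shrinking usually means chunks are still waiting to flush or are being retried, so checking the Elasticsearch side (indexing rate, rejections) matters as much as the fluentd settings.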