fluentd: Logging from a single k8s node stops and Fluentd CPU -> 100%. Log events lost.

Describe the bug (v1.12 only). The Fluentd process pegs at 100% CPU on a single node and log events are lost. Other nodes do not fail and continue to log to the same store. This is a critical issue: the 100% CPU usage causes co-located pods to restart, and log events are lost. We have rolled back to v1.11 on all clusters.

To Reproduce Unknown. There are no log entries that give any indication of why this occurs. The events occur multiple times per day, on different nodes and in multiple clusters. There is no indication of the root cause: no indicative events are logged by Fluentd, Elasticsearch, or the wider Kubernetes environment. We have looked very hard over many weeks and the root cause still evades us, even at log level debug.

Expected behavior Reload/refresh the connection to the store. Events are not lost. Improved diagnostics…

It should be noted that calling the /api/plugins.flushBuffers endpoint often causes the buffer to be written successfully and CPU usage to return to normal.
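
As a stop-gap we trigger that flush by hand. A minimal sketch of the workaround, assuming Fluentd's HTTP RPC endpoint has been enabled via `rpc_endpoint` in the `<system>` section and that curl is available in the container (the port, namespace, and pod name below are placeholders, not our exact values):

```sh
# Sketch of the manual flush workaround (placeholders throughout).
# Prerequisite in fluent.conf:
#   <system>
#     rpc_endpoint 127.0.0.1:24444
#   </system>

# Exec into the affected Fluentd pod and hit the RPC endpoint.
kubectl -n logging exec fluentd-xxxxx -- \
  curl -s http://127.0.0.1:24444/api/plugins.flushBuffers
```

In our case this usually drains the buffer and CPU drops back to normal, but it is obviously a workaround, not a fix.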

Your Environment

  • AWS EKS cluster 1.19.6
  • Fluentd daemonset v1.12.3
  • Elasticsearch plugin 5.0.3 & 4.1.4

Note this is seen in multiple clusters.

After rolling back to v1.11 (ES plugin 4.1.1) the issue goes away (identical configuration).
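
For anyone needing the same mitigation, the rollback is just a matter of pinning the daemonset back to a v1.11-based image. A rough sketch (namespace, daemonset/container names, and image tag are illustrative, not our exact values):

```sh
# Sketch only: pin the Fluentd daemonset back to a v1.11-based image.
kubectl -n logging set image daemonset/fluentd \
  fluentd=fluent/fluentd-kubernetes-daemonset:v1.11-debian-elasticsearch7-1

# Wait for the rolled-back pods to become ready.
kubectl -n logging rollout status daemonset/fluentd
```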

See this link for full details.

https://github.com/uken/fluent-plugin-elasticsearch/issues/885

Having created a v1.12.3 / plugin v4.1.4 image and seen the same issues repeat, I no longer believe that this is a plugin issue; rather, it is a reconnect/buffer-write issue introduced with v1.12.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 21 (10 by maintainers)

Most upvoted comments

I’m now suspecting the following Ruby issue:

And a related excon issue:

td-agent 4.2.0 has been released: https://www.fluentd.org/blog/td-agent-v4.2.0-has-been-released Sorry for the delay.

We’ll close this after we release td-agent 4.2.0 (it will ship with Ruby 2.7.4).
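
(For those waiting on this: once td-agent 4.2.0 is installed you can confirm the bundled Ruby version directly. The commands below assume a default td-agent 4 host install under /opt/td-agent.)

```sh
# Check the installed Fluentd/td-agent version and its bundled Ruby.
td-agent --version
/opt/td-agent/bin/ruby --version   # expected to report 2.7.4 on td-agent 4.2.0
```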

@andrew-pickin-epi Me too. We downgraded to v1.11 and it looks OK. It’s a puzzling issue.