fluentd: Logging from a single k8s node stops and Fluentd CPU -> 100%. Log events lost.

Describe the bug (v1.12 only). The Fluentd process pegs at 100% CPU on a single node and log events are lost. Other nodes do not fail and continue to log to the same store. This is a critical issue: the 100% CPU usage causes co-located pods to restart, and log events are lost. We have rolled back to v1.11 on all clusters.

To Reproduce Unknown. There are no log entries that give any indication of why this occurs. The events occur multiple times per day, on different nodes and in multiple clusters. There is no indication of the root cause: no indicative events are logged by Fluentd, Elasticsearch, or the wider Kubernetes environment. We have looked very hard over many weeks and the root cause still evades us, even at log level debug.

Expected behavior Reload/refresh the connection to the store. Events are not lost. Improved diagnostics…

It should be noted that calling the /api/plugins.flushBuffers endpoint often causes the buffer to be written successfully and CPU usage to return to normal.
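
As a stop-gap we trigger that flush by hand. A minimal sketch of the workaround, assuming Fluentd's HTTP RPC endpoint has been enabled via `rpc_endpoint` in the `<system>` section and that curl is available in the container (the port, namespace, and pod name below are placeholders, not our exact values):

```sh
# Sketch of the manual flush workaround (placeholders throughout).
# Prerequisite in fluent.conf:
#   <system>
#     rpc_endpoint 127.0.0.1:24444
#   </system>

# Exec into the affected Fluentd pod and hit the RPC endpoint.
kubectl -n logging exec fluentd-xxxxx -- \
  curl -s http://127.0.0.1:24444/api/plugins.flushBuffers
```

In our case this usually drains the buffer and CPU drops back to normal, but it is obviously a workaround, not a fix.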

Your Environment

  • AWS EKS cluster 1.19.6
  • Fluentd daemonset v1.12.3
  • Elasticsearch plugin 5.0.3 & 4.1.4

Note this is seen in multiple clusters.

After rolling back to v1.11 (ES plugin 4.1.1) the issue goes away (identical configuration).
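
For anyone needing the same mitigation, the rollback is just a matter of pinning the daemonset back to a v1.11-based image. A rough sketch (namespace, daemonset/container names, and image tag are illustrative, not our exact values):

```sh
# Sketch only: pin the Fluentd daemonset back to a v1.11-based image.
kubectl -n logging set image daemonset/fluentd \
  fluentd=fluent/fluentd-kubernetes-daemonset:v1.11-debian-elasticsearch7-1

# Wait for the rolled-back pods to become ready.
kubectl -n logging rollout status daemonset/fluentd
```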

See this link for full details.

https://github.com/uken/fluent-plugin-elasticsearch/issues/885

Having created a v1.12.3 / plugin v4.1.4 image and seen the same issues repeat, I no longer believe that this is a plugin issue; rather, it is a reconnect/buffer-write issue introduced with v1.12.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 21 (10 by maintainers)

Most upvoted comments

I’m now suspecting the following Ruby issue:

And a related excon issue:

td-agent 4.2.0 has been released: https://www.fluentd.org/blog/td-agent-v4.2.0-has-been-released Sorry for the delay.

We’ll close this after we release td-agent 4.2.0 (it will ship with Ruby 2.7.4).
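
(For those waiting on this: once td-agent 4.2.0 is installed you can confirm the bundled Ruby version directly. The commands below assume a default td-agent 4 host install under /opt/td-agent.)

```sh
# Check the installed Fluentd/td-agent version and its bundled Ruby.
td-agent --version
/opt/td-agent/bin/ruby --version   # expected to report 2.7.4 on td-agent 4.2.0
```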

@andrew-pickin-epi Me too. We downgraded to v1.11 and it looks OK. It’s a puzzling issue.