fluentd: CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3
Describe the bug
When upgrading Fluentd from 1.11.2 to the latest 1.12.3, our soak tests detected that memory and CPU usage increased by a non-trivial percentage. As a result, Fluentd could no longer keep up with a throughput of 1000 log entries / second (the baseline of the soak test).
The diff between the two is:
gem uninstall fluentd -ax --force
gem install fluentd -v 1.12.3
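To repeat the same swap for the intermediate versions discussed in this thread, the two commands above can be wrapped in a small helper. This is only a sketch: the function names `run` and `install_fluentd` and the `DRY_RUN` variable are hypothetical and not part of the original setup; the gem commands are the ones shown above.

```shell
# Hypothetical wrapper: run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: replace the installed fluentd gem with the given version,
# using the same two commands as in the diff above.
install_fluentd() {
  version="$1"
  run gem uninstall fluentd -ax --force
  run gem install fluentd -v "$version"
}
```

For example, `DRY_RUN=1 install_fluentd 1.12.1` prints the two commands without touching the installed gems, which is a cheap way to check what an image build for a given version would run.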
Expected behavior
CPU and memory usage is stable and log entries are not dropped.
Your Environment
Fluentd runs inside a Debian 9 based container in GKE with a fixed throughput of 1000 log entries per second per VM. (There are 3 VMs, so the total is 3000 log entries per second.)
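The log generator used by the soak test is not shown in this issue. As a rough illustration of what "fixed throughput" means here, a minimal sketch of a generator that emits batches of synthetic lines could look like the following. The function name `emit_batch` and the line format are assumptions, not the actual test harness.

```shell
# Hypothetical helper: print COUNT synthetic log lines for a given batch id.
emit_batch() {
  count="$1"
  batch="$2"
  i=1
  while [ "$i" -le "$count" ]; do
    echo "soak-test batch=$batch seq=$i"
    i=$((i + 1))
  done
}

# Usage (not run here): roughly 1000 lines per second on one VM.
# The rate is approximate because the time spent printing is not subtracted.
#   batch=0
#   while :; do emit_batch 1000 "$batch"; batch=$((batch + 1)); sleep 1; done
```

Carrying a batch id and a sequence number in each line makes it possible to detect dropped entries on the receiving side by looking for gaps.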
Your Configuration
https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 19 (11 by maintainers)
I’m setting up more experiments for the versions in between these two to narrow down the root cause.
Looks like 1.13.3 has better memory usage.
Glad that helped. I just set up an experiment with 1.12.4 as well. Will report back once we get some results. It typically takes a few days for the issue to start revealing itself.
Thanks for the report! It seems 1.12.4 is at the same level as 1.12.3, so it doesn’t resolve the issue. We should check the changes in 1.11.5 - 1.12.0.rc2 again.
@mtbtrifork Thanks for your report! It’s very informative. I’m now suspecting Ruby or excon. As I mentioned at https://github.com/fluent/fluentd/issues/3382#issuecomment-849479658 , there are similar reports.
BTW, we should discuss this cause at #3382, because that issue has the same cause as yours, whereas this issue (#3389) has not yet been confirmed to share it.
At the company where I’m currently employed we’re using fluentd on 50+ servers via td-agent .deb packages available from http://packages.treasuredata.com/4/ubuntu/bionic/ . Almost all of the servers run .deb package version 4.1.0-1, i.e. fluentd version 1.12.1, and we have so far only experienced the 100% CPU issue on servers where we coincidentally upgraded to 4.1.1-1 / 1.12.3. My point is that the issue can perhaps be narrowed down to the changes between versions 1.12.1 and 1.12.3.
When the 100% CPU issue is present, a sigdump (kill -CONT) appears to always show a running thread with the randomisation going on in Ruby’s resolv.rb, which I have seen mentioned in https://github.com/fluent/fluentd/issues/3387#issuecomment-847850364 . I find it difficult to believe, however, that Ruby’s resolv.rb code should be causing the issue, as the td-agent .deb package for 1.12.1 seems to ship with the exact same Ruby version (2.7.0, and thereby the exact same resolv.rb) as the package for 1.12.3. And, as mentioned, we believe we don’t see the issue with 1.12.1.
Sadly, we have not found a way to reliably reproduce the issue, and I don’t have anything else to add at this stage.
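For anyone who wants to check their own dumps for the same pattern, here is a minimal sketch. It assumes sigdump's default settings, under which the dump triggered by SIGCONT is written to /tmp/sigdump-<pid>.log; the function names `sigdump_path` and `count_resolv_frames` are hypothetical helpers, not part of fluentd or sigdump.

```shell
# Path where sigdump writes its dump by default
# (assumption: SIGDUMP_PATH has not been overridden).
sigdump_path() {
  printf '/tmp/sigdump-%s.log\n' "$1"
}

# Count the lines in a dump file that mention resolv.rb.
# grep -c exits non-zero when the count is 0, so the status is ignored.
count_resolv_frames() {
  grep -c 'resolv\.rb' "$1" || true
}

# Usage (not run here), with pid set to the fluentd worker's process id:
#   kill -CONT "$pid" && sleep 1
#   count_resolv_frames "$(sigdump_path "$pid")"
```

A non-zero count only shows that some thread was inside resolv.rb when the dump was taken; as noted above, that alone does not prove resolv.rb is the root cause.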