kubernetes: Fluentd-scaler causing fluentd pod deletions and messing with ds-controller

Forking from https://github.com/kubernetes/kubernetes/issues/60500#issuecomment-373121164:

To summarize, here’s what we observed:

  1. PATCH DaemonSet calls are coming every minute from both fluentd-scaler and addon-manager (verified by turning each one on and off individually). Things we need to understand here:
  • Is the DaemonSet object continuously toggling between two states? (We do know that its resourceVersion is increasing continuously.) See the sketch after this list for one way to observe this.
  • If yes, which field(s) in the object are changing? IIRC the value of some label/annotation (I think ‘UpdatedPodsScheduled’) is changing (probably related to 2 below).
  • Also, why should the fluentd-scaler send any API request at all if the resources are already set to the right value?
  2. Fluentd pods are getting deleted and recreated by the DaemonSet controller when the scaler is enabled (as was also seen in https://github.com/kubernetes/kubernetes/issues/60500#issuecomment-373001797). Why is this happening? One thing to note: all those DELETE calls are preceded by PUT pod-status calls from the respective kubelets (but maybe that’s expected).
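
For the first question, here’s a minimal sketch (using client-go; the kubeconfig path, the `kube-system` namespace, and the DaemonSet name `fluentd-gcp-v3.0.0` are assumptions for illustration) that watches the DaemonSet and prints its resourceVersion, labels, and annotations on every update, so we can see exactly which fields flip between the two states:

```go
// Minimal sketch, assuming client-go, a reachable kubeconfig, and that the
// fluentd DaemonSet is named fluentd-gcp-v3.0.0 in kube-system (assumptions).
// Prints every update's resourceVersion plus labels/annotations so we can
// spot whether the object is toggling between two states.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the fluentd DaemonSet (name is an assumption).
	w, err := client.AppsV1().DaemonSets("kube-system").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=fluentd-gcp-v3.0.0",
	})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		ds, ok := ev.Object.(*appsv1.DaemonSet)
		if !ok {
			continue
		}
		// If only a label/annotation flips back and forth between two values,
		// the scaler and addon-manager are likely fighting over the object.
		fmt.Printf("%s rv=%s labels=%v annotations=%v\n",
			ev.Type, ds.ResourceVersion, ds.Labels, ds.Annotations)
	}
}
```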

cc @kubernetes/sig-instrumentation-bugs @crassirostris @liggitt
/priority critical-urgent
/assign @x13n

Most upvoted comments

I spoke offline with @x13n and suggested that we should increase maxUnavailable for the fluentd daemonset to a large enough value so that we’re not bottlenecked by it. My reasoning is:

  • fluentd is not a user workload that would suffer service downtime if many pods are unavailable at once
  • the fluentds on different nodes are independent of each other (i.e. one fluentd shouldn’t need to wait for another node’s fluentd to come up)
  • we should make the scaling happen as fast as possible - otherwise the fluentds may fall too far behind the increased log load in the cluster before the rollout completes

I’m going to make that change and test it against my PR (thanks @x13n for pointing out that we can change maxUnavailable directly in the DaemonSet config).
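
For reference, a rough sketch of that change as a strategic merge patch via client-go (the DaemonSet name and the `100%` value are illustrative assumptions, not necessarily what the PR will use):

```go
// Minimal sketch, assuming client-go and a reachable kubeconfig: raise
// maxUnavailable on the fluentd DaemonSet's RollingUpdate strategy so many
// pods can be replaced in parallel. The DaemonSet name and the "100%" value
// are assumptions for illustration only.
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Strategic merge patch that only touches the rolling update knob.
	patch := []byte(`{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":"100%"}}}}`)
	_, err = client.AppsV1().DaemonSets("kube-system").Patch(
		context.TODO(), "fluentd-gcp-v3.0.0", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("maxUnavailable updated")
}
```

Setting the same field directly in the DaemonSet manifest shipped with the addons would have the same effect; the patch above is just the programmatic equivalent.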

@wojtek-t @jdumars Feel free to override me if you have a good reason 😃