kubernetes: Fluentd-scaler causing fluentd pod deletions and messing with ds-controller

Forking from https://github.com/kubernetes/kubernetes/issues/60500#issuecomment-373121164:

To summarize, here’s what we observed:

  1. PATCH DaemonSet calls are coming every minute from both fluentd-scaler and addon-manager (verified by turning each one on and off individually). Things we need to understand here:
  • Is the DaemonSet object continuously toggling between two states? (We do know that its resourceVersion is increasing continuously.) See the sketch after this list for one way to observe this.
  • If yes, which field(s) in the object are changing? IIRC the value of some label/annotation (I think ‘UpdatedPodsScheduled’) is changing (probably related to 2 below).
  • Also, why should the fluentd-scaler send any API request at all if the resources are already set to the right value?
  2. Fluentd pods are getting deleted and recreated by the DaemonSet controller when the scaler is enabled (as was also seen in https://github.com/kubernetes/kubernetes/issues/60500#issuecomment-373001797). Why is this happening? One thing to note: all those DELETE calls are preceded by PUT pod-status calls from the respective kubelets (but maybe that’s expected).
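
For the first question, here’s a minimal sketch (using client-go; the kubeconfig path, the `kube-system` namespace, and the DaemonSet name `fluentd-gcp-v3.0.0` are assumptions for illustration) that watches the DaemonSet and prints its resourceVersion, labels, and annotations on every update, so we can see exactly which fields flip between the two states:

```go
// Minimal sketch, assuming client-go, a reachable kubeconfig, and that the
// fluentd DaemonSet is named fluentd-gcp-v3.0.0 in kube-system (assumptions).
// Prints every update's resourceVersion plus labels/annotations so we can
// spot whether the object is toggling between two states.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the fluentd DaemonSet (name is an assumption).
	w, err := client.AppsV1().DaemonSets("kube-system").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "metadata.name=fluentd-gcp-v3.0.0",
	})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		ds, ok := ev.Object.(*appsv1.DaemonSet)
		if !ok {
			continue
		}
		// If only a label/annotation flips back and forth between two values,
		// the scaler and addon-manager are likely fighting over the object.
		fmt.Printf("%s rv=%s labels=%v annotations=%v\n",
			ev.Type, ds.ResourceVersion, ds.Labels, ds.Annotations)
	}
}
```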

cc @kubernetes/sig-instrumentation-bugs @crassirostris @liggitt
/priority critical-urgent
/assign @x13n

Most upvoted comments

I spoke offline with @x13n and suggested that we should increase maxUnavailable for the fluentd daemonset to a large enough value so that we’re not bottlenecked by it. My reasoning is:

  • fluentd is not a user workload that would suffer service downtime if many pods are unavailable at once
  • the fluentds on different nodes are independent of each other (i.e. one fluentd shouldn’t need to wait for another node’s fluentd to come up)
  • we should make the scaling happen as fast as possible - otherwise the fluentds may fall too far behind the increased log load in the cluster before the rollout completes

I’m going to make that change and test it against my PR (thanks @x13n for pointing out that we can change maxUnavailable directly in the DaemonSet config).
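
For reference, a rough sketch of that change as a strategic merge patch via client-go (the DaemonSet name and the `100%` value are illustrative assumptions, not necessarily what the PR will use):

```go
// Minimal sketch, assuming client-go and a reachable kubeconfig: raise
// maxUnavailable on the fluentd DaemonSet's RollingUpdate strategy so many
// pods can be replaced in parallel. The DaemonSet name and the "100%" value
// are assumptions for illustration only.
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Strategic merge patch that only touches the rolling update knob.
	patch := []byte(`{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"maxUnavailable":"100%"}}}}`)
	_, err = client.AppsV1().DaemonSets("kube-system").Patch(
		context.TODO(), "fluentd-gcp-v3.0.0", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("maxUnavailable updated")
}
```

Setting the same field directly in the DaemonSet manifest shipped with the addons would have the same effect; the patch above is just the programmatic equivalent.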

@wojtek-t @jdumars Feel free to override me if you have a good reason 😃