kubernetes: Flaky timeouts while waiting for RC pods to be running in density test

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 29 (28 by maintainers)

Most upvoted comments

@janetkuo @enisoc - thanks a lot, that’s super useful!

So I did a bit more debugging. In every case I’ve looked at, the sequence is:

  1. The DaemonSet controller removes the fluentd pod from the node.
  2. The latency-pod is scheduled on that node.
  3. The DaemonSet controller recreates fluentd on that node.
  4. The kubelet preempts the latency-pod to make room for fluentd, which is a critical pod (see the sketch below).

So that flow seems reasonable.
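For anyone following along, here’s a minimal Go sketch of what makes step 4 fire, assuming the critical-pod mechanism of this era (the `scheduler.alpha.kubernetes.io/critical-pod` annotation on kube-system pods; newer releases use priority classes instead). Treat it as illustrative of the behavior, not the kubelet’s exact code:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// Annotation the kubelet checked for at the time of this issue; critical-pod
// handling has since been replaced by pod priority classes. (Assumption:
// this mirrors the era's behavior, not the current code path.)
const criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"

// isCriticalPod approximates the kubelet's check: only kube-system pods
// carrying the annotation counted as critical, and the kubelet could
// preempt other pods on the node to admit them.
func isCriticalPod(pod *v1.Pod) bool {
	if pod.Namespace != "kube-system" {
		return false
	}
	_, ok := pod.Annotations[criticalPodAnnotation]
	return ok
}

func main() {
	fluentd := &v1.Pod{}
	fluentd.Namespace = "kube-system"
	fluentd.Annotations = map[string]string{criticalPodAnnotation: ""}
	// true: admitting fluentd may preempt the latency-pod on that node
	fmt.Println(isCriticalPod(fluentd))
}
```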

My only remaining question is why the DaemonSet controller is deleting and recreating fluentd on some nodes. Importantly, this does not happen on all nodes.

As an example, in this run: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11664/artifacts/e2e-big-master/

There were 100 nodes, but over the whole run (~40 minutes) the controller-manager recreated fluentd pods on only 39 of them. @janetkuo @enisoc what is triggering those recreations?

“why replication controller created two pods (and only then why it didn’t remove the second one)”

It looks like one of the RC’s pods went to the Failed phase? An RC’s pods never reach Failed through normal restarts, because their restartPolicy is Always, so it must have been the kubelet.

When the kubelet evicts a pod, it sets the pod’s phase to Failed but doesn’t delete the evicted pod (#54525). @enisoc we’ve debugged another issue related to this.
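To make that symptom concrete, here’s a hedged client-go sketch for spotting the leftovers: evicted pods linger in the API as Failed objects with status reason "Evicted". The List signature below assumes a recent client-go (older versions take no context argument), and the kubeconfig path is just a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"os"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// listEvictedPods prints pods that the kubelet evicted but nothing deleted:
// they stay in the API with phase Failed and reason "Evicted".
func listEvictedPods(client kubernetes.Interface, namespace string) error {
	pods, err := client.CoreV1().Pods(namespace).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase == v1.PodFailed && pod.Status.Reason == "Evicted" {
			fmt.Printf("%s/%s: evicted but not deleted\n", pod.Namespace, pod.Name)
		}
	}
	return nil
}

func main() {
	// Assumes a kubeconfig at the default path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("HOME")+"/.kube/config")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	if err := listEvictedPods(client, metav1.NamespaceAll); err != nil {
		panic(err)
	}
}
```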

The RC doesn’t take any inactive pods (failed, succeeded, or terminating) into account: https://github.com/kubernetes/kubernetes/blob/02611149c181f7fd9ab116c40d7ee32fb5934b7c/pkg/controller/replicaset/replica_set.go#L607-L612. From the RC’s view there was only 1 active replica, so the RC didn’t try to remove any more replicas.
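For reference, the filter behind that link boils down to the following (paraphrased from controller.IsPodActive; the exact code is at the permalink above):

```go
package controllersketch

import v1 "k8s.io/api/core/v1"

// isPodActive paraphrases the linked filter: a pod counts toward the
// RC/RS replica total only if it hasn't succeeded, hasn't failed, and
// isn't being deleted. An evicted pod is Failed, so it silently drops
// out of the count even though its object still exists in the API.
func isPodActive(p *v1.Pod) bool {
	return p.Status.Phase != v1.PodSucceeded &&
		p.Status.Phase != v1.PodFailed &&
		p.DeletionTimestamp == nil
}
```

That’s consistent with the two-pods symptom above: after the eviction the RC saw its active count drop, created a replacement, and then had no reason to delete the lingering Failed pod.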