kubernetes: Flaky timeouts while waiting for RC pods to be running in density test
Follows from https://github.com/kubernetes/kubernetes/issues/60500#issuecomment-372395317.
We’re seeing flakes like the following in our density test:
Error while waiting for replication controller density-latency-pod-54 pods to be running: Timeout while waiting for pods with labels "name=density-latency-pod-54,type=density-latency-pod" to be running
E.g.:
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11568
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11567
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11560
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11553
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11545
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11528
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11507
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11497
- https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11492
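
For reference, the wait that is timing out is essentially a poll over the pods matched by the RC’s label selector, checking that the expected number of them reach the Running phase. Below is a minimal sketch of that logic using client-go; it is not the actual e2e framework helper, and the function name, poll interval, and namespace handling are assumptions. Only the label selector comes from the error above.

```go
// Minimal sketch (not the e2e framework code) of the failing wait: poll the pods
// matching the RC's label selector until the expected number is Running, or time out.
package densitydebug

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForRCPodsRunning(c kubernetes.Interface, ns, selector string, replicas int, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		pods, err := c.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		running := 0
		for _, p := range pods.Items {
			if p.Status.Phase == v1.PodRunning {
				running++
			}
		}
		// The flake above means this condition never became true within the timeout.
		return running >= replicas, nil
	})
}
```

With the selector from the failing RC this would be called roughly as `waitForRCPodsRunning(client, ns, "name=density-latency-pod-54,type=density-latency-pod", replicas, timeout)`.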
I’m not sure whether this should be a release-blocker, but we need to understand why this is happening.
/sig scalability
/kind bug
/priority important-soon
cc @wojtek-t
About this issue
- State: closed
- Created 6 years ago
- Comments: 29 (28 by maintainers)
@janetkuo @enisoc - thanks a lot, that’s super useful!
So I did a bit more debugging, and what is happening in all the cases I’ve seen is:
So that flow seems reasonable.
My only question now is why the daemon-set controller is deleting and recreating fluentd on some nodes. Importantly, this does not happen on all nodes.
As an example, in this run: http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/11664/artifacts/e2e-big-master/
There were 100 nodes, but during the whole run (~40 minutes) the controller-manager recreated fluentd pods on only 39 of them. @janetkuo @enisoc, what is triggering those recreations?
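
One way to narrow this down (a hedged sketch, not something from the test itself) would be to dump the events recorded against the fluentd DaemonSet on an affected cluster; `SuccessfulCreate`/`SuccessfulDelete` pairs from the DaemonSet controller would show whether it is the one deleting and recreating the pods. The namespace and DaemonSet name below are placeholders.

```go
// Hedged debugging sketch: list events recorded against the fluentd DaemonSet to
// see when and why its pods were deleted/recreated. The DaemonSet name is a
// placeholder; on these clusters fluentd runs in kube-system under a versioned name.
package densitydebug

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func dumpFluentdDaemonSetEvents(ctx context.Context, c kubernetes.Interface) error {
	events, err := c.CoreV1().Events("kube-system").List(ctx, metav1.ListOptions{
		// involvedObject.* are supported field selectors for events; the name is a placeholder.
		FieldSelector: "involvedObject.kind=DaemonSet,involvedObject.name=fluentd-gcp",
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		// SuccessfulDelete followed by SuccessfulCreate would point at the
		// DaemonSet controller itself doing the recreation.
		fmt.Printf("%s %s: %s\n", e.LastTimestamp.Time, e.Reason, e.Message)
	}
	return nil
}
```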
It looks like one of the RC’s pods Failed? RC never sets its pods to the Failed state, because its `restartPolicy` is `Always`. It must be kubelet. When kubelet evicts pods, it sets the pod state to Failed but doesn’t delete the evicted pods (#54525). @enisoc we’ve debugged another issue related to this.
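
A small sketch of how such leftovers show up (assuming a configured client; nothing here is from the test code): kubelet-evicted pods stay in the API with phase Failed and reason "Evicted", so they can be found by scanning pod status.

```go
// Sketch: find pods that kubelet evicted but that are still present in the API
// (phase Failed, reason Evicted), per the behavior described in #54525.
package densitydebug

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func listEvictedPods(ctx context.Context, c kubernetes.Interface, ns string) ([]v1.Pod, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var evicted []v1.Pod
	for _, p := range pods.Items {
		// Kubelet-evicted pods keep this terminal status instead of being deleted.
		if p.Status.Phase == v1.PodFailed && p.Status.Reason == "Evicted" {
			evicted = append(evicted, p)
		}
	}
	return evicted, nil
}
```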
RC doesn’t take any inactive pods (failed, succeeded, terminating) into account: https://github.com/kubernetes/kubernetes/blob/02611149c181f7fd9ab116c40d7ee32fb5934b7c/pkg/controller/replicaset/replica_set.go#L607-L612. From RC’s view there’s only 1 active replica, so RC didn’t try to remove more replicas.
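
Roughly, the filter at those lines boils down to the following (paraphrased, not copied verbatim): anything Succeeded, Failed, or already terminating is dropped before the RC compares actual vs. desired replicas, which is why an evicted pod silently vanishes from its accounting.

```go
// Paraphrase of the RC/RS active-pod filter: a pod only counts toward the active
// replica set if it hasn't terminated and isn't being deleted.
package densitydebug

import v1 "k8s.io/api/core/v1"

func isPodActive(p *v1.Pod) bool {
	return p.Status.Phase != v1.PodSucceeded &&
		p.Status.Phase != v1.PodFailed &&
		p.DeletionTimestamp == nil
}

// filterActivePods keeps only pods that still count toward the RC's replica count;
// evicted (Failed) pods drop out here.
func filterActivePods(pods []*v1.Pod) []*v1.Pod {
	var active []*v1.Pod
	for _, p := range pods {
		if isPodActive(p) {
			active = append(active, p)
		}
	}
	return active
}
```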