kubernetes: TestCascadingDeletion flaky [timed out waiting for the condition]

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/56097/pull-kubernetes-unit/67068/

go test -v k8s.io/kubernetes/test/integration/garbagecollector -run TestCascadingDeletion$
W1120 23:43:24.574825   19542 authentication.go:56] Authentication is disabled
I1120 23:43:24.581939   19542 serve.go:86] Serving securely on 127.0.0.1:39739
I1120 23:43:24.582037   19542 crd_finalizer.go:242] Starting CRDFinalizer
I1120 23:43:24.582077   19542 customresource_discovery_controller.go:152] Starting DiscoveryController
I1120 23:43:24.582117   19542 naming_controller.go:274] Starting NamingConditionController
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"apiextensions.k8s.io/v1beta1","resources":[{"name":"customresourcedefinitions","singularName":"","namespaced":false,"kind":"CustomResourceDefinition","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["crd"]},{"name":"customresourcedefinitions/status","singularName":"","namespaced":false,"kind":"CustomResourceDefinition","verbs":["update"]}]}
I1120 23:43:24.614636   19542 crd_finalizer.go:246] Shutting down CRDFinalizer
I1120 23:43:24.614731   19542 naming_controller.go:278] Shutting down NamingConditionController
E1120 23:43:24.614641   19542 customresource_discovery_controller.go:155] timed out waiting for caches to sync
I1120 23:43:24.615316   19542 customresource_discovery_controller.go:156] Shutting down DiscoveryController
I1120 23:43:24.614679   19542 serve.go:137] Stopped listening on 127.0.0.1:39739

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

E1120 23:43:24.614641 19542 customresource_discovery_controller.go:155] timed out waiting for caches to sync

This should not occur unless stopCh was closed, but there are no further log lines after it. From the CI log we can also see the kube-apiserver starting and stopping many times, so I do not know why this happens.
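For context, here is a minimal sketch of the client-go cache-sync pattern such controllers typically use. Only `cache.WaitForCacheSync` and `cache.InformerSynced` are the real client-go API; the surrounding function is simplified and assumed. The point is that `WaitForCacheSync` returns false both when the caches genuinely fail to sync and when stopCh is closed first, and both paths surface as the same "timed out waiting for caches to sync" error:

```go
// Simplified, assumed controller snippet; only the client-go calls are real.
package example

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForSync blocks until all supplied HasSynced functions report true or
// stopCh is closed. cache.WaitForCacheSync returns false in the latter case
// too, so shutting the server down while a controller is still waiting
// produces the same error message as a genuine sync failure.
func waitForSync(stopCh <-chan struct{}, synced ...cache.InformerSynced) error {
	if !cache.WaitForCacheSync(stopCh, synced...) {
		// Reached both on a real sync timeout and on an early stopCh close,
		// which is why this log line alone does not say which case occurred.
		return fmt.Errorf("timed out waiting for caches to sync")
	}
	return nil
}
```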

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 23 (23 by maintainers)

Most upvoted comments

Finally tracked it down. I can consistently recreate this by artificially triggering the “too old resource version” error (which is what we were seeing in all the failed test job logs) on first replication controller watch establishment. That results in the following flow:

| time | pod informer | rc informer | test driver | graph builder | attemptToDelete |
|------|--------------|-------------|-------------|---------------|-----------------|
| 0 | list pods, get [] at rv=10 | list rcs, get [] at rv=10 | | | |
| 1 | mark synced | mark synced | | | |
| 2 | | | create rc.test.1 at rv=11 | | |
| 3 | | | create pod.test at rv=12 | | |
| 4 | watch at rv=10 | watch at rv=10, get "too old" error | | | |
| 5 | observe pod.test create at rv=12 | | | | |
| 6 | | | | process pod.test create at rv=12 | |
| 7 | | | | add virtual node for rc.test.1 | |
| 8 | | | | add node for pod.test | |
| 9 | | | | | lookup rc.test.1 |
| 10 | | | | | find rc.test.1 at rv=11 |
| 11 | | | delete rc at rv=13 | | |
| 12 | | list rcs, get [] at rv=13 | | | |
| 13 | | watch at rv=13 | | | |

That leaves us with a node in the GC graph for rc.test.1 that was confirmed by direct lookup in attemptToDelete, but never observed via informer events. Because rc.test.1 was created AND deleted during the one second watch outage, the informer will never deliver a deletion event for that object.
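To make that mechanism concrete, here is an illustrative sketch (assumed function and variable names, not the actual reflector/DeltaFIFO code) of why no event is ever produced for rc.test.1: after a "too old resource version" error the informer relists and reconciles its store against the current object set only, so anything created and deleted entirely inside the watch gap appears in neither side of the diff.

```go
// Illustrative only: assumed names, not the real informer internals.
// rc.test.1 was created at rv=11 and deleted at rv=13, both inside the watch
// gap, so it is absent from both the store and the fresh list and no add or
// delete event is ever emitted for it.
package example

func diffAfterRelist(store map[string]bool, currentList []string) (adds, deletes []string) {
	current := map[string]bool{}
	for _, name := range currentList {
		current[name] = true
		if !store[name] {
			adds = append(adds, name) // appeared during the gap and still exists
		}
	}
	for name := range store {
		if !current[name] {
			deletes = append(deletes, name) // known before the gap, now gone
		}
	}
	return adds, deletes // rc.test.1 shows up in neither slice
}
```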

@liggitt If we explicitly sleep more than 1 second, this probability can be reduced.

Where are you thinking we would add the sleep? I am more interested in tracking the virtual nodes that have not been observed via informer more rigorously, to avoid them getting orphaned in the graph, rather than trying to time our way out of this bug. Watch re-establishment can take longer than one second (under some circumstances, significantly longer).
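A rough sketch of that direction (all names here are hypothetical, not the garbage collector's actual types): keep virtual nodes in a separate set until an informer event confirms them, and periodically re-verify the remaining ones with a direct lookup so a node like rc.test.1 is eventually removed from the graph even though its delete event was never delivered.

```go
// Hypothetical sketch only; these names are not the real garbage collector API.
package example

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// objectReference is a simplified stand-in for the GC's owner-reference record.
type objectReference struct {
	UID             types.UID
	Namespace, Name string
}

// virtualNodeTracker remembers graph nodes that were added speculatively from
// ownerReferences and have not yet been confirmed by any informer event.
type virtualNodeTracker struct {
	mu         sync.Mutex
	unobserved map[types.UID]objectReference
}

func newVirtualNodeTracker() *virtualNodeTracker {
	return &virtualNodeTracker{unobserved: map[types.UID]objectReference{}}
}

// add records a virtual node the moment it is created in the graph.
func (t *virtualNodeTracker) add(ref objectReference) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.unobserved[ref.UID] = ref
}

// observed clears a node once a real add/update/delete informer event arrives.
func (t *virtualNodeTracker) observed(uid types.UID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.unobserved, uid)
}

// pendingVerification lists nodes that still need a direct API lookup; a
// NotFound result would mean the virtual node should be removed from the graph.
func (t *virtualNodeTracker) pendingVerification() []objectReference {
	t.mu.Lock()
	defer t.mu.Unlock()
	refs := make([]objectReference, 0, len(t.unobserved))
	for _, ref := range t.unobserved {
		refs = append(refs, ref)
	}
	return refs
}
```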

Good news is I was able to reproduce the issue with stress (took ~300 iterations to trigger).
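For reference, a typical way to drive this with the stress tool (golang.org/x/tools/cmd/stress); the output path and flags below are an assumed invocation, not the exact command used:

go test -c -o gc.test k8s.io/kubernetes/test/integration/garbagecollector
stress ./gc.test -test.run TestCascadingDeletion$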

this pattern seems common in the failed jobs:

  1. we get a watch failure on replication controllers, e.g. W1219 16:44:47.799199 23324 reflector.go:341] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.ReplicationController ended with: too old resource version: 4291 (4860)
  2. we are missing a graph_builder.go:588] GraphBuilder process object log message for the test.rc.1 replication controller

the only three events we have for test.rc.1 in failed runs:

  • I1219 16:44:48.199297 23324 graph_builder.go:396] add virtual node.identity: [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]
  • I1219 16:44:48.199391 23324 garbagecollector.go:374] processing item [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]
  • I1219 16:44:48.209573 23324 garbagecollector.go:411] object [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]'s doesn't have an owner, continue on next item