kubernetes: TestCascadingDeletion flaky [timed out waiting for the condition]
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/pr-logs/pull/56097/pull-kubernetes-unit/67068/
go test -v k8s.io/kubernetes/test/integration/garbagecollector -run TestCascadingDeletion$
W1120 23:43:24.574825 19542 authentication.go:56] Authentication is disabled
I1120 23:43:24.581939 19542 serve.go:86] Serving securely on 127.0.0.1:39739
I1120 23:43:24.582037 19542 crd_finalizer.go:242] Starting CRDFinalizer
I1120 23:43:24.582077 19542 customresource_discovery_controller.go:152] Starting DiscoveryController
I1120 23:43:24.582117 19542 naming_controller.go:274] Starting NamingConditionController
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"apiextensions.k8s.io/v1beta1","resources":[{"name":"customresourcedefinitions","singularName":"","namespaced":false,"kind":"CustomResourceDefinition","verbs":["create","delete","deletecollection","get","list","patch","update","watch"],"shortNames":["crd"]},{"name":"customresourcedefinitions/status","singularName":"","namespaced":false,"kind":"CustomResourceDefinition","verbs":["update"]}]}
I1120 23:43:24.614636 19542 crd_finalizer.go:246] Shutting down CRDFinalizer
I1120 23:43:24.614731 19542 naming_controller.go:278] Shutting down NamingConditionController
E1120 23:43:24.614641 19542 customresource_discovery_controller.go:155] timed out waiting for caches to sync
I1120 23:43:24.615316 19542 customresource_discovery_controller.go:156] Shutting down DiscoveryController
I1120 23:43:24.614679 19542 serve.go:137] Stopped listening on 127.0.0.1:39739
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
E1120 23:43:24.614641 19542 customresource_discovery_controller.go:155] timed out waiting for caches to sync
This should not occur unless stopCh was closed, but there are no further log lines to explain it. From the CI log we can also see the kube-apiserver starting and stopping many times, so I do not know why this happens.
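For context on why closing stopCh is the only expected cause: cache.WaitForCacheSync in client-go only returns false when its stop channel fires, which is exactly the branch that logs this error. A minimal sketch of that behavior, assuming nothing about the controller's actual code beyond that call:

```go
package main

import (
	"fmt"
	"time"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
)

func main() {
	stopCh := make(chan struct{})

	// A sync check that never reports synced, standing in for informers
	// that cannot sync because the test apiserver is going away.
	neverSynced := func() bool { return false }

	// Simulate the integration test tearing everything down (closing stopCh).
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(stopCh)
	}()

	// WaitForCacheSync polls the sync funcs until they all return true or
	// stopCh is closed; only the latter makes it return false.
	if !cache.WaitForCacheSync(stopCh, neverSynced) {
		// This branch produces the same kind of error as the log line above.
		utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
		return
	}
	fmt.Println("caches synced")
}
```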
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
About this issue
- State: closed
- Created 7 years ago
- Comments: 23 (23 by maintainers)
Finally tracked it down. I can consistently recreate this by artificially triggering the “too old resource version” error (which is what we were seeing in all the failed test job logs) on first replication controller watch establishment. That results in the following flow:
That leaves us with a node in the GC graph for rc.test.1 that was confirmed by direct lookup in attemptToDelete, but never observed via informer events. Because rc.test.1 was created AND deleted during the one second watch outage, the informer will never deliver a deletion event for that object.
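For illustration, here is a hedged sketch of how such a watch failure could be injected in a test. The flakyListWatcher type is hypothetical, not the actual reproducer used in the investigation, but apierrors.NewResourceExpired is the real apimachinery error that reflectors treat as a watch expiry:

```go
// Hypothetical test helper: fail the first Watch call with an "Expired"
// (too old resource version) error so the reflector must back off and relist,
// opening a window in which an object created and deleted during the outage
// never produces informer events.
package example

import (
	"sync/atomic"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/tools/cache"
)

type flakyListWatcher struct {
	cache.ListerWatcher       // the real lister/watcher for ReplicationControllers
	watchCalls          int64 // number of Watch calls observed so far
}

func (f *flakyListWatcher) Watch(options metav1.ListOptions) (watch.Interface, error) {
	if atomic.AddInt64(&f.watchCalls, 1) == 1 {
		// Mirrors the error seen in the failed jobs:
		// "watch of *v1.ReplicationController ended with: too old resource version".
		return nil, apierrors.NewResourceExpired("too old resource version")
	}
	return f.ListerWatcher.Watch(options)
}
```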
Where are you thinking we would add the sleep? I am more interested in tracking the virtual nodes that have not been observed via informer more rigorously, to avoid them getting orphaned in the graph, rather than trying to time our way out of this bug. Watch re-establishment can take longer than one second (under some circumstances, significantly longer).
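A rough sketch of that "track virtual nodes more rigorously" idea, with illustrative names only (node, processVirtualNode, and the callback parameters are stand-ins, not the garbage collector's actual types):

```go
// Illustrative sketch only: keep unobserved ("virtual") nodes on the work
// queue instead of forgetting them after a single successful direct lookup.
package example

import "sync"

type node struct {
	mu sync.Mutex
	// virtual is true while the node exists only because attemptToDelete
	// looked the object up directly; it becomes false once the informer
	// delivers a real event for the object.
	virtual bool
}

func (n *node) markObserved() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.virtual = false
}

func (n *node) isObserved() bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	return !n.virtual
}

// processVirtualNode keeps a virtual node on the queue until either the
// informer observes it (real events take over) or the API confirms the
// object is gone (the node can be dropped from the graph).
func processVirtualNode(n *node, existsInAPI func() (bool, error), requeue, removeFromGraph func(*node)) {
	if n.isObserved() {
		return
	}
	exists, err := existsInAPI()
	switch {
	case err != nil:
		requeue(n) // transient error: try again later
	case !exists:
		removeFromGraph(n) // confirmed absent: safe to drop the virtual node
	default:
		requeue(n) // exists but still unobserved: keep tracking it
	}
}
```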
Good news is I was able to reproduce the issue with stress (it took ~300 iterations to trigger). Opened https://github.com/kubernetes/kubernetes/pull/57503 with a fix.
this pattern seems common in the failed jobs:
- W1219 16:44:47.799199 23324 reflector.go:341] k8s.io/kubernetes/vendor/k8s.io/client-go/informers/factory.go:86: watch of *v1.ReplicationController ended with: too old resource version: 4291 (4860)
- no graph_builder.go:588] "GraphBuilder process object" log message for the test.rc.1 replication controller
- the only three events we have for test.rc.1 in failed runs:
I1219 16:44:48.199297 23324 graph_builder.go:396] add virtual node.identity: [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]
I1219 16:44:48.199391 23324 garbagecollector.go:374] processing item [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]
I1219 16:44:48.209573 23324 garbagecollector.go:411] object [v1/ReplicationController, namespace: gc-cascading-deletion, name: test.rc.1, uid: ecaed94f-e4db-11e7-88ab-0242ac110002]'s doesn't have an owner, continue on next item