kubernetes: [Failing test] diffResources in master-upgrade
Which jobs are failing:
- ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new
- ci-kubernetes-e2e-gce-new-master-upgrade-cluster
Which test(s) are failing:
diffResources
Since when has it been failing: 2019-03-01
Testgrid link:
- https://testgrid.k8s.io/sig-release-master-upgrade#gce-new-master-upgrade-cluster
- https://testgrid.k8s.io/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new
Reason for failure:
Error: 12 leaked resources
+NAME ZONE SIZE_GB TYPE STATUS
+bootstrap-e2e-78e337f9-3d2f-11e9-a656-2acb4231a6ed us-west1-b 2 pd-standard READY
+bootstrap-e2e-dynamic-pvc-796bbc5d-3d2f-11e9-9e04-42010a8a0002 us-west1-b 1 pd-standard READY
+bootstrap-e2e-dynamic-pvc-85e2b93a-3d2f-11e9-9e04-42010a8a0002 us-west1-b 1 pd-standard READY
+bootstrap-e2e-dynamic-pvc-97d1f1dd-3d2f-11e9-9e04-42010a8a0002 us-west1-b 1 pd-standard READY
Anything else we need to know: Might be related: https://github.com/kubernetes/kubernetes/issues/74417 (a recent diffResources test failure) and https://github.com/kubernetes/kubernetes/issues/74887 (which started failing in the same jobs at the same time).
/sig storage /sig testing /kind failing-test /priority critical-urgent /milestone v1.14
cc @smourapina @alejandrox1 @kacole2 @mortent
cc @kubernetes/sig-storage-test-failures
About this issue
- State: closed
- Created 5 years ago
- Comments: 39 (35 by maintainers)
Commits related to this issue
- Speculative workaround for #74890: We try using an atomic with a CAS, as a potential workaround for issue #74890. Kudos to @neolit123 for the investigation & idea. This is a speculative workaround -... — committed to justinsb/kubernetes by justinsb 5 years ago
- Merge pull request #75305 from justinsb/workaround_once_mutex_issue Speculative workaround for #74890 — committed to kubernetes/kubernetes by k8s-ci-robot 5 years ago
- Revert "Speculative workaround for #74890" — committed to neolit123/kubernetes by neolit123 5 years ago
We’re aiming for a Go 1.12.1 soon: tonight at earliest, but likely tomorrow. Or worst case Friday if there are surprises.
I’ve started a discussion with the release team. Hopefully soon. It’s been long enough since the Go 1.12(.0) release and we have a few things on or ready for the Go 1.12 branch. We’ll keep you updated.
/cc @andybon @dmitshur @bcmills @ianlancetaylor
Go 1.12.1 is out: https://golang.org/dl/
(And a Go 1.11.x update.)
Do we have a list of the Go project issues/PRs/commits we’re looking at here? That is, do we have concrete reason to believe that specific relevant fixes would land in a 1.12.1 (hopefully soon)?
@bradfitz to give some context, we are currently scheduled to lift our code freeze by Tuesday March 19th, and stop accepting changes by Thursday March 21st (https://github.com/kubernetes/sig-release/tree/master/releases/release-1.14#timeline)
I’m willing to push out code freeze for a day or two for this, but if we can’t try go1.12.1 by the 19th my gut says we need to revert. Would really make our lives easier if it was this week.
we’ll need a go1.12.1 for that… we shouldn’t try to sweep all uses of sync.Once to work around the go1.12 bug… there are too many uses inside vendor
as per https://github.com/golang/go/commit/91fd14b82493e592730a3e459ef6610195b854c2 something else to try (which is actually the better test solution here) is to declare the symbol of the deferred function.
pseudo:
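A rough sketch of that shape in Go (the names `setupOnce`, `setup`, and `markSetupDone` are hypothetical, and this is not the code path the Go commit actually touches): defer a function that is declared with its own symbol instead of an anonymous closure at the defer site.

```go
package workaround

import "sync"

var (
	setupOnce sync.Once
	resource  string
)

// markSetupDone is declared as a named function (its own symbol) rather
// than being written inline as `defer func() { ... }()`.
func markSetupDone() {
	// bookkeeping that would otherwise live in an anonymous deferred closure
}

// setup defers the declared function by name.
func setup() {
	defer markSetupDone()
	resource = "initialized"
}

// getResource runs setup exactly once and returns the shared value.
func getResource() string {
	setupOnce.Do(setup)
	return resource
}
```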
CAS idea is in #75305. I don’t have a better idea than trying it and seeing if it works…
I did ping @krousey and (I think) he concurs. Your analysis was much deeper than what I did, @neolit123 😃 I was thinking it was something to do with copying the mutex somehow.
What I propose is that we get a compare-and-swap in there (I’ll send a PR), and then get the attention of the Go team, since it feels like a 1.12 regression if the CAS does fix it.
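For context, the general shape of such a compare-and-swap guard looks roughly like the sketch below (hypothetical names, not the actual diff in #75305):

```go
package workaround

import "sync/atomic"

// initialized is 0 until some goroutine wins the race to run the
// one-time initialization.
var initialized int32

// runOnceCAS runs f at most once across all callers, guarding it with an
// atomic compare-and-swap instead of sync.Once.
func runOnceCAS(f func()) {
	if atomic.CompareAndSwapInt32(&initialized, 0, 1) {
		f()
	}
}
```

One semantic difference from sync.Once.Do: callers that lose the CAS return immediately rather than blocking until f has completed, which is only acceptable if nothing depends on the initialization having finished.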
I haven’t been able to reproduce the problem in isolation.
In https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster/2826/build-log.txt, I see that the first thing the test does is upgrade the cluster, and that upgrade failed.
I think the aborted upgrade test is what caused the resources to leak. For example, one of the leaked resources is used in that job by a pod that’s part of the upgrade test.