kubernetes: [Failing test] diffResources in master-upgrade

Which jobs are failing:

  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new
  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster

Which test(s) are failing: diffResources

Since when has it been failing: 2019-03-01

Testgrid link:

Reason for failure:

Error: 12 leaked resources
+NAME                                                            ZONE        SIZE_GB  TYPE         STATUS
+bootstrap-e2e-78e337f9-3d2f-11e9-a656-2acb4231a6ed              us-west1-b  2        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-796bbc5d-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-85e2b93a-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-97d1f1dd-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY

Anything else we need to know: Might be related: https://github.com/kubernetes/kubernetes/issues/74417 (a recent diffResources test failure) and https://github.com/kubernetes/kubernetes/issues/74887 (which started failing in the same jobs at the same time).

/sig storage /sig testing /kind failing-test /priority critical-urgent /milestone v1.14

cc @smourapina @alejandrox1 @kacole2 @mortent @kubernetes/sig-storage-test-failures

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 39 (35 by maintainers)

Most upvoted comments

We’re aiming for a Go 1.12.1 soon: tonight at earliest, but likely tomorrow. Or worst case Friday if there are surprises.

I’ve started a discussion with the release team. Hopefully soon. It’s been long enough since the Go 1.12(.0) release and we have a few things on or ready for the Go 1.12 branch. We’ll keep you updated.

/cc @andybon @dmitshur @bcmills @ianlancetaylor

Go 1.12.1 is out: https://golang.org/dl/

(And a Go 1.11.x update.)

Do we have a list of the golang project issues/PRs/commits we’re looking at here? That is, do we have concrete reason to believe some specific relevant fixes would be in a 1.12.1 (hopefully soon)?

@bradfitz to give some context, we are currently scheduled to lift our code freeze by Tuesday March 19th, and stop accepting changes by Thursday March 21st (https://github.com/kubernetes/sig-release/tree/master/releases/release-1.14#timeline)

I’m willing to push out code freeze for a day or two for this, but if we can’t try go1.12.1 by the 19th, my gut says we need to revert. It would really make our lives easier if it were this week.

We’ll need a go1.12.1 for that… we shouldn’t try to sweep every use of sync.Once to work around the go1.12 bug; there are too many uses inside vendor/.

Per https://github.com/golang/go/commit/91fd14b82493e592730a3e459ef6610195b854c2, something else to try (which is actually the better fix on the test side here) is to declare the deferred function as a named function rather than deferring an anonymous closure.

pseudo:

func readyTest(once *sync.Once, sem *chaosmonkey.Semaphore) {
	// The sync.Once ensures sem.Ready() is signaled at most once,
	// whether we get here via the defer or via the explicit call below.
	once.Do(func() {
		sem.Ready()
	})
}

func (cma *chaosMonkeyAdapter) Test(sem *chaosmonkey.Semaphore) {
	start := time.Now()
	var once sync.Once
	...
	defer readyTest(&once, sem) // named deferred function instead of an anonymous closure
	...
	readyTest(&once, sem)
	...
}
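
For anyone who wants to poke at the pattern outside the e2e framework, here is a minimal, self-contained sketch; the fakeSemaphore type below is a stand-in for chaosmonkey.Semaphore, purely for illustration:

package main

import (
	"fmt"
	"sync"
)

// fakeSemaphore stands in for chaosmonkey.Semaphore in this sketch.
type fakeSemaphore struct{ readyCalls int }

func (s *fakeSemaphore) Ready() { s.readyCalls++ }

// readyTest is the named deferred function: the sync.Once ensures
// Ready() is signaled at most once, no matter how the test exits.
func readyTest(once *sync.Once, sem *fakeSemaphore) {
	once.Do(func() { sem.Ready() })
}

func testBody(sem *fakeSemaphore) {
	var once sync.Once
	defer readyTest(&once, sem) // covers every exit path, including panics
	// ... setup ...
	readyTest(&once, sem) // explicit mid-test signal
	// ... rest of the test ...
}

func main() {
	sem := &fakeSemaphore{}
	testBody(sem)
	fmt.Println("Ready() called", sem.readyCalls, "time(s)") // prints: 1
}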

CAS idea is in #75305. I don’t have a better idea than trying it and seeing if it works…

I did ping @krousey and (I think) he concurs. Your analysis went much deeper than mine, @neolit123 😃 I was thinking it was something to do with copying the mutex somehow.
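
For context on the “copying the mutex” idea: sync.Once contains a sync.Mutex, so passing a struct holding one by value copies the lock, which is what go vet’s copylocks check flags. A hypothetical sketch (not the actual kubernetes code path, just an illustration of the kind of copy vet would catch):

package main

import "sync"

// adapter is a hypothetical struct with a sync.Once field (which in turn
// contains a sync.Mutex).
type adapter struct {
	once sync.Once
}

// run receives the adapter by value, so the embedded Once and its Mutex
// are copied. Copying a lock that might be in use is unsafe, which is why
// `go vet` reports something like "run passes lock by value" here.
func run(a adapter) {
	a.once.Do(func() {})
}

func main() {
	var a adapter
	a.once.Do(func() {}) // use the original
	run(a)               // then copy it: the classic copylocks mistake
}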

What I propose is that we get a compare-and-swap in there (I’ll send a PR), and if that does fix it, we then get the attention of the Go team, since that would point to a 1.12 regression.
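
A minimal sketch of what the compare-and-swap variant could look like (the flag and function names below are made up for illustration; see the actual PR for what was really changed):

package main

import (
	"fmt"
	"sync/atomic"
)

// fakeSemaphore again stands in for chaosmonkey.Semaphore.
type fakeSemaphore struct{ readyCalls int }

func (s *fakeSemaphore) Ready() { s.readyCalls++ }

// readySignaled is a hypothetical flag replacing sync.Once:
// 0 = Ready() not yet called, 1 = already called.
var readySignaled int32

// signalReady calls sem.Ready() at most once, guarded by an atomic
// compare-and-swap instead of a sync.Once.
func signalReady(sem *fakeSemaphore) {
	if atomic.CompareAndSwapInt32(&readySignaled, 0, 1) {
		sem.Ready()
	}
}

func main() {
	sem := &fakeSemaphore{}
	defer signalReady(sem) // defensive signal on exit
	signalReady(sem)       // explicit mid-test signal
	fmt.Println("Ready() called", sem.readyCalls, "time(s)") // prints: 1
}

The trade-off versus sync.Once is simply that the guard is a plain int32, so there is no embedded mutex to end up in the inconsistent state seen in the build logs.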

I haven’t been able to reproduce the problem in isolation.

In https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster/2826/build-log.txt, I see that the first thing the test does is upgrade the cluster, and that failed:

I0305 04:24:27.797] fatal error: sync: inconsistent mutex state
I0305 04:24:27.887] Test Suite Failed
W0305 04:24:27.988] !!! Error in ./hack/ginkgo-e2e.sh:143
W0305 04:24:27.989]   Error in ./hack/ginkgo-e2e.sh:143. '"${ginkgo}" "${ginkgo_args[@]:+${ginkgo_args[@]}}" "${e2e_test}" -- "${auth_config[@]:+${auth_config[@]}}" --ginkgo.flakeAttempts="${FLAKE_ATTEMPTS}" --host="${KUBE_MASTER_URL}" --provider="${KUBERNETES_PROVIDER}" --gce-project="${PROJECT:-}" --gce-zone="${ZONE:-}" --gce-region="${REGION:-}" --gce-multizone="${MULTIZONE:-false}" --gke-cluster="${CLUSTER_NAME:-}" --kube-master="${KUBE_MASTER:-}" --cluster-tag="${CLUSTER_ID:-}" --cloud-config-file="${CLOUD_CONFIG:-}" --repo-root="${KUBE_ROOT}" --node-instance-group="${NODE_INSTANCE_GROUP:-}" --prefix="${KUBE_GCE_INSTANCE_PREFIX:-e2e}" --network="${KUBE_GCE_NETWORK:-${KUBE_GKE_NETWORK:-e2e}}" --node-tag="${NODE_TAG:-}" --master-tag="${MASTER_TAG:-}" --cluster-monitoring-mode="${KUBE_ENABLE_CLUSTER_MONITORING:-standalone}" --prometheus-monitoring="${KUBE_ENABLE_PROMETHEUS_MONITORING:-false}" --dns-domain="${KUBE_DNS_DOMAIN:-cluster.local}" --ginkgo.slowSpecThreshold="${GINKGO_SLOW_SPEC_THRESHOLD:-300}" ${KUBE_CONTAINER_RUNTIME:+"--container-runtime=${KUBE_CONTAINER_RUNTIME}"} ${MASTER_OS_DISTRIBUTION:+"--master-os-distro=${MASTER_OS_DISTRIBUTION}"} ${NODE_OS_DISTRIBUTION:+"--node-os-distro=${NODE_OS_DISTRIBUTION}"} ${NUM_NODES:+"--num-nodes=${NUM_NODES}"} ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} "${@:-}"' exited with status 1

I think the aborted upgrade test is what caused the resources to leak. For example, here’s one leaked resource:

+bootstrap-e2e-dynamic-pvc-8b28e181-3efb-11e9-ad32-42010a8a0002  us-west1-b  1        pd-standard  READY

In the job, it is used by a pod that’s part of the upgrade test:

I0305 14:00:17.224541       1 pv_controller.go:512] synchronizing PersistentVolume[pvc-8b28e181-3efb-11e9-ad32-42010a8a0002]: phase: Bound, bound to: "sig-apps-statefulset-upgrade-7984/datadir-ss-0 (uid: 8b28e181-3efb-11e9-ad32-42010a8a0002)", boundByController: true