kubernetes: [Failing test] diffResources in master-upgrade

Which jobs are failing:

  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster-new
  • ci-kubernetes-e2e-gce-new-master-upgrade-cluster

Which test(s) are failing: diffResources

Since when has it been failing: 2019-03-01

Testgrid link:

Reason for failure:

Error: 12 leaked resources
+NAME                                                            ZONE        SIZE_GB  TYPE         STATUS
+bootstrap-e2e-78e337f9-3d2f-11e9-a656-2acb4231a6ed              us-west1-b  2        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-796bbc5d-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-85e2b93a-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY
+bootstrap-e2e-dynamic-pvc-97d1f1dd-3d2f-11e9-9e04-42010a8a0002  us-west1-b  1        pd-standard  READY

Anything else we need to know: Might be related: https://github.com/kubernetes/kubernetes/issues/74417 (a recent diffResources test failure) and https://github.com/kubernetes/kubernetes/issues/74887 (which started failing in the same jobs at the same time).

/sig storage /sig testing /kind failing-test /priority critical-urgent /milestone v1.14

cc @smourapina @alejandrox1 @kacole2 @mortent @kubernetes/sig-storage-test-failures

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 39 (35 by maintainers)

Most upvoted comments

We’re aiming for a Go 1.12.1 soon: tonight at earliest, but likely tomorrow. Or worst case Friday if there are surprises.

I’ve started a discussion with the release team. Hopefully soon. It’s been long enough since the Go 1.12(.0) release and we have a few things on or ready for the Go 1.12 branch. We’ll keep you updated.

/cc @andybon @dmitshur @bcmills @ianlancetaylor

Go 1.12.1 is out: https://golang.org/dl/

(And a Go 1.11.x update.)

Do we have a list of the golang project issues/PRs/commits we’re looking at here? That is, do we have concrete reason to believe some specific relevant fixes would be in a 1.12.1 (hopefully soon)?

@bradfitz to give some context, we are currently scheduled to lift our code freeze by Tuesday March 19th, and stop accepting changes by Thursday March 21st (https://github.com/kubernetes/sig-release/tree/master/releases/release-1.14#timeline)

I’m willing to push out code freeze for a day or two for this, but if we can’t try go1.12.1 by the 19th, my gut says we need to revert. It would really make our lives easier if it were this week.

We’ll need a go1.12.1 for that… we shouldn’t try to sweep every use of sync.Once to work around the go1.12 bug; there are too many uses inside vendor/.

Per https://github.com/golang/go/commit/91fd14b82493e592730a3e459ef6610195b854c2, something else to try (which is actually the better fix on the test side here) is to declare the deferred function as a named function rather than deferring an anonymous closure.

pseudo:

func readyTest(once *sync.Once, sem *chaosmonkey.Semaphore) {
	// The sync.Once ensures sem.Ready() is signaled at most once,
	// whether we get here via the defer or via the explicit call below.
	once.Do(func() {
		sem.Ready()
	})
}

func (cma *chaosMonkeyAdapter) Test(sem *chaosmonkey.Semaphore) {
	start := time.Now()
	var once sync.Once
	...
	defer readyTest(&once, sem) // named deferred function instead of an anonymous closure
	...
	readyTest(&once, sem)
	...
}
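
For anyone who wants to poke at the pattern outside the e2e framework, here is a minimal, self-contained sketch; the fakeSemaphore type below is a stand-in for chaosmonkey.Semaphore, purely for illustration:

package main

import (
	"fmt"
	"sync"
)

// fakeSemaphore stands in for chaosmonkey.Semaphore in this sketch.
type fakeSemaphore struct{ readyCalls int }

func (s *fakeSemaphore) Ready() { s.readyCalls++ }

// readyTest is the named deferred function: the sync.Once ensures
// Ready() is signaled at most once, no matter how the test exits.
func readyTest(once *sync.Once, sem *fakeSemaphore) {
	once.Do(func() { sem.Ready() })
}

func testBody(sem *fakeSemaphore) {
	var once sync.Once
	defer readyTest(&once, sem) // covers every exit path, including panics
	// ... setup ...
	readyTest(&once, sem) // explicit mid-test signal
	// ... rest of the test ...
}

func main() {
	sem := &fakeSemaphore{}
	testBody(sem)
	fmt.Println("Ready() called", sem.readyCalls, "time(s)") // prints: 1
}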

CAS idea is in #75305. I don’t have a better idea than trying it and seeing if it works…

I did ping @krousey and (I think) he concurs. Your analysis went much deeper than mine, @neolit123 😃 I was thinking it was something to do with copying the mutex somehow.
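
For context on the “copying the mutex” idea: sync.Once contains a sync.Mutex, so passing a struct holding one by value copies the lock, which is what go vet’s copylocks check flags. A hypothetical sketch (not the actual kubernetes code path, just an illustration of the kind of copy vet would catch):

package main

import "sync"

// adapter is a hypothetical struct with a sync.Once field (which in turn
// contains a sync.Mutex).
type adapter struct {
	once sync.Once
}

// run receives the adapter by value, so the embedded Once and its Mutex
// are copied. Copying a lock that might be in use is unsafe, which is why
// `go vet` reports something like "run passes lock by value" here.
func run(a adapter) {
	a.once.Do(func() {})
}

func main() {
	var a adapter
	a.once.Do(func() {}) // use the original
	run(a)               // then copy it: the classic copylocks mistake
}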

What I propose is that we get a compare-and-swap in there (I’ll send a PR), and if that does fix it, we then get the attention of the Go team, since that would point to a 1.12 regression.
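
A minimal sketch of what the compare-and-swap variant could look like (the flag and function names below are made up for illustration; see the actual PR for what was really changed):

package main

import (
	"fmt"
	"sync/atomic"
)

// fakeSemaphore again stands in for chaosmonkey.Semaphore.
type fakeSemaphore struct{ readyCalls int }

func (s *fakeSemaphore) Ready() { s.readyCalls++ }

// readySignaled is a hypothetical flag replacing sync.Once:
// 0 = Ready() not yet called, 1 = already called.
var readySignaled int32

// signalReady calls sem.Ready() at most once, guarded by an atomic
// compare-and-swap instead of a sync.Once.
func signalReady(sem *fakeSemaphore) {
	if atomic.CompareAndSwapInt32(&readySignaled, 0, 1) {
		sem.Ready()
	}
}

func main() {
	sem := &fakeSemaphore{}
	defer signalReady(sem) // defensive signal on exit
	signalReady(sem)       // explicit mid-test signal
	fmt.Println("Ready() called", sem.readyCalls, "time(s)") // prints: 1
}

The trade-off versus sync.Once is simply that the guard is a plain int32, so there is no embedded mutex to end up in the inconsistent state seen in the build logs.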

I haven’t been able to reproduce the problem in isolation.

In https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-cluster/2826/build-log.txt, I see that the first thing the test does is upgrade the cluster, and that failed:

I0305 04:24:27.797] fatal error: sync: inconsistent mutex state
I0305 04:24:27.887] Test Suite Failed
W0305 04:24:27.988] !!! Error in ./hack/ginkgo-e2e.sh:143
W0305 04:24:27.989]   Error in ./hack/ginkgo-e2e.sh:143. '"${ginkgo}" "${ginkgo_args[@]:+${ginkgo_args[@]}}" "${e2e_test}" -- "${auth_config[@]:+${auth_config[@]}}" --ginkgo.flakeAttempts="${FLAKE_ATTEMPTS}" --host="${KUBE_MASTER_URL}" --provider="${KUBERNETES_PROVIDER}" --gce-project="${PROJECT:-}" --gce-zone="${ZONE:-}" --gce-region="${REGION:-}" --gce-multizone="${MULTIZONE:-false}" --gke-cluster="${CLUSTER_NAME:-}" --kube-master="${KUBE_MASTER:-}" --cluster-tag="${CLUSTER_ID:-}" --cloud-config-file="${CLOUD_CONFIG:-}" --repo-root="${KUBE_ROOT}" --node-instance-group="${NODE_INSTANCE_GROUP:-}" --prefix="${KUBE_GCE_INSTANCE_PREFIX:-e2e}" --network="${KUBE_GCE_NETWORK:-${KUBE_GKE_NETWORK:-e2e}}" --node-tag="${NODE_TAG:-}" --master-tag="${MASTER_TAG:-}" --cluster-monitoring-mode="${KUBE_ENABLE_CLUSTER_MONITORING:-standalone}" --prometheus-monitoring="${KUBE_ENABLE_PROMETHEUS_MONITORING:-false}" --dns-domain="${KUBE_DNS_DOMAIN:-cluster.local}" --ginkgo.slowSpecThreshold="${GINKGO_SLOW_SPEC_THRESHOLD:-300}" ${KUBE_CONTAINER_RUNTIME:+"--container-runtime=${KUBE_CONTAINER_RUNTIME}"} ${MASTER_OS_DISTRIBUTION:+"--master-os-distro=${MASTER_OS_DISTRIBUTION}"} ${NODE_OS_DISTRIBUTION:+"--node-os-distro=${NODE_OS_DISTRIBUTION}"} ${NUM_NODES:+"--num-nodes=${NUM_NODES}"} ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} "${@:-}"' exited with status 1

I think the aborted upgrade test is what caused the resources to leak. For example, here’s one leaked resource:

+bootstrap-e2e-dynamic-pvc-8b28e181-3efb-11e9-ad32-42010a8a0002  us-west1-b  1        pd-standard  READY

In the job, it is used by a pod that’s part of the upgrade test:

I0305 14:00:17.224541       1 pv_controller.go:512] synchronizing PersistentVolume[pvc-8b28e181-3efb-11e9-ad32-42010a8a0002]: phase: Bound, bound to: "sig-apps-statefulset-upgrade-7984/datadir-ss-0 (uid: 8b28e181-3efb-11e9-ad32-42010a8a0002)", boundByController: true