kubernetes: [Failing Test] gce-windows-2019-master (ci-kubernetes-e2e-windows-gce-2019)

Which jobs are failing: gce-windows-2019-master (ci-kubernetes-e2e-windows-gce-2019)

Which test(s) are failing: [sig-cli] Kubectl client Guestbook application should create and stop a working application [Conformance]

Since when has it been failing: 18th March 09:47 PDT

Testgrid link: https://testgrid.k8s.io/sig-release-master-informing#gce-windows-2019-master

Reason for failure:

Full Stack Trace
k8s.io/kubernetes/test/e2e/kubectl.validateGuestbookApp(0x534a9e0, 0xc00300f8c0, 0xc001af36a0, 0xc)
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl/kubectl.go:1858 +0x5d0
k8s.io/kubernetes/test/e2e/kubectl.glob..func1.7.2()
	/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/kubectl/kubectl.go:342 +0x165
k8s.io/kubernetes/test/e2e.RunE2ETests(0xc001a32300)
	_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:125 +0x324
k8s.io/kubernetes/test/e2e.TestE2E(0xc001a32300)
	_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:119 +0x2b
testing.tRunner(0xc001a32300, 0x4ae80c8)
	/usr/local/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:960 +0x350
STEP: using delete to clean up resources

Anything else we need to know: /cc @kubernetes/ci-signal /milestone v1.19 /priority important-soon /assign @soltysh /sig cli

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

A bug was introduced a couple days ago that is preventing Windows clusters on GCE from starting up. Will try to fix that tomorrow.

So, I figured out what’s happening.

As you can see, the test is flaky. When it fails, the following error can be seen in the agnhost worker pods:

2020/05/11 15:23:32 --slaveof param and/or --backend-port param are invalid. lookup agnhost-master: no such host

This means that either the DNS name could not be resolved or there was a general network failure. More precisely, what leads to this error is a race: the container (the agnhost app) starts before the pod networking has been fully set up, so when it tries to resolve agnhost-master, the lookup fails.

This can easily be observed by modifying the worker pod manifest (test/e2e/testing-manifests/guestbook/agnhost-slave-deployment.yaml.in) from:

args: [ "guestbook", "--slaveof", "agnhost-master", "--http-port", "6379" ]

to:

command: ["bash", "-c", "echo 'sleeping 5 seconds.' && sleep 5 && /agnhost guestbook --slaveof agnhost-master --http-port 6379"]

With this artificial delay in place, the test consistently passes.
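A cleaner way to get the same delay without rewriting the entrypoint would be an init container that blocks until the service name resolves. This is only a sketch: the initContainers stanza, the container name, and the busybox image are assumptions, not part of the actual manifest.

```yaml
# Hypothetical addition to the pod spec in agnhost-slave-deployment.yaml.in
# (not in the actual manifest): hold the main container back until
# agnhost-master resolves, instead of sleeping a fixed 5 seconds.
spec:
  initContainers:
  - name: wait-for-master
    image: busybox:1.36  # assumed image; any image with nslookup works
    command: ["sh", "-c", "until nslookup agnhost-master; do sleep 1; done"]
```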

Ideally, we would fix this issue by making sure that all the networking is fully set up before the container entrypoint starts. Alternatively, we could add a few retries to agnhost's guestbook subcommand.