kubernetes: [Failing Test] node-kubelet-master (ci-kubernetes-node-kubelet)

Which jobs are failing:

node-kubelet-master (ci-kubernetes-node-kubelet)

Which test(s) are failing:

Node Tests

Since when has it been failing:

2020-04-06 3:44 PDT

Testgrid link:

https://k8s-testgrid.appspot.com/sig-release-master-blocking#node-kubelet-master

Reason for failure:

W0406 13:57:01.106] 2020/04/06 13:57:01 main.go:314: Something went wrong: encountered 1 errors: [error during go run /go/src/k8s.io/kubernetes/test/e2e_node/runner/remote/run_remote.go --cleanup --logtostderr --vmodule=*=4 --ssh-env=gce --results-dir=/workspace/_artifacts --project=k8s-jkns-ci-node-e2e --zone=us-west1-b --ssh-user=prow --ssh-key=/workspace/.ssh/google_compute_engine --ginkgo-flags=--nodes=8 --focus="\[NodeConformance\]" --skip="\[Flaky\]|\[Serial\]" --test_args=--kubelet-flags="--cgroups-per-qos=true --cgroup-root=/" --test-timeout=1h5m0s --image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml (interrupted): exit status 1]
W0406 13:57:01.110] Traceback (most recent call last):
W0406 13:57:01.110]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 779, in <module>
W0406 13:57:01.110]     main(parse_args())
W0406 13:57:01.110]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 627, in main
W0406 13:57:01.111]     mode.start(runner_args)
W0406 13:57:01.111]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 262, in start
W0406 13:57:01.111]     check_env(env, self.command, *args)
W0406 13:57:01.111]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0406 13:57:01.111]     subprocess.check_call(cmd, env=env)
W0406 13:57:01.111]   File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
W0406 13:57:01.112]     raise CalledProcessError(retcode, cmd)
W0406 13:57:01.112] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=node', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--gcp-project=k8s-jkns-ci-node-e2e', '--gcp-zone=us-west1-b', '--node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml', '--node-test-args=--kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"', '--node-tests=true', '--test_args=--nodes=8 --focus="\\[NodeConformance\\]" --skip="\\[Flaky\\]|\\[Serial\\]"', '--timeout=65m')' returned non-zero exit status 1

Anything else we need to know:

/cc @kubernetes/ci-signal /priority critical-urgent /milestone v1.19 /sig node

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

From https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet/1247299900119453700/build-log.txt:

I0406 23:10:31.875] unable to create gce instance with running docker daemon for image ubuntu-gke-1804-d1703-0-v20181113.  could not create instance tmp-node-e2e-3f80b00f-ubuntu-gke-1804-d1703-0-v20181113: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' exceeded.  Limit: 100.0 in region us-west1. ForceSendFields:[] NullFields:[]}]

The cause was the exceeded IN_USE_ADDRESSES quota in region us-west1.
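
For anyone re-checking the project's headroom, the quota named in the error can be read back from the Compute API. The sketch below is not part of the test-infra tooling and assumes Application Default Credentials with read access to the CI project:

```go
// Sketch only, assuming Application Default Credentials with access to the
// CI project; not part of the test-infra tooling.
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute service: %v", err)
	}

	// Read the regional quotas for the project named in the failing job.
	region, err := svc.Regions.Get("k8s-jkns-ci-node-e2e", "us-west1").Do()
	if err != nil {
		log.Fatalf("getting region: %v", err)
	}

	for _, q := range region.Quotas {
		if q.Metric == "IN_USE_ADDRESSES" {
			fmt.Printf("%s: %.0f of %.0f in use\n", q.Metric, q.Usage, q.Limit)
		}
	}
}
```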

It appears that all tests using this project are hitting the quota issue (https://testgrid.k8s.io/sig-node-kubelet). However, the underlying problem seems to be that on timeout the deferred cleanup is not being called here. Since there have been no recent changes to this test, my guess is that something made the tests run longer, triggering the timeouts, and this faulty cleanup-on-timeout behavior was then observed for the first time. The fix would be to repair the cleanup behavior and then investigate whether the timeout on these tests needs to be increased.
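
To make the suspected failure mode concrete, here is a minimal Go sketch. It is not the actual run_remote.go code; createAndTest and deleteInstance are hypothetical stand-ins. The point is that a defer placed inside the goroutine that runs the tests never executes if the caller times out and exits first:

```go
// Minimal sketch of the suspected failure mode; this is NOT the actual
// run_remote.go code, and createAndTest / deleteInstance are hypothetical
// stand-ins for the real helpers.
package main

import (
	"fmt"
	"os"
	"time"
)

// createAndTest simulates provisioning a VM and running the node e2e suite.
// The deferred cleanup only runs if this goroutine gets to finish.
func createAndTest(done chan<- error) {
	defer fmt.Println("deleteInstance: VM cleaned up") // never reached on timeout below
	time.Sleep(2 * time.Second)                        // stand-in for a slow test run
	done <- nil
}

func main() {
	done := make(chan error, 1)
	go createAndTest(done)

	select {
	case err := <-done:
		fmt.Println("tests finished:", err)
	case <-time.After(1 * time.Second):
		// The goroutine is abandoned and the process exits, so its deferred
		// cleanup never executes and the VM (plus its in-use address) leaks.
		// Cleanup would need to run explicitly on this path as well.
		fmt.Println("timed out; deferred cleanup never ran")
		os.Exit(1)
	}
}
```

Running the sketch, the deferred message never prints, which mirrors the leaked tmp-node-e2e-* instances in the project; moving the delete onto the caller's timeout path (or waiting for the goroutine to finish its defers before exiting) would stop the leak.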

For context on how this gets run (for other folks investigating), here are the results of my spelunking:

I checked the project, and almost all of the VM instances were created around April 6. Most likely some tests hit the timeout and the VMs were never cleaned up.

I deleted all but one VM in case anyone wanted to check the timeout issue.
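
For anyone repeating this kind of manual cleanup later, here is a sketch (again assuming Application Default Credentials, and not part of test-infra) that lists and deletes leftover tmp-node-e2e-* instances in the CI project and zone from the logs above:

```go
// Sketch only (not part of test-infra), assuming Application Default
// Credentials with delete access to the CI project.
package main

import (
	"context"
	"log"
	"strings"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	const (
		project = "k8s-jkns-ci-node-e2e" // project from the failing job
		zone    = "us-west1-b"           // zone from the failing job
		prefix  = "tmp-node-e2e-"        // prefix of the leaked instances in the logs
	)

	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute service: %v", err)
	}

	// List instances in the zone; a single page is enough for a sketch.
	list, err := svc.Instances.List(project, zone).Do()
	if err != nil {
		log.Fatalf("listing instances: %v", err)
	}

	for _, inst := range list.Items {
		if !strings.HasPrefix(inst.Name, prefix) {
			continue
		}
		log.Printf("deleting %s (created %s)", inst.Name, inst.CreationTimestamp)
		if _, err := svc.Instances.Delete(project, zone, inst.Name).Do(); err != nil {
			log.Printf("failed to delete %s: %v", inst.Name, err)
		}
	}
}
```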