kubernetes: [Failing Test] node-kubelet-master (ci-kubernetes-node-kubelet)

Which jobs are failing:

node-kubelet-master (ci-kubernetes-node-kubelet)

Which test(s) are failing:

Node Tests

Since when has it been failing:

2020-04-06 3:44 PDT

Testgrid link:

https://k8s-testgrid.appspot.com/sig-release-master-blocking#node-kubelet-master

Reason for failure:

W0406 13:57:01.106] 2020/04/06 13:57:01 main.go:314: Something went wrong: encountered 1 errors: [error during go run /go/src/k8s.io/kubernetes/test/e2e_node/runner/remote/run_remote.go --cleanup --logtostderr --vmodule=*=4 --ssh-env=gce --results-dir=/workspace/_artifacts --project=k8s-jkns-ci-node-e2e --zone=us-west1-b --ssh-user=prow --ssh-key=/workspace/.ssh/google_compute_engine --ginkgo-flags=--nodes=8 --focus="\[NodeConformance\]" --skip="\[Flaky\]|\[Serial\]" --test_args=--kubelet-flags="--cgroups-per-qos=true --cgroup-root=/" --test-timeout=1h5m0s --image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml (interrupted): exit status 1]
W0406 13:57:01.110] Traceback (most recent call last):
W0406 13:57:01.110]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 779, in <module>
W0406 13:57:01.110]     main(parse_args())
W0406 13:57:01.110]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 627, in main
W0406 13:57:01.111]     mode.start(runner_args)
W0406 13:57:01.111]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 262, in start
W0406 13:57:01.111]     check_env(env, self.command, *args)
W0406 13:57:01.111]   File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0406 13:57:01.111]     subprocess.check_call(cmd, env=env)
W0406 13:57:01.111]   File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
W0406 13:57:01.112]     raise CalledProcessError(retcode, cmd)
W0406 13:57:01.112] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=node', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--gcp-project=k8s-jkns-ci-node-e2e', '--gcp-zone=us-west1-b', '--node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml', '--node-test-args=--kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"', '--node-tests=true', '--test_args=--nodes=8 --focus="\\[NodeConformance\\]" --skip="\\[Flaky\\]|\\[Serial\\]"', '--timeout=65m')' returned non-zero exit status 1

Anything else we need to know:

/cc @kubernetes/ci-signal /priority critical-urgent /milestone v1.19 /sig node

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

From https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet/1247299900119453700/build-log.txt:

I0406 23:10:31.875] unable to create gce instance with running docker daemon for image ubuntu-gke-1804-d1703-0-v20181113.  could not create instance tmp-node-e2e-3f80b00f-ubuntu-gke-1804-d1703-0-v20181113: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'IN_USE_ADDRESSES' exceeded.  Limit: 100.0 in region us-west1. ForceSendFields:[] NullFields:[]}]

The cause was the exceeded IN_USE_ADDRESSES quota in region us-west1.
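
For anyone re-checking the project's headroom, the quota named in the error can be read back from the Compute API. The sketch below is not part of the test-infra tooling and assumes Application Default Credentials with read access to the CI project:

```go
// Sketch only, assuming Application Default Credentials with access to the
// CI project; not part of the test-infra tooling.
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute service: %v", err)
	}

	// Read the regional quotas for the project named in the failing job.
	region, err := svc.Regions.Get("k8s-jkns-ci-node-e2e", "us-west1").Do()
	if err != nil {
		log.Fatalf("getting region: %v", err)
	}

	for _, q := range region.Quotas {
		if q.Metric == "IN_USE_ADDRESSES" {
			fmt.Printf("%s: %.0f of %.0f in use\n", q.Metric, q.Usage, q.Limit)
		}
	}
}
```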

It appears that all tests using this project are hitting the quota issue (https://testgrid.k8s.io/sig-node-kubelet). However, the underlying problem seems to be that on timeout the deferred cleanup is not being called here. Since there have been no recent changes to this test, my guess is that something made the tests run longer, triggering the timeouts, and this faulty cleanup-on-timeout behavior was then observed for the first time. The fix would be to repair the cleanup behavior and then investigate whether the timeout on these tests needs to be increased.
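
To make the suspected failure mode concrete, here is a minimal Go sketch. It is not the actual run_remote.go code; createAndTest and deleteInstance are hypothetical stand-ins. The point is that a defer placed inside the goroutine that runs the tests never executes if the caller times out and exits first:

```go
// Minimal sketch of the suspected failure mode; this is NOT the actual
// run_remote.go code, and createAndTest / deleteInstance are hypothetical
// stand-ins for the real helpers.
package main

import (
	"fmt"
	"os"
	"time"
)

// createAndTest simulates provisioning a VM and running the node e2e suite.
// The deferred cleanup only runs if this goroutine gets to finish.
func createAndTest(done chan<- error) {
	defer fmt.Println("deleteInstance: VM cleaned up") // never reached on timeout below
	time.Sleep(2 * time.Second)                        // stand-in for a slow test run
	done <- nil
}

func main() {
	done := make(chan error, 1)
	go createAndTest(done)

	select {
	case err := <-done:
		fmt.Println("tests finished:", err)
	case <-time.After(1 * time.Second):
		// The goroutine is abandoned and the process exits, so its deferred
		// cleanup never executes and the VM (plus its in-use address) leaks.
		// Cleanup would need to run explicitly on this path as well.
		fmt.Println("timed out; deferred cleanup never ran")
		os.Exit(1)
	}
}
```

Running the sketch, the deferred message never prints, which mirrors the leaked tmp-node-e2e-* instances in the project; moving the delete onto the caller's timeout path (or waiting for the goroutine to finish its defers before exiting) would stop the leak.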

For context on how this gets run (for other folks investigating), here are the results of my spelunking:

I checked the project, and almost all of the VM instances were created around April 6. Most likely some tests hit the timeout and the VMs were never cleaned up.

I deleted all but one VM in case anyone wanted to check the timeout issue.
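
For anyone repeating this kind of manual cleanup later, here is a sketch (again assuming Application Default Credentials, and not part of test-infra) that lists and deletes leftover tmp-node-e2e-* instances in the CI project and zone from the logs above:

```go
// Sketch only (not part of test-infra), assuming Application Default
// Credentials with delete access to the CI project.
package main

import (
	"context"
	"log"
	"strings"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	const (
		project = "k8s-jkns-ci-node-e2e" // project from the failing job
		zone    = "us-west1-b"           // zone from the failing job
		prefix  = "tmp-node-e2e-"        // prefix of the leaked instances in the logs
	)

	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute service: %v", err)
	}

	// List instances in the zone; a single page is enough for a sketch.
	list, err := svc.Instances.List(project, zone).Do()
	if err != nil {
		log.Fatalf("listing instances: %v", err)
	}

	for _, inst := range list.Items {
		if !strings.HasPrefix(inst.Name, prefix) {
			continue
		}
		log.Printf("deleting %s (created %s)", inst.Name, inst.CreationTimestamp)
		if _, err := svc.Instances.Delete(project, zone, inst.Name).Do(); err != nil {
			log.Printf("failed to delete %s: %v", inst.Name, err)
		}
	}
}
```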