kubernetes: [Failing Test] node-kubelet-master (ci-kubernetes-node-kubelet)
Which jobs are failing:
node-kubelet-master (ci-kubernetes-node-kubelet)
Which test(s) are failing:
Node Tests
Since when has it been failing:
04-06-20 3:44 PDT
Testgrid link:
https://k8s-testgrid.appspot.com/sig-release-master-blocking#node-kubelet-master
Reason for failure:
W0406 13:57:01.106] 2020/04/06 13:57:01 main.go:314: Something went wrong: encountered 1 errors: [error during go run /go/src/k8s.io/kubernetes/test/e2e_node/runner/remote/run_remote.go --cleanup --logtostderr --vmodule=*=4 --ssh-env=gce --results-dir=/workspace/_artifacts --project=k8s-jkns-ci-node-e2e --zone=us-west1-b --ssh-user=prow --ssh-key=/workspace/.ssh/google_compute_engine --ginkgo-flags=--nodes=8 --focus="\[NodeConformance\]" --skip="\[Flaky\]|\[Serial\]" --test_args=--kubelet-flags="--cgroups-per-qos=true --cgroup-root=/" --test-timeout=1h5m0s --image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml (interrupted): exit status 1]
W0406 13:57:01.110] Traceback (most recent call last):
W0406 13:57:01.110] File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 779, in <module>
W0406 13:57:01.110] main(parse_args())
W0406 13:57:01.110] File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 627, in main
W0406 13:57:01.111] mode.start(runner_args)
W0406 13:57:01.111] File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 262, in start
W0406 13:57:01.111] check_env(env, self.command, *args)
W0406 13:57:01.111] File "/workspace/./test-infra/jenkins/../scenarios/kubernetes_e2e.py", line 111, in check_env
W0406 13:57:01.111] subprocess.check_call(cmd, env=env)
W0406 13:57:01.111] File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
W0406 13:57:01.112] raise CalledProcessError(retcode, cmd)
W0406 13:57:01.112] subprocess.CalledProcessError: Command '('kubetest', '--dump=/workspace/_artifacts', '--gcp-service-account=/etc/service-account/service-account.json', '--up', '--down', '--test', '--deployment=node', '--provider=gce', '--cluster=bootstrap-e2e', '--gcp-network=bootstrap-e2e', '--gcp-project=k8s-jkns-ci-node-e2e', '--gcp-zone=us-west1-b', '--node-args=--image-config-file=/workspace/test-infra/jobs/e2e_node/image-config.yaml', '--node-test-args=--kubelet-flags="--cgroups-per-qos=true --cgroup-root=/"', '--node-tests=true', '--test_args=--nodes=8 --focus="\\[NodeConformance\\]" --skip="\\[Flaky\\]|\\[Serial\\]"', '--timeout=65m')' returned non-zero exit status 1
Anything else we need to know:
/cc @kubernetes/ci-signal /priority critical-urgent /milestone v1.19 /sig node
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 19 (19 by maintainers)
Commits related to this issue
- jobs: pin release-informing node jobs away from k8s-jkns-ci-node-e2e release-blocking jobs were already moved to a different GCP project in #17251 and #17257 due to network issues. This also moves th... — committed to hasheddan/test-infra by hasheddan 4 years ago
- Change project for Topology/CPU Manager CI jobs The Topology Manager and CPU Manager CI jobs have been failing for the past 2 weeks or so. The sad part is that the email notification was not working.... — committed to vpickard/test-infra by vpickard 4 years ago
From https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet/1247299900119453700/build-log.txt:
The cause was an exceeded GCE quota.
It appears all tests using this project are experiencing the quota issue (https://testgrid.k8s.io/sig-node-kubelet). However, the issue may stem from the fact that on timeout the deferred cleanup is not being called here. Since there have been no recent changes to this test, my guess is that something caused the tests to take longer, triggering the timeouts, and this faulty cleanup-on-timeout behavior was then observed for the first time. The solution would be to fix the cleanup behavior, then potentially investigate whether the timeout on these tests needs to be increased.
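As an illustration of why defer-based cleanup can be skipped on timeout, here is a minimal, hypothetical sketch (not the actual run_remote.go code; runSuite and the timeout value are made up): if the timeout path calls os.Exit, or the runner process is killed externally, deferred functions never run and the test VMs leak.

```go
// Minimal sketch of the cleanup-on-timeout hazard; names are illustrative only.
package main

import (
	"fmt"
	"os"
	"time"
)

func runSuite(timeout time.Duration) {
	// Deferred cleanup, analogous to deleting the test VMs.
	defer fmt.Println("cleanup: deleting test instances")

	done := make(chan struct{})
	go func() {
		time.Sleep(2 * time.Second) // stand-in for the node e2e run
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("tests finished")
	case <-time.After(timeout):
		fmt.Println("timed out")
		os.Exit(1) // exits immediately; the deferred cleanup never runs
	}
}

func main() {
	runSuite(1 * time.Second)
}
```

Returning an error up to main (or invoking the cleanup explicitly before exiting) instead of exiting deep in the call stack would let the cleanup run even when the suite times out.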
For context on how this gets run (for other folks investigating), here are results of my spelunking:
sig-release-master-blocking testgrid dashboard
I checked the project and almost all of the VM instances were created around April 6. Most likely some tests hit their timeout and the VMs weren’t cleaned up.
I deleted all but one VM in case anyone wanted to check the timeout issue.
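For anyone else auditing the project, here is a hedged sketch (not part of test-infra) of how the leaked instances and the region quota usage could be checked programmatically; it assumes Application Default Credentials with compute read access and reuses the project/zone from the job's flags above.

```go
// Sketch: list instances and non-zero quota usage for the affected project.
package main

import (
	"context"
	"fmt"
	"log"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	const (
		project = "k8s-jkns-ci-node-e2e"
		zone    = "us-west1-b"
		region  = "us-west1"
	)

	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("creating compute client: %v", err)
	}

	// Leaked VMs show up as instances whose creation timestamps cluster
	// around the first timeouts (April 6 here) with no test still running.
	instances, err := svc.Instances.List(project, zone).Do()
	if err != nil {
		log.Fatalf("listing instances: %v", err)
	}
	for _, inst := range instances.Items {
		fmt.Printf("%s\t%s\t%s\n", inst.Name, inst.Status, inst.CreationTimestamp)
	}

	// Quota usage for the region confirms whether the leak is what
	// exhausted the quota (e.g. CPUS or IN_USE_ADDRESSES).
	reg, err := svc.Regions.Get(project, region).Do()
	if err != nil {
		log.Fatalf("getting region: %v", err)
	}
	for _, q := range reg.Quotas {
		if q.Usage > 0 {
			fmt.Printf("%s: %.0f/%.0f\n", q.Metric, q.Usage, q.Limit)
		}
	}
}
```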