kubernetes: ci-kubernetes-node-kubelet-flaky has been constantly failing
Which jobs are failing: ci-kubernetes-node-kubelet-flaky
Which test(s) are failing:
- [sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Embarrassingly Parallel (EP) workload
- [sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads TensorFlow workload
- [sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Integer Sort (IS) workload
Since when has it been failing: Since some time before 5/4; it has been failing for a while.
Testgrid link: https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-flaky
Reason for failure:
[sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Embarrassingly Parallel (EP) workload
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_perf_test.go:118
Unexpected error:
<*errors.errorString | 0xc00057e800>: {
s: "pod ran to completion",
}
pod ran to completion
occurred
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103
[sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads TensorFlow workload
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_perf_test.go:127
Unexpected error:
<*errors.errorString | 0xc00057e800>: {
s: "pod ran to completion",
}
pod ran to completion
occurred
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103
[sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Integer Sort (IS) workload
_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_perf_test.go:109
Unexpected error:
<*errors.errorString | 0xc00057e800>: {
s: "pod ran to completion",
}
pod ran to completion
occurred
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103
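For context, here is a minimal sketch of how a "pod ran to completion" error can come out of a wait-for-running check. It assumes the wait condition treats a pod that is already in a terminal phase (Succeeded or Failed) as an error rather than continuing to poll. The names `PodPhase`, `checkRunning`, and `waitForRunning` are hypothetical and exist only for illustration; this is not the framework's actual code, and the real helper polls the API server instead of a slice.

```go
package main

import (
	"errors"
	"fmt"
)

// PodPhase is a hypothetical stand-in for the pod phase reported by the API.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// errPodCompleted carries the same message that appears in the failure output.
var errPodCompleted = errors.New("pod ran to completion")

// checkRunning reports whether the pod is running. Assumption: a pod that is
// already in a terminal phase can never become Running, so it is an error.
func checkRunning(phase PodPhase) (bool, error) {
	switch phase {
	case PodRunning:
		return true, nil
	case PodSucceeded, PodFailed:
		return false, errPodCompleted
	}
	return false, nil // still pending, keep polling
}

// waitForRunning walks through a sequence of observed phases, one per poll,
// and stops at the first Running phase or the first error.
func waitForRunning(observed []PodPhase) error {
	for _, phase := range observed {
		done, err := checkRunning(phase)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
	}
	return errors.New("timed out waiting for pod to be running")
}

func main() {
	// The pod reached a terminal phase before any poll saw it Running.
	fmt.Println(waitForRunning([]PodPhase{PodPending, PodSucceeded})) // pod ran to completion
	// The pod was observed Running, so the wait succeeds.
	fmt.Println(waitForRunning([]PodPhase{PodPending, PodRunning})) // <nil>
}
```

Under this assumption, the error says only that the pod was in a terminal phase when it was checked; it does not distinguish a workload that succeeded from one that failed.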
Anything else we need to know:
The code for the tests is in https://github.com/kubernetes/kubernetes/blob/6c9aab2b098d92adfeba76a4f542cac900393dd1/test/e2e_node/node_perf_test.go#L118
and https://github.com/kubernetes/kubernetes/blob/6c9aab2b098d92adfeba76a4f542cac900393dd1/test/e2e_node/node_perf_test.go#L109 respectively
/sig node
/priority important-longterm
/kind failing-test
/assign
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 22 (22 by maintainers)
Also, for future reference: I have been debugging these tests with this command
So it appears that the way to solve this is to change the machine instance type in the node config, such as the one in
https://github.com/kubernetes/test-infra/blob/0952907d173f7bebbf11662f4c3ef5f412bbd756/jobs/e2e_node/benchmark-config.yaml#L6
Overall, the question now is whether to (1) move the tests to bigger machines, (2) tweak the tests in some other way, or (3) remove them. xref: https://github.com/kubernetes/test-infra/pull/17669#issuecomment-632179553