kubernetes: ci-kubernetes-node-kubelet-flaky has been constantly failing

Which jobs are failing: ci-kubernetes-node-kubelet-flaky

Which test(s) are failing:

  • [sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Embarrassingly Parallel (EP) workload
  • [sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads TensorFlow workload
  • [sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Integer Sort (IS) workload

Since when has it been failing: This has been failing since before 5/4 (a while).

Testgrid link: https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-flaky

Reason for failure:

[sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Embarrassingly Parallel (EP) workload

_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_perf_test.go:118
 Unexpected error:
     <*errors.errorString | 0xc00057e800>: {
         s: "pod ran to completion",
     }
     pod ran to completion
 occurred
 /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

[sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads TensorFlow workload

_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_perf_test.go:127
 Unexpected error:
     <*errors.errorString | 0xc00057e800>: {
         s: "pod ran to completion",
     }
     pod ran to completion
 occurred
 /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103

[sig-node] Node Performance Testing [Serial] [Slow] [Flaky] Run node performance testing with pre-defined workloads NAS parallel benchmark (NPB) suite - Integer Sort (IS) workload

_output/local/go/src/k8s.io/kubernetes/test/e2e_node/node_perf_test.go:109
 Unexpected error:
     <*errors.errorString | 0xc00057e800>: {
         s: "pod ran to completion",
     }
     pod ran to completion
 occurred
 /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/framework/pods.go:103
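One plausible reading of the identical "pod ran to completion" errors is that the framework waits for the workload pod to reach the Running phase, but the pod has already terminated by the time it is observed. The following is a minimal, self-contained Go sketch of that behavior, not the framework's actual code: the names PodPhase, checkRunning and waitForRunning are hypothetical, and the assumption that the failing call is a wait-for-running check is inferred from the error text only.

```go
package main

import (
	"errors"
	"fmt"
)

// PodPhase is a hypothetical local stand-in for the pod phase reported by the API.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// errPodCompleted carries the same message as the error in the logs above.
var errPodCompleted = errors.New("pod ran to completion")

// checkRunning is a hypothetical wait condition: it reports done when the pod
// is Running and returns an error when the pod has already terminated.
func checkRunning(phase PodPhase) (bool, error) {
	switch phase {
	case PodRunning:
		return true, nil
	case PodSucceeded, PodFailed:
		return false, errPodCompleted
	default:
		return false, nil
	}
}

// waitForRunning applies the condition to a sequence of observed phases,
// standing in for a polling loop against a real cluster.
func waitForRunning(observed []PodPhase) error {
	for _, phase := range observed {
		done, err := checkRunning(phase)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
	}
	return errors.New("timed out waiting for pod to be running")
}

func main() {
	// The pod is observed while Running: the wait succeeds.
	fmt.Println(waitForRunning([]PodPhase{PodPending, PodRunning}))
	// The pod finishes between polls, so Running is never observed: the wait fails.
	fmt.Println(waitForRunning([]PodPhase{PodPending, PodSucceeded}))
}
```

Under this reading, the three tests fail in the same way because the wait treats an already-completed pod as an error rather than as a successful run.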

Anything else we need to know:

The code for the failing tests (EP, TensorFlow, and IS, respectively) is at:

https://github.com/kubernetes/kubernetes/blob/6c9aab2b098d92adfeba76a4f542cac900393dd1/test/e2e_node/node_perf_test.go#L118

https://github.com/kubernetes/kubernetes/blob/6c9aab2b098d92adfeba76a4f542cac900393dd1/test/e2e_node/node_perf_test.go#L127

https://github.com/kubernetes/kubernetes/blob/6c9aab2b098d92adfeba76a4f542cac900393dd1/test/e2e_node/node_perf_test.go#L109

/sig node
/priority important-longterm
/kind failing-test
/assign

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 22 (22 by maintainers)

Most upvoted comments

Also, for future reference: I have been debugging these tests with this command:

time make test-e2e-node FOCUS="\[Flaky\]" SKIP="" PARALLELISM=1 REMOTE=true DELETE_INSTANCES=true IMAGE_CONFIG_FILE=perf-image-config.yaml TEST_ARGS='--feature-gates=DynamicKubeletConfig=true --server-start-timeout=160s'

It appears that the way to solve this is to change the machine instance type in the node config, for example here:

https://github.com/kubernetes/test-infra/blob/0952907d173f7bebbf11662f4c3ef5f412bbd756/jobs/e2e_node/benchmark-config.yaml#L6

Overall, the question now is whether to (1) move the tests to run on bigger machines, (2) tweak the tests in some other way, or (3) remove them. xref: https://github.com/kubernetes/test-infra/pull/17669#issuecomment-632179553