istio: Test stability regression

Running summary:

  • Last prow update was 5/17 (https://github.com/istio/test-infra/commit/07e88ae5d09da7257ac863e3c5c052bbee754a84). No issues after this until 5/19.
  • On 5/19 we updated to kind v0.11.0.
  • Around the night of 5/19 we started seeing elevated test failures. These were not isolated to a single failure or job; things were just generally not working. Many of them were related to timeouts in the api-server, etc.
  • Assuming it was a regression in kind, we reverted kind v0.11.0 back to v0.10.0. The errors persisted.
  • After analyzing the failures and the cluster, nothing stood out. Failures happened on ~all nodes, node resource utilization looked normal, there were no errors, etc.
  • We updated docker to 20.10.6. No improvement.
  • We updated the build cluster’s k8s version. This was mostly to bounce all of the nodes. No improvement.
  • We reverted to the exact docker image from before kind v0.11.0 (rather than reverting the change in the Dockerfile and rebuilding). The same issues occurred.

Aside from the kind v0.11.0 update, there were no known changes to any of our infrastructure.

The failures seem to have one thing in common: containerd.

I have also checked our internal prow instance (which is a completely distinct build cluster, etc.), and we do see the “grpc: the client connection is closing” errors there as well; we don’t run kind at all there.
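To compare how widespread that error is across jobs or clusters, the log lines can be tallied per job. The following is only a minimal sketch: the `logs_by_job` mapping, the job names, and the sample lines are hypothetical stand-ins, and only the error string itself comes from the report above.

```python
from collections import Counter

# Error string quoted in the report above.
NEEDLE = "grpc: the client connection is closing"


def count_errors_by_job(logs_by_job, needle=NEEDLE):
    """Return a Counter mapping job name -> number of log lines containing needle.

    logs_by_job is a hypothetical mapping {job_name: iterable of log lines};
    how the logs are actually collected is outside the scope of this sketch.
    """
    counts = Counter()
    for job, lines in logs_by_job.items():
        hits = sum(1 for line in lines if needle in line)
        if hits:
            counts[job] = hits
    return counts


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "job-with-kind": [
            "rpc error: code = Canceled desc = grpc: the client connection is closing",
            "test passed",
        ],
        "job-without-kind": [
            "rpc error: code = Canceled desc = grpc: the client connection is closing",
        ],
        "healthy-job": ["test passed"],
    }
    for job, hits in count_errors_by_job(sample).most_common():
        print(job, hits)
```

If jobs that never start kind show up in the tally, that supports the observation that the error is not specific to kind.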

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 38 (37 by maintainers)

Most upvoted comments

@chaodaiG good call. Looking at the image:

Created time: April 9, 2021 at 7:18:01 AM UTC-7
Uploaded time: April 9, 2021 at 7:20:08 AM UTC-7

Assuming that is reliable, I don’t think it’s changed.
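The same check can be written down as a small sketch: parse the two timestamps quoted above and confirm that the image predates the failures. The year of the 5/19 regression is not stated in the summary, so 2021 is an assumption here, and `parse_console_time` is a hypothetical helper that only handles whole-hour UTC offsets.

```python
from datetime import datetime, timedelta, timezone


def parse_console_time(text):
    """Parse a timestamp such as 'April 9, 2021 at 7:18:01 AM UTC-7'.

    Hypothetical helper: only whole-hour offsets like 'UTC-7' are handled.
    """
    stamp, _, offset = text.rpartition(" UTC")
    tz = timezone(timedelta(hours=int(offset)))
    naive = datetime.strptime(stamp, "%B %d, %Y at %I:%M:%S %p")
    return naive.replace(tzinfo=tz)


# Timestamps quoted in the comment above.
created = parse_console_time("April 9, 2021 at 7:18:01 AM UTC-7")
uploaded = parse_console_time("April 9, 2021 at 7:20:08 AM UTC-7")

# Assumption: the failures that started on 5/19 were in 2021.
REGRESSION_START = datetime(2021, 5, 19, tzinfo=timezone.utc)

predates_regression = created < REGRESSION_START and uploaded < REGRESSION_START

if __name__ == "__main__":
    print("image predates regression:", predates_regression)
```

This only shows that the recorded timestamps are older than the failures; as noted above, it still depends on those timestamps being reliable.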