istio: Test stability regression
Running summary:
- Last prow update was 5/17 (https://github.com/istio/test-infra/commit/07e88ae5d09da7257ac863e3c5c052bbee754a84). No issues after this until 5/19.
- On 5/19 we updated to kind v0.11.0
- Around the night of 5/19 we started seeing elevated test failures. These were not isolated to a single failure or job; things were just generally not working. A lot of them were related to timeouts in api-server, etc.
- Assuming it was a regression in kind, we reverted kind v0.11.0 back to v0.10.0. The errors persisted.
- After analyzing the failures and the cluster, nothing stood out. Failures happened on ~all nodes, node resource utilization looked normal, no errors, etc.
- We updated docker to 20.10.6, no improvement
- We updated the build cluster’s k8s version. This was mostly to bounce all of the nodes. No improvement.
- We reverted to the exact docker image prior to kind v0.11.0 (rather than reverting the change in Dockerfile and rebuilding), same issues occurring
Aside from the kind v0.11.0 update, there were no known changes to any of our infrastructure.
The failures seem to have one thing in common: containerd.
- We have seen buildkit fail to connect to the local containerd socket (I think, anyhow) with
grpc: the client connection is closing
and
no active session for ssyocosy7viqzb53vwjkxf322: context deadline exceeded
- We have seen docker pushes to our local image registry fail (logs)
- We have seen containerd (in kind) fail to start the etcd container (logs)
- We have seen containerd (in kind) start the etcd container, but only after health check timeouts are hit (logs)
- We have seen the kind cluster docker container fail to start with port conflicts (logs)
I have also checked our internal prow instance (which is a completely distinct build cluster, etc.) and we do see the "grpc: the client connection is closing" errors there too; we don’t run kind at all for those.
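To compare how often each of these signatures shows up across jobs or clusters, a small tally script can help. The following is only a minimal sketch: the substrings are copied from the error messages quoted above, while the labels, the `count_signatures` helper, and the sample lines are hypothetical names made up for illustration, not part of any Istio or prow tooling.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the error messages quoted in this issue.
# The labels on the left are hypothetical names chosen for this sketch.
SIGNATURES = {
    "grpc_conn_closing": "grpc: the client connection is closing",
    "no_active_session": "no active session for",
    "deadline_exceeded": "context deadline exceeded",
}


def count_signatures(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each known error signature."""
    counts: Counter = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only; real use would read lines from job logs.
    sample = [
        "error: grpc: the client connection is closing",
        "no active session for ssyocosy7viqzb53vwjkxf322: context deadline exceeded",
        "ok: test passed",
    ]
    for label, n in sorted(count_signatures(sample).items()):
        print(f"{label}: {n}")
```

A single line can match more than one signature (the "no active session" message also contains "context deadline exceeded"), so the totals count signature hits rather than distinct failing lines.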
About this issue
- State: closed
- Created 3 years ago
- Comments: 38 (37 by maintainers)
Commits related to this issue
- Revert to the exact image we had before. A desperate attempt to fix https://github.com/istio/istio/issues/32985 (committed to howardjohn/test-infra by howardjohn 3 years ago)
- Revert to the exact image we had before (#3341). A desperate attempt to fix https://github.com/istio/istio/issues/32985 (committed to istio/test-infra by howardjohn 3 years ago)
- Skip release test. Desperate attempt to see if this improves https://github.com/istio/istio/issues/32985 I may be seeing things but it feels like there is a pattern: https://github.com/istio/is... (committed to howardjohn/test-infra by howardjohn 3 years ago)
- Skip release test (#3352). Desperate attempt to see if this improves https://github.com/istio/istio/issues/32985 I may be seeing things but it feels like there is a pattern: https://github.com/istio/... (committed to istio/test-infra by howardjohn 3 years ago)
- Set docker max parallelism. On recommendation from the docker team. For https://github.com/istio/istio/issues/32985 Had to bump to master of buildkit to do this (committed to howardjohn/istio by howardjohn 3 years ago)
@chaodaiG good call. Looking at the image:
Assuming that is reliable, I don’t think it’s changed.