kind: Very infrequent "failed to load image: exit status 1" errors
What happened:
We are occasionally seeing failures loading kind images. From a rough grep, I think this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures, new commits, etc. So this number likely means `kind load` is only failing about .003% of the time, I guess?
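For what it's worth, the back-of-envelope math behind that estimate (the ~1,000 loads per PR figure is a guess that assumes reruns roughly quintuple the ~200 loads a single run does):

```shell
# Back-of-envelope estimate of the per-load failure rate.
# Assumption (not measured): reruns bring each PR to ~1,000 load attempts.
pr_failure_rate=0.03   # ~3% of PRs hit the error at least once
loads_per_pr=1000
awk -v f="$pr_failure_rate" -v n="$loads_per_pr" \
  'BEGIN { printf "per-load failure rate ~= %.4f%%\n", 100 * (1 - (1 - f)^(1 / n)) }'
# prints: per-load failure rate ~= 0.0030%
```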
What you expected to happen:
Ideally, `kind load` would be more robust and not experience these errors. But if that is not feasible, it would be nice to have better logging/error messages, and possibly retries? I'm not too sure, as I don't yet understand the root cause.
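To sketch what I mean by retries (a hypothetical wrapper; `kind load docker-image` is the only real command here, and the retry count and sleep interval are arbitrary):

```shell
# Hypothetical retry wrapper around `kind load docker-image`.
# Retries up to max_attempts times, sleeping briefly between tries.
load_with_retry() {
  image="$1"
  max_attempts="${2:-3}"
  attempt=1
  while true; do
    if kind load docker-image "$image"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed to load $image after $max_attempts attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
}
```

Of course this just papers over whatever the root cause is, which is why better error messages would help too.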
How to reproduce it (as minimally and precisely as possible):
These are very intermittent failures, so I am not sure we can reproduce it easily. I can, however, point you to a bunch of logs:
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17113/e2e-mixer-no_auth_istio/472
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17179/e2e-simpleTests-distroless_istio/504
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17235/integ-pilot-k8s-tests_istio/600
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17275/integ-pilot-k8s-tests_istio/670
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17281/pilot-e2e-envoyv2-v1alpha3_istio/673
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17288/e2e-bookInfoTests-envoyv2-v1alpha3_istio/1246
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17288/e2e-simpleTests-cni_istio/667
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17322/e2e-simpleTests_istio/1035
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17330/integ-new-install-k8s-tests_istio/820
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17330/integ-telemetry-k8s-tests_istio/1237
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17355/integ-new-install-k8s-tests_istio/674
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17360/integ-mixer-k8s-tests_istio/725
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17364/integ-pilot-k8s-tests_istio/842
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17365/integ-security-k8s-tests_istio/1305
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17394/integ-pilot-k8s-tests_istio/998
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17478/integ-framework-k8s-tests_istio/1236
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17488/e2e-simpleTests-cni_istio/1181
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17488/integ-mixer-k8s-tests_istio/1188
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17529/e2e-bookInfoTests-envoyv2-v1alpha3_istio/1237
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17569/e2e-simpleTests_istio/1326
- https://prow.istio.io/view/gcs/istio-prow/pr-logs/pull/istio_istio/17608/e2e-bookInfoTests-trustdomain_istio/1412
We run everything with loglevel=debug and dump the kind logs in Artifacts, so hopefully everything is there. I didn't really look through the logs much, as I don't know what to look for, but I'm happy to dig deeper if pointed in the right direction.
Anything else we need to know?:
As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn't be too disappointed if nothing could be done here.
Environment:
- kind version: (use `kind version`): 0.5.1
- Kubernetes version: (use `kubectl version`): Running kind on GKE 1.13. I think all of these are spinning up 1.15 clusters
- Docker version: (use `docker info`): 18.06.1
- OS (e.g. from `/etc/os-release`): COS
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (15 by maintainers)
I switched to loading images one at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same, if not a little worse.
Sounds like the next step would be to try a newer version of kind. I was planning to wait for v0.6.0; would you suggest we just switch to master now?
Retry is a good option and seems a worthwhile tradeoff. I'll try that out, and if I see it again, I'll update to some commit on master. Thanks!