kind: Very infrequent "failed to load image: exit status 1" errors

What happened: We are occasionally seeing failures when loading images into kind. From a rough grep, I think this is impacting roughly 3% of our PRs. Note that each PR runs ~20 tests and loads ~10 images, and may be rerun many times due to test failures, new commits, etc. So this number actually means kind load is likely only failing about .003% of the time, I guess?

What you expected to happen:

Ideally, kind load would be more robust and not experience these errors. But if that is not feasible, it would be nice to have better logging/error messages, and possibly retries? I’m not too sure, as I don’t yet understand the root cause.

How to reproduce it (as minimally and precisely as possible):

These are very intermittent failures, so I am not sure we can reproduce them easily. I can, however, point you to a bunch of logs:

We run everything with loglevel=debug and dump the kind logs into our CI artifacts, so hopefully everything is there. I didn’t really look through the logs much as I don’t know what to look for, but I’m happy to dig deeper if pointed in the right direction.
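For context, a minimal sketch of how the logs end up in the artifacts; the artifacts path, cluster name, and test entrypoint below are placeholders rather than our actual setup:

```python
import subprocess
import sys

# Hypothetical names/paths; substitute whatever the CI job actually uses.
ARTIFACTS_DIR = "/workspace/_artifacts/kind-logs"
CLUSTER_NAME = "e2e"

def run_e2e() -> int:
    """Run the test suite against the kind cluster and return its exit code."""
    return subprocess.call(["make", "e2e-test"])  # placeholder test entrypoint

def export_kind_logs() -> None:
    """Dump kind's node and container logs into the CI artifacts directory."""
    subprocess.call(["kind", "export", "logs", ARTIFACTS_DIR, "--name", CLUSTER_NAME])

if __name__ == "__main__":
    rc = run_e2e()
    if rc != 0:
        export_kind_logs()
    sys.exit(rc)
```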

Anything else we need to know?:

As mentioned, this is pretty rare. A 99.999% pass rate is pretty solid, so I wouldn’t be too disappointed if nothing could be done here.

Environment:

  • kind version: (use kind version): 0.5.1
  • Kubernetes version: (use kubectl version): Running kind on GKE 1.13; I think all of these are spinning up 1.15 clusters
  • Docker version: (use docker info): 18.06.1
  • OS (e.g. from /etc/os-release): COS (Container-Optimized OS)

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

I switched to loading images one at a time instead of in parallel, and at the same time our testing load roughly doubled due to an incoming release. Load failures seem about the same, if not a little worse.

Sounds like the next step would be to try the newer versions of kind. I was planning to wait for v0.6.0, would you suggest we just switch to master now?

Retry is a good option and seems like a worthwhile tradeoff. I’ll try that out (sketch below), and if I see the failure again I’ll update to some commit on master. Thanks!
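For anyone else hitting this, a rough sketch of the retry wrapper we’re planning to use around `kind load docker-image`; the helper name and retry parameters are made up, not anything kind itself provides:

```python
import subprocess
import time

def kind_load_with_retry(image: str, cluster: str = "kind",
                         attempts: int = 3, delay_s: float = 5.0) -> None:
    """Load an image into a kind cluster, retrying on transient failures.

    Hypothetical helper: the attempt count and delay are arbitrary placeholders.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            ["kind", "load", "docker-image", image, "--name", cluster],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return
        print(f"kind load failed (attempt {attempt}/{attempts}): {result.stderr.strip()}")
        if attempt < attempts:
            time.sleep(delay_s)
    raise RuntimeError(f"failed to load {image} after {attempts} attempts")

# Example usage in a CI script:
# for img in images_to_load:
#     kind_load_with_retry(img, cluster="e2e")
```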