earthly: buildkit scheduler error: return leaving incoming open

We’ve been seeing the following error, mostly in CI but also on a developer’s machine. On the developer’s machine it reproduced several times and only went away after we changed the base image of the target (this seems somewhat random, but perhaps the team can make more sense of it).

Error: build target: build main: failed to solve: unlazy force execution: buildkit scheduler error: return leaving incoming open. Please report this with BUILDKIT_SCHEDULER_DEBUG=1

On CI, the job often fails several times in a row when it is simply re-run as is. I haven’t yet been able to work out why it comes and goes.

EDIT by the Earthly team: To get unstuck in this situation you can restart buildkit with docker stop earthly-buildkitd.
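For local setups, the restart workaround above could be scripted roughly as follows. This is a minimal sketch, not an official Earthly tool: `build_with_restart` and `RESTART_CMD` are hypothetical names, and the only commands taken from this thread are `docker stop earthly-buildkitd` and the error text being matched. It assumes buildkit is started again automatically on the next `earthly` invocation, as the logs further down suggest.

```shell
#!/bin/sh
# Hypothetical helper: run a build command; if it fails with the scheduler
# error, restart buildkit once and retry the same command.

# Command used to restart buildkit (taken from the workaround above).
RESTART_CMD=${RESTART_CMD:-docker stop earthly-buildkitd}

build_with_restart() {
  log=$(mktemp)
  status=0
  # Capture stdout and stderr so the error text can be inspected afterwards.
  "$@" >"$log" 2>&1 || status=$?
  cat "$log"
  if [ "$status" -ne 0 ] && grep -q 'buildkit scheduler error' "$log"; then
    echo "scheduler error detected; restarting buildkit and retrying once" >&2
    # Intentionally unquoted so that the command string is split into words.
    $RESTART_CMD || echo "restart command failed" >&2
    status=0
    "$@" || status=$?
  fi
  rm -f "$log"
  return "$status"
}
```

Usage would look like `build_with_restart earthly +foo`. Failures that do not mention the scheduler error are returned unchanged, so genuine build errors are not retried.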

EDIT 2 by the Earthly team: Possibly related issue: https://github.com/earthly/earthly/issues/2454 (inconsistent graph state error).

EDIT 3 by the Earthly team: This is now a top priority for us. We will provide weekly updates on this issue until it is resolved.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 7
  • Comments: 26 (6 by maintainers)

Most upvoted comments

I found a reproduction, and it’s embarrassingly simple!

VERSION 0.7

foo:
  FROM alpine:3.17
  COPY my-file .
  RUN true

Previously we thought this race condition was related to high CPU usage, but it’s not; it’s a combination of a COPY and the WAIT/END (wait-block) feature.

It can be reproduced fairly consistently with:

$ date > my-file; earthly-v0.7.19 +foo & earthly-v0.7.19 +foo
[1] 1261981
 Init 🚀
————————————————————————————————————————————————————————————————————————————————

 Init 🚀
————————————————————————————————————————————————————————————————————————————————

           buildkitd | Starting buildkit daemon as a docker container (earthly-buildkitd)...
           buildkitd | ...Done
           buildkitd | Found buildkit daemon as docker container (earthly-buildkitd)

Streaming logs to https://cloud.earthly.dev/builds/e6ce6c80-0d56-4eb1-a86d-e6de171732e8

 Build 🔧
————————————————————————————————————————————————————————————————————————————————

         alpine:3.17 | --> Load metadata alpine:3.17 linux/amd64

Streaming logs to https://cloud.earthly.dev/builds/3fb6298f-a6eb-4bbc-a687-4293ae62050f

 Build 🔧
————————————————————————————————————————————————————————————————————————————————

         alpine:3.17 | --> Load metadata alpine:3.17 linux/amd64
                +foo | --> FROM alpine:3.17
--> FROM alpine:3.17
[----------] 100% FROM alpine:3.17
                +foo | [----------] 100% FROM alpine:3.17
                +foo | --> COPY my-file .
                +foo | --> RUN true
              output | --> exporting outputs
View logs at https://cloud.earthly.dev/builds/e6ce6c80-0d56-4eb1-a86d-e6de171732e8
            _unknown *failed* | build target: build main: failed to solve: buildkit scheduler error: return leaving incoming open. Please report this with BUILDKIT_SCHEDULER_DEBUG=1
Error: build target: build main: failed to solve: buildkit scheduler error: return leaving incoming open. Please report this with BUILDKIT_SCHEDULER_DEBUG=1
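Because the failure is only "fairly consistent", the one-off command above can be wrapped in a loop that repeats the concurrent pair until one copy fails. The following is a rough sketch: `run_pair` and `repro_loop` are hypothetical helper names, and the only parts taken from the reproduction are the `cmd & cmd` pattern and refreshing `my-file` with `date`.

```shell
#!/bin/sh
# Sketch only: run_pair and repro_loop are hypothetical helper names.

# Run the same command twice concurrently, as in "cmd & cmd".
# Succeeds only if both copies exit with status 0.
run_pair() {
  "$@" &
  bg_pid=$!
  fg_status=0
  "$@" || fg_status=$?
  bg_status=0
  wait "$bg_pid" || bg_status=$?
  [ "$fg_status" -eq 0 ] && [ "$bg_status" -eq 0 ]
}

# Repeat the concurrent pair up to $1 times. my-file is rewritten before each
# attempt so that the COPY step sees new content (date has one-second
# resolution, which assumes each attempt takes longer than a second).
repro_loop() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    date > my-file
    if ! run_pair "$@"; then
      echo "failure on attempt $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $attempts attempts"
}
```

For example, `repro_loop 10 earthly-v0.7.19 +foo` would run up to ten concurrent pairs in the directory containing the Earthfile and stop at the first attempt where either copy exits non-zero.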

I have tested a similar Dockerfile with buildx, and it does not trigger the bug.

It appears that this bug was introduced with the --wait-block feature (even if the WAIT / END commands are not used).

If the Earthfile’s version is changed to:

VERSION \
  --check-duplicate-images \
  --earthly-git-author-args \
  --earthly-locally-arg \
  --earthly-version-arg \
  --explicit-global \
  --new-platform \
  --no-tar-build-output \
  --save-artifact-keep-own \
  --shell-out-anywhere \
  --use-cache-command \
  --use-chmod \
  --use-copy-link \
  --use-host-command \
  --use-no-manifest-list \
  --use-pipelines \
  --use-project-secrets \
  #--wait-block \
  0.6

which is equivalent to VERSION 0.7 except for the wait-block feature, running multiple copies of earthly doesn’t appear to trigger the bug.

We will look more closely into how the wait block feature changes the buildkit scheduler code execution path.

In the meantime, if you are not using any WAIT/END blocks, you may want to use the above VERSION ..... 0.6 definition instead.

This has been released as of v0.7.21; please let us know if this occurs again.

A quick update: We’re still actively working on this. We have been pulling upstream changes into our buildkit fork, and have run into some cgroup-related yak shaving along the way.

This issue is now high-priority for the team, as it has been occurring noticeably more often lately.

The initial upstream merge was done in d919ffa6465b0f5ac09769e4efa8b6e077ab5990; however, the upstream fix in buildkit is still being worked on. When it gets merged, we’ll be able to pull it in more easily, since we’re much closer to moby/buildkit:master now.

This is a top priority. It seems to be an upstream buildkit bug, and we have a reproduction case (although it’s rather complex and non-deterministic). It’s hard to estimate when this will be fixed, but hopefully soon.

We’re currently working on pulling upstream changes into our fork, so that when the upstream issue is resolved it’ll be easier to pull the fix in.

We have discovered some data race warnings inside buildkit. Here are some of the source locations, for example:

$ docker logs earthly-dev-buildkitd 2>&1 | grep -A 2 'WARNING: DATA RACE' | grep 'github\.com/moby' | sort | uniq
  github.com/moby/buildkit/cache/contenthash.(*cacheManager).GetCacheContext()
  github.com/moby/buildkit/executor/resources.(*Sampler[...]).run()
  github.com/moby/buildkit/executor/resources.(*Sub[...]).Close()
  github.com/moby/buildkit/executor/resources/types.(*Sample).Timestamp()
  github.com/moby/buildkit/executor/resources/types.(*SysCPUStat).MarshalJSON()
  github.com/moby/buildkit/executor/resources/types.(*SysSample).Timestamp()
  github.com/moby/buildkit/solver.(*Job).Build()
  github.com/moby/buildkit/solver.(*Job).RegisterCompleteTime()
  github.com/moby/buildkit/solver.(*Solver).connectProgressFromState()
  github.com/moby/buildkit/util/network/cniprovider.(*cniNS).Close()
  github.com/moby/buildkit/util/network/cniprovider.(*cniNS).sample()
  github.com/moby/buildkit/util/progress.(*Progress).Meta()
  github.com/moby/buildkit/util/progress.(*progressWriter).WriteRawProgress()
  github.com/moby/buildkit/util/pull/pullprogress.trackProgress()

To produce these warnings, we had to manually enable the -race flag when compiling buildkitd. This prompted https://github.com/moby/buildkit/pull/3994, which should make it easier for us and the larger buildkit community to debug these race conditions.

It’s not clear whether these data races are related to this buildkit scheduler error, but it’s a promising lead (and fixing them will at the very least improve code quality).
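To see which locations are reported most often, the same pipeline can count occurrences instead of only de-duplicating them. This is a small sketch: `race_sites` is a hypothetical helper name, and it assumes the log has the same shape as in the command above, where a `github.com/moby` frame appears within two lines of each `WARNING: DATA RACE` header.

```shell
#!/bin/sh
# Hypothetical helper: read buildkitd logs on stdin and print each
# github.com/moby location found near a data race warning, with a count,
# most frequent first.
race_sites() {
  grep -A 2 'WARNING: DATA RACE' \
    | grep 'github\.com/moby' \
    | sed 's/^[[:space:]]*//' \
    | sort \
    | uniq -c \
    | sort -rn
}
```

It could be fed from the same source as before, for example `docker logs earthly-dev-buildkitd 2>&1 | race_sites`. Like the original command, it only sees the first frame or two after each warning header, so it undercounts races whose buildkit frames appear deeper in the stack.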

We are seeing this on our CI too, but docker stop is not a viable solution for us because we are using the earthly remote builder approach. We therefore need a fix, or another workaround, to avoid the error.