earthly: buildkit scheduler error: return leaving incoming open
We’ve been seeing the following error mostly in CI, but we’ve also seen it on a developer’s machine. There, the issue reproduced several times and only went away after changing the base image of the target (this seems somewhat random, but perhaps the team can make more sense of it).
Error: build target: build main: failed to solve: unlazy force execution: buildkit scheduler error: return leaving incoming open. Please report this with BUILDKIT_SCHEDULER_DEBUG=1
On CI, the job often fails multiple times in a row if you just re-run it as is. I haven’t really been able to make sense of why it comes and goes yet.
EDIT by the Earthly team: To get unstuck in this situation you can restart buildkit with docker stop earthly-buildkitd.
EDIT 2 by the Earthly team: Possibly related issue: https://github.com/earthly/earthly/issues/2454 (inconsistent graph state error).
EDIT 3 by the Earthly team: This is now a top priority for us. We will provide weekly updates on this issue until it is resolved.
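The restart workaround from the first EDIT above could be automated on a local or self-hosted runner. The following is only a minimal sketch, not something provided by Earthly: it assumes the earthly output has been captured and is piped in on stdin, it matches on the substring "buildkit scheduler error" taken from the message quoted above, and the helper names isSchedulerError and stopBuildkit are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"
)

// schedulerErrorMarker is a substring of the error message quoted in this issue.
const schedulerErrorMarker = "buildkit scheduler error"

// isSchedulerError reports whether captured build output contains the scheduler error.
func isSchedulerError(output string) bool {
	return strings.Contains(output, schedulerErrorMarker)
}

// stopBuildkit runs `docker stop earthly-buildkitd`, the workaround suggested in the EDIT above.
func stopBuildkit() error {
	return exec.Command("docker", "stop", "earthly-buildkitd").Run()
}

func main() {
	// Read the captured earthly output from stdin.
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading build output:", err)
		os.Exit(1)
	}
	// Nothing to do unless the scheduler error appears in the output.
	if !isSchedulerError(string(data)) {
		return
	}
	fmt.Fprintln(os.Stderr, "scheduler error detected; stopping earthly-buildkitd")
	if err := stopBuildkit(); err != nil {
		fmt.Fprintln(os.Stderr, "docker stop failed:", err)
		os.Exit(1)
	}
}
```

One possible usage, with a hypothetical target name and log file: run earthly +build 2>&1 | tee build.log, then go run . < build.log, and re-run the build afterwards. This only applies where the buildkit container runs on the same Docker host as the build.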
About this issue
- State: closed
- Created a year ago
- Reactions: 7
- Comments: 26 (6 by maintainers)
Commits related to this issue
- update buildkit with double-merged-edge fix cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from https://github.com/moby/buildkit/pull/4285 fixes https://github.com/earthly/earthly/issues/295... — committed to earthly/earthly by alexcb 8 months ago
- update buildkit with double-merged-edge fix (#3410) cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from https://github.com/moby/buildkit/pull/4285 fixes https://github.com/earthly/earthly... — committed to earthly/earthly by alexcb 8 months ago
I found a reproduction, and it’s embarrassingly simple!
Previously we thought this race condition was related to high-cpu usage, but it’s not – it’s a combination of a COPY and the wait-end feature.
It can be reproduced fairly consistently with:
I have tested a similar Dockerfile with buildx, which doesn’t contain the bug.
It appears that this bug was introduced with the --wait-block feature (even if the WAIT/END commands are not used).
If the Earthfile’s version is changed to:
which is equivalent to VERSION 0.7 except for the wait-block feature, running multiple copies of earthly doesn’t appear to trigger the bug.
We will look more closely into how the wait block feature changes the buildkit scheduler code execution path.
In the meantime, if you are not using any WAIT/END blocks, you may want to use the above VERSION ..... 0.6 definition instead.
This has been released as of v0.7.21; please let us know if this occurs again.
upstream issue: https://github.com/moby/buildkit/issues/4278
A quick update: we’re still actively working on this. We have been pulling upstream changes into our buildkit fork and have run into some cgroup-related yak shaving.
This issue is now high-priority for the team, as it has been occurring noticeably more often lately.
The initial upstream merge was done in d919ffa6465b0f5ac09769e4efa8b6e077ab5990; however the upstream fix under buildkit is still being worked on. When it gets merged, we’ll be able to pull it in more easily since we’re much closer to moby/buildkit:master now.
This is a top priority. It seems to be a buildkit upstream bug, and we have a reproduction case (although it’s rather complex and non-deterministic). It’s hard to come up with an estimate of when this will be fixed, but hopefully soon.
We’re currently working on pulling upstream changes into our fork, so that when the upstream issue is resolved, the fix will be easier to pull in.
We have discovered some data race warnings inside buildkit. Here are some source locations, for example:
In order to produce these warnings, we had to manually enable the -race flag during compilation of buildkitd; this has prompted https://github.com/moby/buildkit/pull/3994 to make it easier for both us (and the larger buildkit community) to debug these race conditions.
It’s not clear if these data races are related to this buildkit scheduler error, but it’s a promising lead (and will at the very least increase code quality).
We are seeing this on our CI too, but docker stop is not a viable solution for us because we are using the earthly remote builder approach. So, we need a fix or another workaround to avoid that.
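For readers unfamiliar with the -race flag mentioned above, here is a small, self-contained illustration of the kind of problem Go's race detector looks for. It is not taken from buildkit; the counter type and countConcurrently function are hypothetical. The version shown is the safe one: a mutex guards the shared integer. If the Lock/Unlock calls were removed, running the program with go run -race would typically print a "WARNING: DATA RACE" report pointing at the unsynchronized accesses.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical shared structure (not buildkit code).
type counter struct {
	mu sync.Mutex
	n  int
}

// inc increments the counter while holding the mutex, so concurrent
// callers never read and write n at the same time.
func (c *counter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// value returns the current count under the same mutex.
func (c *counter) value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// countConcurrently starts one goroutine per worker, each incrementing
// the shared counter once, and returns the total after all have finished.
func countConcurrently(workers int) int {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.inc()
		}()
	}
	wg.Wait()
	return c.value()
}

func main() {
	fmt.Println(countConcurrently(100))
}
```

The race detector only reports races it actually observes at run time, which is one reason intermittent bugs like the one in this issue can be hard to pin down.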