hcsshim: docker build freeze at exportLayer phase
This is follow-up of debug I did on https://github.com/docker/for-win/issues/3229
It seems that my issue is slightly different from that one actually.
Setup
- OS: Windows Server 2019(Headless)
- docker version: 19.03.2 (client and server)
How to reproduce
When building a large docker container:
FROM mcr.microsoft.com/windows/servercore
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN choco install msys2
Actually my setup is a bit more complicated, as I am building behind corporate proxies, and I have to customize both choco and msys2 package to work in my env. And… I have no way to work without the proxies.
I have the same issue with other large docker images (dockerizing Matlab is the first usecase I have been experiencing this with, but It is even less reproducible in an open-source collaboration environment)
Symptom
When trying to build this container, the docker build freezes just after finishing the “RUN choco install msys2” command.
After many tries, I had some times where the build actually finished pretty much instantly.
I have been trying to reprod this with a simpler setup (a dockerfile creating thousands of file), but was unable to do so.
So I don’t know exactly what triggers what happens to be a race condition
After doing some stack-traces, I observe that the code is stuck in https://github.com/microsoft/hcsshim/blob/bd9b2553245043e79a1037b13afb0091d970a93d/internal/wclayer/exportlayer.go#L74
os.RemoveAll(0xc000d1c000, 0x2a, 0x0, 0x0)
C:/.GOROOT/src/os/path.go:67 +0x3c
github.com/docker/docker/vendor/github.com/Microsoft/hcsshim/internal/wclayer.(*legacyLayerReaderWrapper).Close(0xc0000ba980, 0xc0000ba980, 0x2546fe0)
C:/go/src/github.com/docker/docker/vendor/github.com/Microsoft/hcsshim/internal/wclayer/exportlayer.go:74 +0x95
github.com/docker/docker/daemon/graphdriver/windows.(*Driver).exportLayer.func1.1(0x5f8, 0xc00078e000)
C:/go/src/github.com/docker/docker/daemon/graphdriver/windows/windows.go:672 +0x120
This code is using exportLayer syscall from the winfilter directory to the tmp directory.
Then when the tarfile has been produced, it will remove the tmp directory (like \\\\?\\C:\\ProgramData\\docker\\tmp\\hcs425012433) version of the layer.
Workaround
After a bit of back and forth, I did try docker-ci-zap.exe on the hcs425012433 folder, then the docker build command will instantly unfreeze.
So I hacked a new dockerd-dev.exe using following patch.
diff --git a/internal/wclayer/exportlayer.go b/internal/wclayer/exportlayer.go
index 0425b33..0753ff2 100644
--- a/internal/wclayer/exportlayer.go
+++ b/internal/wclayer/exportlayer.go
@@ -71,6 +71,10 @@ type legacyLayerReaderWrapper struct {
func (r *legacyLayerReaderWrapper) Close() error {
err := r.legacyLayerReader.Close()
+ // if the layer is not Destroyed at hcs level before removing
+ // we might enter in a race-condition for large containers
+ // which end-up in a hang of the os.RemoveAll() call
+ DestroyLayer(r.root)
os.RemoveAll(r.root)
return err
}
I have no idea if this is the right solution for this problem or this is rather an issue with the windows kernel.
I can submit that patch as a PR if requested
About this issue
- Original URL
- State: open
- Created 5 years ago
- Reactions: 3
- Comments: 22 (7 by maintainers)
Commits related to this issue
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to tardyp/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to tardyp/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Destroy the temporary export layer before removing it Fix: #696 Signed-off-by: Pierre Tardy <pierre.tardy@renault.com> — committed to TBBle/hcsshim by tardyp 5 years ago
- Merge pull request #1 from Ansys-DEE/fixing-devtools Use chocolatey to install VisualStudio VC build tools. Install msys2 and handle post-install hang: https://github.com/microsoft/hcsshim/issues/69... — committed to Ansys-DEE/bazel-windows-docker-container by mobigliani 7 months ago
I will say I 100% did not expect this to turn out to be an actual msys/cygwin bug; I expected to find msys triggering some other bug somewhere. Like, who just writes invalidly-unicode-named files to disk for no particular reason?
My guess is that they picked the Recycle Bin because one exists on each drive (so they can use rename to do the move) while
%TEMP%only exists on one drive; these are the only directories I can think of that will be cleaned up by system processes, unless you leverage something like the registry keys that can delete files on next boot.The directory isn’t really a problem (it’s weird, but okay); it’s the filename that just breaks random things in the world. Poking around, I noticed node.js also has an open bug report for being unable to handle files with invalid UTF-16 on Windows. I imagine anything cross-platform using UTF-8 to hold file paths internally will have exactly this problem.
Side-note: I suspect this is the most-trivial workaround:
I wonder if it’s worth requesting that the chocolatey msys2 installer maintainers attempt to delete files starting with
.<dc6d><dc73><dc79><dc73>from the current user’s recycle bin after running the “system update” steps.Anyway, barring a surprise fix in Go or Cygwin/MSYS, I think the best long-term fix/workaround will be https://github.com/microsoft/go-winio/pull/261. Potentially making the
os.RemoveAllwork like that PR on Windows (i.e. upstreaming it) would work too, but that has a longer delivery time and probably more inertia to overcome.It might be possible to prevail upon Cygwin/MSYS to just use
.cygand.msyseverywhere (like they do on SMB shares, which probably don’t suffer this sort of nonsense) but again, this has lead-time before users benefit, and no one seemed interested in fixing it five years ago when it was first raised. That also only fixes this case, although I don’t know if anyone’s ever reported this issue with something other than pacman in MSYS, so hopefully no one else in the world is sticking non-paired surrogate code-points into files we might try and delete from Go.tl;dr: It’s an underlying cygwin/msys bug when pacman.exe tries to delete itself during upgrade. Aaaaaaargh.
Yup, repro’d it and under Process Monitor, you can see Go trying to delete a file repeatedly with a faulty name from the recycle bin (in
C:$Recycle.Bin\S-1-5-93-2-1, i.e.ContainerAdministrator’s recycle bin).vim says it’s
.<dc6d><dc73><dc79><dc73>000100000000061201256577a4f815a1. That’s the same prefix as the example from https://github.com/microsoft/go-winio/pull/261, but that might be a recycle-bin thing? In my case, it’s the pacman.exe. I wonder if that’s always the case?Windows 11, Docker Desktop 4.16.1, fresh pull of the base image.
Which actually means this is another symptom of https://github.com/moby/moby/issues/32838#issuecomment-354831779 (see https://github.com/microsoft/hcsshim/blob/144a58796678fc3c0360f41bb636516d92c297c2/internal/wclayer/legacy.go#L113-L120); we’re already skipping such files in the exported image as they caused a direct failure, but their mere presence in the staging directory is triggering a latent Go stdlib bug (infinite loop) during staging area cleanup.
And that previous occurrence was also with msys2, but not chocolatey. So I guess the msys2 first
pacman -Syucall is putting invalidly-named (or at least, Go-cannot-handle-them-ly-named) files in the recycle bin.I wonder if a simpler fix (i.e. in hcsshim) would be to not include the Recycle bin in the staging directory, or manually remove it from the staging dir before export, since we don’t export it anyway? (Not sure that’s possible though, without messing with timestamps etc. Maybe targeted
hcsshim.DestroyLayerfor the recycle bin before export would replace the above fix too.)It might be worth trying to work out what msys2 is doing that creates this file in the first place, but that’s fixing the known-trigger, not the underlying cause…
Oh, FFS. It’s an msys2/cygwin-specific bug: https://github.com/docker/for-win/issues/8910 and https://github.com/msys2/MSYS2-packages/issues/1253. It’s putting deliberately-invalidly-named files directly into the Recycle bin’s folder.
MSYS2 implementation at https://github.com/msys2/msys2-runtime/blob/9acb704afe9d9d8a090f2391e2bb88add63997c6/winsup/cygwin/syscalls.cc#L299-L322, Cygwin (different prefix) at https://github.com/cygwin/cygwin/blob/a97fbb58e209586f92381e90aaec8fb211287938/winsup/cygwin/syscalls.cc#L299-L322.
I am cross about this, it was first reported (AFAICT) to msys2 just short of five years ago, and was marked as “expected behaviour”.
Supposedly it shouldn’t be happening on Windows 10 1809 or later under Cygwin 3.0 (per https://github.com/msys2/MSYS2-packages/issues/1253#issuecomment-495739230 and https://cygwin.com/cygwin-ug-net/ov-new.html#ov-new3.0) but clearly it is. Poking through the cygwin code, at a minimum running executables end-up falling back to
try_to_bin, which is why I’m only seeing pacman.exe in that directory, as it probably tried to delete itself during an upgrade.Technically, a Go bug that it hangs when it sees this file as part of
os.RemoveAll, but PowerShell also hits anSystem.Text.EncoderFallbackExceptionexception trying to interact with it on the command-line, so I can’t be too harsh.Elevated
cmdcan happily delete that one file, and unstick the export.The fix has indeed landed in Go 1.21.0, so it’s up to hcsshim downstreams to update to Go 1.21 to apply the fix this issue.
Good news: https://github.com/golang/go/issues/59971 has a proposed Go-level fix, which I tested with the following Dockerfile:
(The
ECHOis there so I can invalidate just that line from the build cache)To test, I built dockerd.exe from
v20.10.23(i.e. the version used by the Docker Desktop version I happened to have installed) with both Go 1.20.4 (latest release) and Go with the relevant fix. This produced the followingdocker versioninfo:Go 1.20.4:
gotip download 493036:They’re tagged unsupported because these were local builds (not inside the build container) and because I had to hack on a couple of build scripts to work with PowerShell 7 and to use
gotip.I slipped each build under Docker Desktop as
C:\Program Files\Docker\Docker\resources\dockerd.exebecause I’m too lazy to configure a dockerd.exe running myself.Anyway, the Go 1.20.4 build reproduced the hang. I didn’t check Process Monitor (I was cooking noodles, so it had like 10 minutes to unstick itself, and didn’t), but removing the problematic files from
C:\ProgramData\Docker\tmp\hcs4184426182\Files\$Recycle.Bin\S-1-5-93-2-1\immediately unhung it.The
gotip download 493036build did not hang and completed without complaint.So yeah, not actually a hcsshim bug after all, just something likely to affect anyone calling
hcsshim.NewLayerReaderorhcsshim/pkg/ociwclayer.ExportLayerToTarwhich calls it internally. AFAICT from GitHub Search that’s just hcsshim itself, Docker, containerd, https://github.com/containers/storage, https://github.com/cloudfoundry/diff-exporter, and forks thereof. As such, I think https://github.com/microsoft/Windows-Containers/issues/213 will be the better bug to track the fix migrating out into the ecosystem, once we know what Go versions it’ll actually be available in.Oh, blast. I just realised I could have tested the fix with just
wclayer exportrather than a whole build of Docker. >_<OMG - I’ve had to workaround this issue for years by manually firing up another shell beside the process running docker build and deleting crap from its /c/ProgramData/Docker/tmp directory.
This trick of doing && del /f /s /q “C:$Recycle.Bin” at the end of your msys install (whether by choco or mysys installer directly) seems to work like a charm!!!
It’s possible both things are true, and there’s a badly-named temporary file that gets cleaned up while you sleep, so doesn’t end up in the export directory.
tl;dr: I think the underlying problem is an invalid-UTF16-named file triggering an unterminating loop in
os.RemoveAll, when it assumesNo such file(caused by failing to roundtrip the invalid UTF16 to string and back) meansNo such file(caused by the file already having been deleted).https://github.com/golang/go/issues/36375 is the upstream bug for “
No such filefor the wrong reasons causesos.RemoveAllto hang”.I’m not sure, but I wonder if the supposed invalid filename causes this loop in
os.RemoveAllto keep attempting to delete the same invalid file, not break out of the loop, and then try again. I’m not certain that’s possible though, unless there’s more than 1024 undeletable files/directories in the one place, in which case this block will never be entered, and the loop will repeatedly fail to delete the same 1024 files endlessly.So this actually might be a Golang stdlib bug when trying to delete large, undeleteable directories…
Edit: But I couldn’t reproduce a hang with 1030 locked files in a directory. Looking more closely, I can’t see a way to make this loop endless, unless you can have
os.RemovereturnnilorIsNotExistfor a file that does exist, in which case I think this code will loop endlessly given a directory with 1024 or more files after that file in the directory list.Which actually does point back at an “invalid filename” being able to cause this, I guess? Not sure how to reproduce this.
Looking into this, if
fixLongPathdoesn’t do its job for whatever reason, it’s possible that the DeleteFileW call can returnERROR_FILE_NOT_FOUNDif given a long file name in non-universal format.fixLongPathhas a bunch of early-out cases (although none of them are obviously applicable here), so I wonder if it’s just that the MSYS installer has a very deep directory tree (again, possibly in some temporary state), and when combined withos.MkdirTemp("", "hcs")it causesfixLongPathto early-out, and breaksDeleteFileW.https://github.com/golang/go/issues/36375 is the hang I’m talking about, in this case triggered by long paths, and best-solved with Go 1.17 and “modern” Windows (1607 or later, so all the versions we care about are “modern”), and possibly a linker flag for the Go binary (unless it’s set by-default?).
I also just noticed that the PR contains an example invalid filename, so maybe such a file does break
DeleteFileWinto returningERROR_FILE_NOT_FOUND, in which case the same hang seen in https://github.com/golang/go/issues/36375 would occur.Hmm, on consideration, I suspect that if the filename is invalid UTF-16, then whatever
stringvalue comes out offd.Readdirnameswon’t match the file on disk, producing theERROR_FILE_NOT_FOUNDresult and hence triggering the hang. So nothing to do withfixLongPath.Based on https://github.com/microsoft/Windows-Containers/issues/213#issuecomment-1300749036, it looks like the underlying issue here is actually a Go issue, in that
os.RemoveAllhangs when it hits something in the msys2 install, probably an invalid filename.So it should be possible to reproduce this hang with a small Go binary that just calls
os.RemoveAllon an msys2 install, I think, outside of containers.The proposed fix in that comment is replacing
os.RemoveAllwith a handcodedos.RemoveAll-alike which doesn’t get stuck, which is roughly what theDestroyLayerworkaround in the initial commit (and using docker-ci-zap which is a small wrapper aroundDestroyLayer) is doing, but way down in the OS level, and mostly by conincidence that it doesn’t confirm that you’ve actually given it a WCOW container image layer to destroy.If that’s not the case, then perhaps there is a problem in the layer-export flow that is creating an invalid filename in the temporary staging area, and that would need to be fixed. However, since apparently the export works when you use docker-ci-zap/
DestroyLayer, I assume that’s not the case.Please, please fix this…