hcsshim: docker build freeze at exportLayer phase

This is a follow-up to the debugging I did on https://github.com/docker/for-win/issues/3229

It seems that my issue is actually slightly different from that one.

Setup

  • OS: Windows Server 2019 (headless)
  • docker version: 19.03.2 (client and server)

How to reproduce

When building a large docker container:

FROM mcr.microsoft.com/windows/servercore
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))" 
RUN choco install msys2

Actually my setup is a bit more complicated: I am building behind corporate proxies, and I have to customize both the choco and msys2 packages to work in my environment. And… I have no way to work without the proxies.

I have the same issue with other large docker images (dockerizing Matlab is the first use case I experienced this with, but it is even less reproducible in an open-source collaboration environment).

Symptom

When trying to build this container, the docker build freezes just after finishing the “RUN choco install msys2” command.

After many tries, there were some runs where the build actually finished pretty much instantly.

I have been trying to reproduce this with a simpler setup (a Dockerfile creating thousands of files), but was unable to do so.

So I don't know exactly what triggers what appears to be a race condition.

After doing some stack-traces, I observe that the code is stuck in https://github.com/microsoft/hcsshim/blob/bd9b2553245043e79a1037b13afb0091d970a93d/internal/wclayer/exportlayer.go#L74

os.RemoveAll(0xc000d1c000, 0x2a, 0x0, 0x0)
        C:/.GOROOT/src/os/path.go:67 +0x3c
github.com/docker/docker/vendor/github.com/Microsoft/hcsshim/internal/wclayer.(*legacyLayerReaderWrapper).Close(0xc0000ba980, 0xc0000ba980, 0x2546fe0)
        C:/go/src/github.com/docker/docker/vendor/github.com/Microsoft/hcsshim/internal/wclayer/exportlayer.go:74 +0x95
github.com/docker/docker/daemon/graphdriver/windows.(*Driver).exportLayer.func1.1(0x5f8, 0xc00078e000)
        C:/go/src/github.com/docker/docker/daemon/graphdriver/windows/windows.go:672 +0x120

This code exports the layer from the windowsfilter directory to a tmp directory using the ExportLayer call. Then, once the tarfile has been produced, it removes the tmp-directory copy of the layer (e.g. \\?\C:\ProgramData\docker\tmp\hcs425012433).

Workaround

After a bit of back and forth, I tried docker-ci-zap.exe on the hcs425012433 folder, and the docker build command instantly unfroze.
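(docker-ci-zap is, as noted further down in this thread, essentially a thin wrapper around hcsshim's DestroyLayer, so the same unsticking can be done from a few lines of Go. Below is a minimal sketch of my own using the legacy top-level hcsshim API, with the folder name taken from the run above; the exact call may differ depending on your hcsshim version.)

// Minimal sketch (assumption: this mirrors what docker-ci-zap does) of
// destroying the stuck staging folder via the legacy top-level hcsshim API.
package main

import "github.com/Microsoft/hcsshim"

func main() {
	// HomeDir and id are the tmp directory and stuck folder name from the run above.
	info := hcsshim.DriverInfo{HomeDir: `C:\ProgramData\docker\tmp`}
	if err := hcsshim.DestroyLayer(info, "hcs425012433"); err != nil {
		panic(err)
	}
}

Like docker-ci-zap itself, this does no checking that the target really is a container layer, so only point it at the stuck hcsNNN folder.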

So I hacked together a new dockerd-dev.exe using the following patch.

diff --git a/internal/wclayer/exportlayer.go b/internal/wclayer/exportlayer.go
index 0425b33..0753ff2 100644
--- a/internal/wclayer/exportlayer.go
+++ b/internal/wclayer/exportlayer.go
@@ -71,6 +71,10 @@ type legacyLayerReaderWrapper struct {

 func (r *legacyLayerReaderWrapper) Close() error {
        err := r.legacyLayerReader.Close()
+       // if the layer is not Destroyed at hcs level before removing
+       // we might enter in a race-condition for large containers
+       // which end-up in a hang of the os.RemoveAll() call
+       DestroyLayer(r.root)
        os.RemoveAll(r.root)
        return err
 }

I have no idea whether this is the right solution for this problem, or whether it is rather an issue with the Windows kernel.

I can submit that patch as a PR if requested.

About this issue

  • State: open
  • Created 5 years ago
  • Reactions: 3
  • Comments: 22 (7 by maintainers)

Most upvoted comments

I will say I 100% did not expect this to turn out to be an actual msys/cygwin bug; I expected to find msys triggering some other bug somewhere. Like, who just writes invalidly-unicode-named files to disk for no particular reason?

My guess is that they picked the Recycle Bin because one exists on each drive (so they can use rename to do the move) while %TEMP% only exists on one drive; these are the only directories I can think of that will be cleaned up by system processes, unless you leverage something like the registry keys that can delete files on next boot.

The directory isn’t really a problem (it’s weird, but okay); it’s the filename that just breaks random things in the world. Poking around, I noticed node.js also has an open bug report for being unable to handle files with invalid UTF-16 on Windows. I imagine anything cross-platform using UTF-8 to hold file paths internally will have exactly this problem.

Side-note: I suspect this is the most-trivial workaround:

FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN choco install -y msys2 && del /f /s /q "C:\$Recycle.Bin"

I wonder if it’s worth requesting that the chocolatey msys2 installer maintainers attempt to delete files starting with .<dc6d><dc73><dc79><dc73> from the current user’s recycle bin after running the “system update” steps.

Anyway, barring a surprise fix in Go or Cygwin/MSYS, I think the best long-term fix/workaround will be https://github.com/microsoft/go-winio/pull/261. Potentially making the os.RemoveAll work like that PR on Windows (i.e. upstreaming it) would work too, but that has a longer delivery time and probably more inertia to overcome.
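For illustration, here is one way such a RemoveAll-alike can sidestep the roundtrip problem (this is my own sketch of the idea, not the contents of that PR): enumerate the directory with FindFirstFileW/FindNextFileW and delete each entry using the raw UTF-16 name the kernel hands back, so unpaired surrogates never pass through a Go string.

// Sketch of deleting directory entries by their raw UTF-16 names, so that
// invalid (unpaired-surrogate) names survive untouched. Not the go-winio code;
// just an illustration of the approach.
package main

import "golang.org/x/sys/windows"

// removeFilesRaw deletes the plain files directly inside dir. Subdirectory
// recursion (and RemoveDirectory) is left out for brevity.
func removeFilesRaw(dir string) error {
	pattern, err := windows.UTF16PtrFromString(dir + `\*`)
	if err != nil {
		return err
	}
	prefix, err := windows.UTF16FromString(dir + `\`)
	if err != nil {
		return err
	}
	prefix = prefix[:len(prefix)-1] // drop the NUL terminator

	var fd windows.Win32finddata
	h, err := windows.FindFirstFile(pattern, &fd)
	if err != nil {
		return err
	}
	defer windows.FindClose(h)

	for {
		// Take the NUL-terminated name as raw UTF-16; never convert it to a string.
		name := fd.FileName[:]
		n := 0
		for n < len(name) && name[n] != 0 {
			n++
		}
		name = name[:n]

		isDot := n == 1 && name[0] == '.'
		isDotDot := n == 2 && name[0] == '.' && name[1] == '.'
		isDir := fd.FileAttributes&windows.FILE_ATTRIBUTE_DIRECTORY != 0

		if !isDot && !isDotDot && !isDir {
			full := append(append(append([]uint16{}, prefix...), name...), 0)
			if err := windows.DeleteFile(&full[0]); err != nil {
				return err
			}
		}

		if err := windows.FindNextFile(h, &fd); err != nil {
			if err == windows.ERROR_NO_MORE_FILES {
				return nil
			}
			return err
		}
	}
}

func main() {
	// Hypothetical target: the recycle-bin folder seen elsewhere in this thread.
	if err := removeFilesRaw(`C:\$Recycle.Bin\S-1-5-93-2-1`); err != nil {
		panic(err)
	}
}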

It might be possible to prevail upon Cygwin/MSYS to just use .cyg and .msys everywhere (like they do on SMB shares, which probably don’t suffer this sort of nonsense) but again, this has lead-time before users benefit, and no one seemed interested in fixing it five years ago when it was first raised. That also only fixes this case, although I don’t know if anyone’s ever reported this issue with something other than pacman in MSYS, so hopefully no one else in the world is sticking non-paired surrogate code-points into files we might try and delete from Go.

tl;dr: It’s an underlying cygwin/msys bug when pacman.exe tries to delete itself during upgrade. Aaaaaaargh.


Yup, repro’d it, and under Process Monitor you can see Go repeatedly trying to delete a file with a faulty name from the recycle bin (in C:\$Recycle.Bin\S-1-5-93-2-1, i.e. ContainerAdministrator’s recycle bin).

[Process Monitor screenshot showing the repeated attempts to delete the badly-named file]

vim says it’s .<dc6d><dc73><dc79><dc73>000100000000061201256577a4f815a1. That’s the same prefix as the example from https://github.com/microsoft/go-winio/pull/261, but that might be a recycle-bin thing? In my case, it’s the pacman.exe. I wonder if that’s always the case?

FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))" 
RUN choco install -y msys2

Windows 11, Docker Desktop 4.16.1, fresh pull of the base image.

Which actually means this is another symptom of https://github.com/moby/moby/issues/32838#issuecomment-354831779 (see https://github.com/microsoft/hcsshim/blob/144a58796678fc3c0360f41bb636516d92c297c2/internal/wclayer/legacy.go#L113-L120); we’re already skipping such files in the exported image as they caused a direct failure, but their mere presence in the staging directory is triggering a latent Go stdlib bug (infinite loop) during staging area cleanup.

And that previous occurrence was also with msys2, but not chocolatey. So I guess msys2’s first pacman -Syu call is putting invalidly-named (or at least, Go-cannot-handle-them-ly-named) files in the recycle bin.

I wonder if a simpler fix (i.e. in hcsshim) would be to not include the Recycle bin in the staging directory, or manually remove it from the staging dir before export, since we don’t export it anyway? (Not sure that’s possible though, without messing with timestamps etc. Maybe targeted hcsshim.DestroyLayer for the recycle bin before export would replace the above fix too.)

It might be worth trying to work out what msys2 is doing that creates this file in the first place, but that’s fixing the known-trigger, not the underlying cause…

Oh, FFS. It’s an msys2/cygwin-specific bug: https://github.com/docker/for-win/issues/8910 and https://github.com/msys2/MSYS2-packages/issues/1253. It’s putting deliberately-invalidly-named files directly into the Recycle bin’s folder.

MSYS2 implementation at https://github.com/msys2/msys2-runtime/blob/9acb704afe9d9d8a090f2391e2bb88add63997c6/winsup/cygwin/syscalls.cc#L299-L322, Cygwin (different prefix) at https://github.com/cygwin/cygwin/blob/a97fbb58e209586f92381e90aaec8fb211287938/winsup/cygwin/syscalls.cc#L299-L322.

I am cross about this: it was first reported to msys2 (AFAICT) just short of five years ago, and was marked as “expected behaviour”.

Supposedly it shouldn’t be happening on Windows 10 1809 or later under Cygwin 3.0 (per https://github.com/msys2/MSYS2-packages/issues/1253#issuecomment-495739230 and https://cygwin.com/cygwin-ug-net/ov-new.html#ov-new3.0), but clearly it is. Poking through the cygwin code, at a minimum running executables end up falling back to try_to_bin, which is why I’m only seeing pacman.exe in that directory, as it probably tried to delete itself during an upgrade.

Technically, it’s a Go bug that it hangs when it sees this file as part of os.RemoveAll, but PowerShell also hits a System.Text.EncoderFallbackException trying to interact with it on the command line, so I can’t be too harsh.

Elevated cmd can happily delete that one file, and unstick the export.

The fix has indeed landed in Go 1.21.0, so it’s now up to hcsshim downstreams to update to Go 1.21 to pick up the fix for this issue.

Good news: https://github.com/golang/go/issues/59971 has a proposed Go-level fix, which I tested with the following Dockerfile:

FROM mcr.microsoft.com/windows/servercore:ltsc2022
RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
RUN ECHO "NOHANG" && choco install -y msys2

(The ECHO is there so I can invalidate just that line from the build cache)

To test, I built dockerd.exe from v20.10.23 (i.e. the version used by the Docker Desktop version I happened to have installed) with both Go 1.20.4 (latest release) and Go with the relevant fix. This produced the following docker version info:

Go 1.20.4:

Server: Docker Desktop 4.17.1 (101757)
 Engine:
  Version:          0.0.0-dev
  API version:      1.41 (minimum version 1.24)
  Go version:       go1.20.4
  Git commit:       6051f14291-unsupported
  Built:            05/06/2023 05:02:47
  OS/Arch:          windows/amd64
  Experimental:     true

gotip download 493036:

Server: Docker Desktop 4.17.1 (101757)
 Engine:
  Version:          0.0.0-dev
  API version:      1.41 (minimum version 1.24)
  Go version:       devel go1.21-373489978b Fri May 5 18:29:34 2023 +0200
  Git commit:       6051f14291-unsupported
  Built:            05/06/2023 05:00:53
  OS/Arch:          windows/amd64
  Experimental:     true

They’re tagged unsupported because these were local builds (not inside the build container) and because I had to hack on a couple of build scripts to work with PowerShell 7 and to use gotip.

I slipped each build under Docker Desktop as C:\Program Files\Docker\Docker\resources\dockerd.exe because I’m too lazy to configure and run a dockerd.exe myself.

Anyway, the Go 1.20.4 build reproduced the hang. I didn’t check Process Monitor (I was cooking noodles, so it had like 10 minutes to unstick itself, and didn’t), but removing the problematic files from C:\ProgramData\Docker\tmp\hcs4184426182\Files\$Recycle.Bin\S-1-5-93-2-1\ immediately unhung it.

C:\Users\paulh\Documents\GoRemoveAll
> docker build -t msys120 .\context\
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM mcr.microsoft.com/windows/servercore:ltsc2022
 ---> d84fae21da7a
Step 2/3 : RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
 ---> Using cache
 ---> 256ff2405873
Step 3/3 : RUN ECHO "HANG" && choco install -y msys2
 ---> Running in db0445b3d5e8
"HANG"
Chocolatey v1.3.1
...
Chocolatey installed 3/3 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
Removing intermediate container db0445b3d5e8
 ---> 819c033a0f2a
Successfully built 819c033a0f2a
Successfully tagged msys120:latest

The gotip download 493036 build did not hang and completed without complaint.

C:\Users\paulh\Documents\GoRemoveAll
> docker build -t msys493036 .\context\
Sending build context to Docker daemon  2.048kB
Step 1/3 : FROM mcr.microsoft.com/windows/servercore:ltsc2022
 ---> d84fae21da7a
Step 2/3 : RUN @powershell -NoProfile -ExecutionPolicy unrestricted -Command "(iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1')))"
 ---> Using cache
 ---> 256ff2405873
Step 3/3 : RUN ECHO "NOHANG" && choco install -y msys2
 ---> Running in bd2fecc8b6f7
"NOHANG"
Chocolatey v1.3.1
...
Chocolatey installed 3/3 packages.
 See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
Removing intermediate container bd2fecc8b6f7
 ---> a1bd6742367a
Successfully built a1bd6742367a
Successfully tagged msys493036:latest

So yeah, not actually an hcsshim bug after all, just something likely to affect anyone calling hcsshim.NewLayerReader or hcsshim/pkg/ociwclayer.ExportLayerToTar (which calls it internally). AFAICT from GitHub Search that’s just hcsshim itself, Docker, containerd, https://github.com/containers/storage, https://github.com/cloudfoundry/diff-exporter, and forks thereof. As such, I think https://github.com/microsoft/Windows-Containers/issues/213 will be the better bug to track the fix migrating out into the ecosystem, once we know what Go versions it’ll actually be available in.

Oh, blast. I just realised I could have tested the fix with just wclayer export rather than a whole build of Docker. >_<

OMG - I’ve had to work around this issue for years by manually firing up another shell beside the process running docker build and deleting crap from its /c/ProgramData/Docker/tmp directory.

This trick of adding && del /f /s /q "C:\$Recycle.Bin" at the end of your msys2 install (whether via choco or the msys2 installer directly) seems to work like a charm!!!

It’s possible both things are true, and there’s a badly-named temporary file that gets cleaned up while you sleep, so doesn’t end up in the export directory.


tl;dr: I think the underlying problem is an invalid-UTF16-named file triggering an unterminating loop in os.RemoveAll, when it assumes No such file (caused by failing to roundtrip the invalid UTF16 to string and back) means No such file (caused by the file already having been deleted).

https://github.com/golang/go/issues/36375 is the upstream bug for “No such file for the wrong reasons causes os.RemoveAll to hang”.


I’m not sure, but I wonder if the supposed invalid filename causes this loop in os.RemoveAll to keep attempting to delete the same invalid file, not break out of the loop, and then try again. I’m not certain that’s possible though, unless there’s more than 1024 undeletable files/directories in the one place, in which case this block will never be entered, and the loop will repeatedly fail to delete the same 1024 files endlessly.

So this actually might be a Golang stdlib bug when trying to delete large, undeletable directories…

Edit: But I couldn’t reproduce a hang with 1030 locked files in a directory. Looking more closely, I can’t see a way to make this loop endless, unless you can have os.Remove return nil or IsNotExist for a file that does exist, in which case I think this code will loop endlessly given a directory with 1024 or more files after that file in the directory list.

Which actually does point back at an “invalid filename” being able to cause this, I guess? Not sure how to reproduce this.

Looking into this, if fixLongPath doesn’t do its job for whatever reason, it’s possible that the DeleteFileW call can return ERROR_FILE_NOT_FOUND if given a long file name in non-universal format.

fixLongPath has a bunch of early-out cases (although none of them are obviously applicable here), so I wonder if it’s just that the MSYS installer has a very deep directory tree (again, possibly in some temporary state), and when combined with os.MkdirTemp("", "hcs") it causes fixLongPath to early-out, and breaks DeleteFileW.

https://github.com/golang/go/issues/36375 is the hang I’m talking about, in this case triggered by long paths, and best-solved with Go 1.17 and “modern” Windows (1607 or later, so all the versions we care about are “modern”), and possibly a linker flag for the Go binary (unless it’s set by-default?).

I also just noticed that the PR contains an example invalid filename, so maybe such a file does break DeleteFileW into returning ERROR_FILE_NOT_FOUND, in which case the same hang seen in https://github.com/golang/go/issues/36375 would occur.

Hmm, on consideration, I suspect that if the filename is invalid UTF-16, then whatever string value comes out of fd.Readdirnames won’t match the file on disk, producing the ERROR_FILE_NOT_FOUND result and hence triggering the hang. So nothing to do with fixLongPath.
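To make that concrete, here is a simplified, abbreviated paraphrase of the directory loop inside os.RemoveAll (the non-unlinkat path used on Windows); it is not the verbatim stdlib source, but it shows why a delete that falsely reports “not found” never lets the loop terminate.

// Simplified, abbreviated paraphrase of os.RemoveAll's directory loop
// (removeall_noat.go); not the verbatim stdlib code.
package removeallsketch

import (
	"os"
	"path/filepath"
)

func removeAllDirLoop(path string) error {
	for {
		fd, err := os.Open(path)
		if err != nil {
			return err
		}
		names, _ := fd.Readdirnames(1024)
		fd.Close()

		for _, name := range names {
			// For the invalid-UTF-16 entry, `name` no longer matches what is
			// on disk, so deleting it fails with ERROR_FILE_NOT_FOUND, which
			// RemoveAll treats as "already gone", i.e. success.
			os.RemoveAll(filepath.Join(path, name))
		}

		if len(names) < 1024 {
			// The badly-named file is still present, so removing the
			// directory itself fails with "directory not empty"...
			if err := os.Remove(path); err == nil || os.IsNotExist(err) {
				return nil
			}
			// ...and since no child reported an error, control falls through
			// and the loop re-reads the same entry, forever.
		}
	}
}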

Based on https://github.com/microsoft/Windows-Containers/issues/213#issuecomment-1300749036, it looks like the underlying issue here is actually a Go issue, in that os.RemoveAll hangs when it hits something in the msys2 install, probably an invalid filename.

So it should be possible to reproduce this hang with a small Go binary that just calls os.RemoveAll on an msys2 install, I think, outside of containers.
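Something along these lines should do it; this is an untested sketch that manufactures a file with an unpaired-surrogate name (the same kind of name seen in the recycle bin above) rather than needing a full msys2 install:

// Untested sketch: create a file whose name contains unpaired UTF-16
// surrogates, then call os.RemoveAll on its parent. On Go versions without
// the golang/go#59971 fix this is expected to hang; on fixed versions it
// should return promptly.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/windows"
)

func main() {
	dir, err := os.MkdirTemp("", "badname")
	if err != nil {
		panic(err)
	}

	// Build the UTF-16 path by hand, since a Go string cannot carry unpaired
	// surrogates. The surrogate values mirror the ".<dc6d><dc73><dc79><dc73>"
	// prefix MSYS2's try_to_bin uses.
	prefix, err := windows.UTF16FromString(dir + `\.`)
	if err != nil {
		panic(err)
	}
	prefix = prefix[:len(prefix)-1] // drop the NUL terminator
	name := append(prefix, 0xdc6d, 0xdc73, 0xdc79, 0xdc73, 0)

	h, err := windows.CreateFile(&name[0], windows.GENERIC_WRITE, 0, nil,
		windows.CREATE_ALWAYS, windows.FILE_ATTRIBUTE_NORMAL, 0)
	if err != nil {
		panic(err)
	}
	windows.CloseHandle(h)

	fmt.Println("calling os.RemoveAll on", dir)
	err = os.RemoveAll(dir) // hangs on affected Go versions
	fmt.Println("os.RemoveAll returned:", err)
}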

The proposed fix in that comment replaces os.RemoveAll with a hand-coded os.RemoveAll-alike that doesn’t get stuck, which is roughly what the DestroyLayer workaround in the initial commit (and docker-ci-zap, which is a small wrapper around DestroyLayer) is doing, just way down at the OS level, and it’s mostly by coincidence that it doesn’t confirm you’ve actually given it a WCOW container image layer to destroy.

If that’s not the case, then perhaps there is a problem in the layer-export flow that is creating an invalid filename in the temporary staging area, and that would need to be fixed. However, since apparently the export works when you use docker-ci-zap/DestroyLayer, I assume that’s not the case.

Please, please fix this…