runtime: Mutex.TryOpenExisting intermittently throws IOException

Description

After introducing the .NET 7 RC1 SDK into Runtime CI, we started seeing intermittent exceptions of the form System.IO.IOException: Connection timed out : 'Global\msbuild-server-launch-{45_random-chars}' in Runtime and Arcade, on Linux CI agents and in Docker-based Linux builds.

Reproduction Steps

I tried hard to create a minimal repro, without success. I believe the easiest repro is to rerun some of the CI builds where it was seen:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=31675&view=logs&j=3fe1f0d5-61d6-5e8f-eead-4d3bcfb9dfc3&t=380e8ab7-dd79-5cad-d265-1eca160e9b82&s=526c4a30-42a9-575e-a58b-243c7c515350

https://dev.azure.com/dnceng-public/public/_build/results?buildId=41146&view=logs&j=190ad6c8-5950-568c-cadd-f2dfb7d5a79f&t=c0f6fdc1-ac5d-583c-8ae1-a18de0846552

Expected behavior

new Mutex(initiallyOwned: true, name: "Global\UniqueName", out bool createdNew) and Mutex.TryOpenExisting should never intermittently throw IOException.

Actual behavior

new Mutex(initiallyOwned: true, name: "Global\UniqueName", out bool createdNew) and Mutex.TryOpenExisting sometimes throw:

System.IO.IOException: Connection timed out : 'Global\msbuild-server-launch-{45_random-chars}'
     at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
     at Microsoft.Build.Experimental.MSBuildClient.TryLaunchServer()
     at Microsoft.Build.Experimental.MSBuildClient.Execute(CancellationToken cancellationToken)
     at Microsoft.Build.CommandLine.MSBuildClientApp.Execute(String[] commandLine, String msbuildLocation, CancellationToken cancellationToken)
     at Microsoft.Build.CommandLine.MSBuildApp.Main(String[] args)
     at Microsoft.DotNet.Cli.Utils.MSBuildForwardingAppWithoutLogging.ExecuteInProc(String[] arguments)
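
For anyone trying to capture more detail when this happens, here is a minimal diagnostic sketch of the call pattern in the stack trace above. The class name NamedMutexProbe, the Probe helper, and the GUID-based name suffix are hypothetical and exist only for illustration; this is not the MSBuild code, and it is not a confirmed repro or workaround. It only records the HResult next to the message, since the message text alone may be misleading.

```csharp
using System;
using System.IO;
using System.Threading;

public static class NamedMutexProbe
{
    // Hypothetical helper: creates a named mutex the same way the issue describes
    // and returns a short description of what happened.
    public static string Probe(string name)
    {
        try
        {
            using var mutex = new Mutex(initiallyOwned: true, name: name, createdNew: out bool createdNew);

            // We only own the mutex if this call actually created it.
            if (createdNew)
            {
                mutex.ReleaseMutex();
            }

            return createdNew ? "created" : "opened existing";
        }
        catch (IOException ex)
        {
            // Only IOException is handled here; other exception types propagate.
            return $"IOException HResult=0x{ex.HResult:X8}: {ex.Message}";
        }
    }

    public static void Main()
    {
        // Assumption: a random suffix stands in for the {45_random-chars} part of the real name.
        string name = @"Global\msbuild-server-launch-" + Guid.NewGuid().ToString("N");
        Console.WriteLine($"{name}: {Probe(name)}");
    }
}
```

Note the verbatim string (@"...") for the name: in a regular C# string literal the backslash in "Global\UniqueName" would be parsed as the start of an escape sequence.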

Regression?

Unknown

Known Workarounds

Unknown.

Configuration

I have seen this mostly on:

  • OSX_x64
  • Linux_musl_x64
  • Linux_x64

Other information

No response

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

what was the reasoning behind it?

I guess nobody noticed that the changes in #70685 have an unintended interaction with the places where CoreLib uses the Win32 emulator PAL. It is very hard to keep in mind at all times that a few parts of CoreLib use the Win32 emulator PAL.

Kusto shows that this issue has happened about 50 times during the last 30 days, and it always occurs in the mono wasm legs. It seems the reason it occurs there is that for wasm, the compilation of each test happens during the run of the test (when its generated .sh script is called).

The error is ERROR_OPEN_FAILED from the Win32 PAL. The ERROR_OPEN_FAILED code is 110, and the Linux message for error code 110 is “Connection timed out”.
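
A small sketch of the numeric overlap described above: it asks the runtime for the message it associates with code 110 on the current OS. The names ErrorCode110Demo and DescribeNativeError are hypothetical, and using Win32Exception for the lookup is an assumption made for illustration; it is not necessarily the path the Mutex code takes to build its exception message.

```csharp
using System;
using System.ComponentModel;

public static class ErrorCode110Demo
{
    // Win32 ERROR_OPEN_FAILED and Linux ETIMEDOUT share the numeric value 110.
    public const int ErrorOpenFailed = 110;

    // Returns the message the runtime associates with a native error code
    // on the platform this runs on.
    public static string DescribeNativeError(int code) => new Win32Exception(code).Message;

    public static void Main()
    {
        Console.WriteLine($"Platform: {Environment.OSVersion.Platform}");
        Console.WriteLine($"Message for code {ErrorOpenFailed}: {DescribeNativeError(ErrorOpenFailed)}");
    }
}
```

Running this on Linux and on Windows should give different text for the same number, which is why a Win32-style code surfacing through a Unix message lookup can produce an unrelated-looking message.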

@kouvel This was thrown from here: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Private.CoreLib/src/System/Threading/Mutex.Windows.cs#L35 .

Note that the “Connection timed out” message may be bogus. This code is mixing and matching Windows and Unix error codes.

The error comes from msbuild. msbuild always runs on CoreCLR. This is not a Mono problem.