runtime: IOException running NuGet-Migrations during tests in dotnet CLI first run

This is affecting 6.0, 7.0 and 8.0.

In https://github.com/dotnet/runtime/pull/80510#issuecomment-1382160934 we saw a failure because some NuGet migration code failed to run during the dotnet CLI first run experience:

+ dotnet /datadisks/disk1/work/ADFC09B1/p/xunit/xunit.console.dll JIT/HardwareIntrinsics/JIT.HardwareIntrinsics.XUnitWrapper.dll -parallel collections -nocolor -noshadow -xml testResults.xml -trait TestGroup=JIT.HardwareIntrinsics.Arm.ArmBase
Microsoft.DotNet.XUnitConsoleRunner v2.5.0 (64-bit .NET 7.0.2)
  Discovering: JIT.HardwareIntrinsics.XUnitWrapper (method display = ClassAndMethod, method display options = None)
  Discovered:  JIT.HardwareIntrinsics.XUnitWrapper (found 6 of 362 test cases)
  Starting:    JIT.HardwareIntrinsics.XUnitWrapper (parallel test collections = on, max threads = 2)
    JIT/HardwareIntrinsics/Arm/ArmBase.Arm64/ArmBase.Arm64_ro/ArmBase.Arm64_ro.sh [FAIL]
      System.IO.IOException: The system cannot open the device or file specified. : 'NuGet-Migrations'
         at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
         at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
         at NuGet.Common.Migrations.MigrationRunner.Run()
         at Microsoft.DotNet.Configurer.DotnetFirstTimeUseConfigurer.Configure()
         at Microsoft.DotNet.Cli.Program.ConfigureDotNetForFirstTimeUse(IFirstTimeUseNoticeSentinel firstTimeUseNoticeSentinel, IAspNetCertificateSentinel aspNetCertificateSentinel, IFileSentinel toolPathSentinel, Boolean isDotnetBeingInvokedFromNativeInstaller, DotnetFirstRunConfiguration dotnetFirstRunConfiguration, IEnvironmentProvider environmentProvider, Dictionary`2 performanceMeasurements)
         at Microsoft.DotNet.Cli.Program.ProcessArgs(String[] args, TimeSpan startupTime, ITelemetry telemetryClient)
         at Microsoft.DotNet.Cli.Program.Main(String[] args)
      
      Return code:      1

We should look into disabling the first run experience via the DOTNET_SKIP_FIRST_TIME_EXPERIENCE=1 env var, it doesn’t make sense to run that for the tests.

Report

Build Definition Test Pull Request
368683 dotnet/runtime JIT.Methodical.eE.WorkItemExecution dotnet/runtime#90241
362596 dotnet/runtime JIT.Methodical.eE.WorkItemExecution dotnet/runtime#89003
361655 dotnet/runtime JIT.Math.WorkItemExecution dotnet/runtime#89905
361259 dotnet/runtime JIT.Generics.WorkItemExecution
361075 dotnet/runtime JIT.Regression.WorkItemExecution dotnet/runtime#89867
360768 dotnet/runtime JIT.Regression.JitBlue.WorkItemExecution dotnet/runtime#89815
360148 dotnet/runtime JIT.jit64.WorkItemExecution dotnet/runtime#89814
359522 dotnet/runtime JIT.Methodical.WorkItemExecution dotnet/runtime#89805
356421 dotnet/runtime JIT.Methodical.a-dA-D.WorkItemExecution
355235 dotnet/runtime JIT.Methodical.f-iF-I.WorkItemExecution dotnet/runtime#89009
343534 dotnet/runtime JIT.Methodical.WorkItemExecution
342037 dotnet/runtime JIT.Regression.JitBlue.WorkItemExecution dotnet/runtime#89008
341937 dotnet/runtime JIT.Methodical.a-dA-D.WorkItemExecution dotnet/runtime#89003
337982 dotnet/runtime JIT.HardwareIntrinsics.X86.Avx2.WorkItemExecution dotnet/runtime#88809
337742 dotnet/runtime JIT.Regression.CLR-x86-JIT.V1-M12-M13.WorkItemExecution dotnet/runtime#88769

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 1 15

Known Issue Error Message

Fill the error message using known issues guidance.

{
  "ErrorPattern": "The system cannot open the device or file specified. : ('|')NuGet-Migrations('|')",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: 🔎 Result validation: ⚠️ Validation could not be done without an Azure DevOps build URL on the issue. Please add it to the “Build: 🔎” line. Validation performed at: 6/28/2023 10:04:55 PM UTC

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 72 (65 by maintainers)

Commits related to this issue

Most upvoted comments

This is most likely an issue with access rights to the /tmp/.dotnet/shm/ subdirectories. I have seen a similar issue with docker recently. The problem happens if a Mutex is created in a login session of a user with privileged access (like root) and not closed either by mistake in the application or due to its crash. Then later on, in case a user with less privileged access logs in and the session id ends up being the same and the user runs nuget, it will attempt to create the session local mutex “NuGet-Migrations”. Such mutex creates a file in /tmp/.dotnet/shm/session{ID} directory where {ID} is the session id. But in this case, the directory already exists, and it is owned by a privileged user. So, the mutex file cannot be created. The session id is the process id of the login shell.

The current plan after discussion is to revert the previous changes that made session-local mutexes user-specific, and offer user-specific mutexes as a new feature in a future release.

We should look into disabling the first run experience via the DOTNET_SKIP_FIRST_TIME_EXPERIENCE=1 env var, it doesn’t make sense to run that for the tests.

fyi: DOTNET_SKIP_FIRST_TIME_EXPERIENCE=1 lookup was removed in https://github.com/dotnet/cli/commit/7efbf4770630fd370779d222ce9aba35ffe56100 and brought back as DOTNET_NOLOGO=1 in https://github.com/dotnet/sdk/commit/a41c9135e5e23440e106930c7ce949ded2ab3d26. Former is unsupported in all active releases of SDK.

$ find ~/.dotnet -name '*dotnetFirstUseSentinel' -exec rm {} \;
$ dotnet --help
# shows telemetry banner

$ find ~/.dotnet -name '*dotnetFirstUseSentinel' -exec rm {} \;
$ DOTNET_NOLOGO=1 dotnet --help
# no banner

@akoeplinger @steveisok Workarounds did not resolve the issue.

Yes but the xunit-based runtime tests also use the IncludeDotNetCli option in sendtohelixhelp.proj, that seems to be the common thing between WBT and runtime tests that e.g. the libraries tests don’t have (afaik we use corerun there).

Does anyone know if there’s a way to get access to the container image used to run a particular test in the CI?

If you look at the first line of the console log it will say something like this if the test was running in Docker:

Console log: 'System.Runtime.Serialization.Xml.Tests' from job 92398c09-eb7f-4de3-89b5-e52616c28233 (ubuntu.2004.amd64.open.rt) using docker image mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-20.04-helix-wasm-amd64 on a0000M2
running $HELIX_CORRELATION_PAYLOAD/scripts/3db7317da8c649a8ac1b933ed50769e0/execute.sh in /datadisks/disk1/work/A9A208FA/w/B1F30963/e max 1800 seconds

In that case you’d need to use the mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-20.04-helix-wasm-amd64 image.

@kouvel if you need help setting up an environment, feel free to ping me. There’s a test https://dev.azure.com/dnceng-public/public/_build/results?buildId=200752&view=ms.vss-test-web.build-test-results-tab&runId=3764716&resultId=103713&paneView=debug

“HelixJobId”: “08f6d861-acc6-4905-be8a-e8a8e2e6c62a”, “HelixWorkItemName”: “JIT.Math”

That fails directly on the VM. It’s easy enough to set up a VM with the same image and number of cores in the helix-repro lab. The image is ubuntu.1804.amd64 and it’s a 2 core D_v4 (D2a_v4 I believe).

an issue with access rights to the /tmp/.dotnet/shm/ subdirectories

So far, it’s not looking like that’s the issue. I got a repro in the CI with some logging and it looks like before the failing test runs, /tmp/.dotnet/ doesn’t exist, and /tmp/ appears to have the expected permissions.

However, it is possible that the current issue is caused by some other reason. If the issue is easily reproducible in the ci, it would make sense to create a testing PR with added listing of the directory tree and their permissions / users under the /tmp/.dotnet/shm and dumping the current session id. That would reveal if the issue is what I’ve mentioned or something different.

I think we’d want to put this setting into the HelixPreCommands command list so it runs before any of our tests try to execute dotnet. I think putting it into the test wrappers themselves is too late.

I think I can probably imagine some logic that would trigger this before running the actual tests, perhaps as a prolog of the generated legacy XUnit wrappers (for non-merged tests) and the new-style Roslyn-generated full wrappers for the merged tests but it’s unclear to me whether it would fix the problem or just make it less frequent. If it’s the latter I think we should rather try to get to the bottom of why this step is failing on the Helix machines and / or try to find consensus regarding its suppression for the Helix tests. It also doesn’t answer the concern raised by @lewing, namely that removal of the DOTNET_SKIP_FIRST_TIME_EXPERIENCE flag apparently wasn’t fully synchronized with other repos that still set it in their scripts so we still need to figure out whether we can bring it back or whether we need to audit all places that are using it and somehow bring them in sync with its removal; /cc @jkoritzinsky for additional thoughts.

This one is interesting but it’s not the root thing, it just reacts on the fact that the variable isn’t used anywhere:

https://github.com/dotnet/sdk/commit/27b6661ce31434055287d802929c5ba61c82ddb4

I continue looking what gutted this behavior.