sdk: Inconsistent NuGet push order in .NET release publishing

Describe the bug

When a new .NET service release is published, a large number of packages are pushed to the NuGet feed. Among them are the workload manifests and the workload packs that those manifests reference. There is currently no enforced order in which these assets are uploaded, so a manifest on the feed can refer to packages that do not exist there yet. This causes temporary failures of the dotnet workload install command.

Example:

Run dotnet workload install macos ios
  dotnet workload install macos ios
  shell: /bin/bash -e {0}
  env:
    DOTNET_ROOT: /Users/teamcity/.dotnet

Skip NuGet package signing validation. NuGet signing validation is not available on Linux or macOS https://aka.ms/workloadskippackagevalidation .
Updated advertising manifest microsoft.net.sdk.tvos.
Updated advertising manifest microsoft.net.sdk.maui.
Updated advertising manifest microsoft.net.sdk.maccatalyst.
Updated advertising manifest microsoft.net.sdk.ios.
Updated advertising manifest microsoft.net.workload.emscripten.
Updated advertising manifest microsoft.net.sdk.android.
Updated advertising manifest microsoft.net.sdk.macos.
Updated advertising manifest microsoft.net.workload.mono.toolchain.
Installing pack Microsoft.NET.Runtime.MonoAOTCompiler.Task version 6.0.2...
Writing workload pack installation record for Microsoft.NET.Runtime.MonoAOTCompiler.Task version 6.0.2...
Installing pack Microsoft.NET.Runtime.MonoTargets.Sdk version 6.0.2...
Writing workload pack installation record for Microsoft.NET.Runtime.MonoTargets.Sdk version 6.0.2...
Installing pack Microsoft.NETCore.App.Runtime.AOT.Cross.ios-arm version 6.0.2...
Workload installation failed. Rolling back installed packs...
Rolling back pack Microsoft.NET.Runtime.MonoAOTCompiler.Task installation...
Uninstalling workload pack Microsoft.NET.Runtime.MonoAOTCompiler.Task version 6.0.2…
Rolling back pack Microsoft.NET.Runtime.MonoTargets.Sdk installation...
Uninstalling workload pack Microsoft.NET.Runtime.MonoTargets.Sdk version 6.0.2…
Rolling back pack Microsoft.NETCore.App.Runtime.AOT.Cross.ios-arm installation...
Workload installation failed: microsoft.netcore.app.runtime.aot.osx-x64.cross.ios-arm::6.0.2 is not found in NuGet feeds https://nuget.pkg.github.com/emclient/index.json;https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-eng/nuget/v3/index.json;https://api.nuget.org/v3/index.json;https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet7/nuget/v3/index.json".
Error: Process completed with exit code 1.

About this issue

  • State: open
  • Created 2 years ago

Most upvoted comments

Thanks for the detailed response @joelverhagen. Glad to know there's a 2-minute validation window during which we can unlist.

I met with our release team, and we have a low-cost proposal for the next release. The plan is to publish all .NET packages except the manifest packages, poll for availability (which I understand they already do), and then publish the manifest packages as a separate step. The polling may add some delay, and there is still potential for customer impact, since the dozen or so manifests could still light up at different times for different customers. However, it would shrink that window from 30+ minutes to something much smaller.
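
A minimal sketch of that two-phase plan in Python. It assumes manifest packages can be recognized by the ".Manifest-" segment in their ID (as in Microsoft.NET.Sdk.iOS.Manifest-7.0.100), and that push and is_available are hypothetical callables supplied by the release tooling. The poll interval and timeout are illustrative values, not the release team's settings.

```python
import time
from typing import Callable, Iterable


def is_manifest(package_id: str) -> bool:
    # Assumption: workload manifests follow the naming pattern seen in this
    # thread, e.g. "Microsoft.NET.Sdk.iOS.Manifest-7.0.100".
    return ".Manifest-" in package_id


def two_phase_publish(
    packages: Iterable[tuple[str, str]],
    push: Callable[[str, str], None],          # hypothetical push hook
    is_available: Callable[[str, str], bool],  # hypothetical feed check
    poll_interval: float = 30.0,
    timeout: float = 3600.0,
) -> list[tuple[str, str]]:
    """Push packs, wait until all are available, then push manifests.

    Returns the (id, version) pairs in the order they were pushed.
    """
    packages = list(packages)
    packs = [p for p in packages if not is_manifest(p[0])]
    manifests = [p for p in packages if is_manifest(p[0])]
    pushed: list[tuple[str, str]] = []

    # Phase 1: push every non-manifest package; order within this phase is free.
    for pkg in packs:
        push(*pkg)
        pushed.append(pkg)

    # Poll until every pack is visible on the feed, or give up at the deadline.
    deadline = time.monotonic() + timeout
    pending = set(packs)
    while pending:
        pending = {p for p in pending if not is_available(*p)}
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still unavailable: {sorted(pending)}")
        time.sleep(poll_interval)

    # Phase 2: only now push the manifests that reference those packs.
    for pkg in manifests:
        push(*pkg)
        pushed.append(pkg)
    return pushed
```

The manifests still become visible independently of each other, so this narrows the window described above rather than closing it.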

If that goes well, we'll continue with that option. If there are still issues, we can explore pushing the manifests unlisted and then listing all of them together. I'll be meeting with some MAUI folks later today and will suggest they follow the same pattern, since their publish is currently a separate process from the core .NET package publish.

This problem also shows up with version wildcards and transitive dependencies in our regular dev pipelines; NuGet really needs a transactional publish: https://github.com/dotnet/performance/issues/3164

I'm hitting this right now with the current rollout of the ios workload.

sudo dotnet workload install ios
...
Workload(s) 'ios' are already installed.
Skipping NuGet package signature verification.
Installing workload manifest microsoft.net.sdk.ios version 16.2.1024…
Installing workload manifest microsoft.net.sdk.maccatalyst version 16.2.1024…
Installing workload manifest microsoft.net.sdk.macos version 13.1.1024…
Installing workload manifest microsoft.net.sdk.tvos version 16.1.1521…
Installing pack Microsoft.iOS.Sdk version 16.2.1024...
Writing workload pack installation record for Microsoft.iOS.Sdk.net7 version 16.2.1024...
Installing pack Microsoft.iOS.Sdk version 16.2.19...
Workload installation failed. Rolling back installed packs...
Rolling back pack Microsoft.iOS.Sdk installation...
Rolling back pack Microsoft.iOS.Sdk installation...
Uninstalling workload pack Microsoft.iOS.Sdk.net7 version 16.2.1024…
Workload installation failed: microsoft.ios.sdk::16.2.19 is not found in NuGet feeds https://api.nuget.org/v3/index.json".

On closer inspection, it appears that Microsoft.NET.Sdk.iOS.Manifest-7.0.100 version 16.2.1024 contains:

"packs": {
    "Microsoft.iOS.Sdk.net7": {
        "kind": "sdk",
        "version": "16.2.1024",
        "alias-to": {
            "any": "Microsoft.iOS.Sdk"
        }
    },
    "Microsoft.iOS.Sdk.net6": {
        "kind": "sdk",
        "version": "16.2.19",
        "alias-to": {
            "any": "Microsoft.iOS.Sdk"
        }
    },

It also appears that Microsoft.iOS.Sdk version 16.2.1024 has been published, but version 16.2.19 has not been published yet.


I suggest adding validation so that a manifest cannot be published until all of the packs it references have been published.
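
Such a validation step could look roughly like the sketch below. It reads the "packs" section of a workload manifest, maps each pack to the package that would actually be downloaded (following "alias-to", as the log above does when it installs Microsoft.iOS.Sdk for the Microsoft.iOS.Sdk.net6 pack), and reports references that are not yet published. Assumptions: only one RID key of "alias-to" is consulted (default "any"), versions are compared as already-normalized strings, and the list of published packages comes from some hypothetical feed query.

```python
import json
from typing import Iterable


def referenced_packages(manifest_json: str, rid: str = "any") -> set[tuple[str, str]]:
    """Return the (lowercased package id, version) pairs a manifest points at."""
    packs = json.loads(manifest_json).get("packs", {})
    refs = set()
    for pack_id, pack in packs.items():
        # A pack with "alias-to" is fetched under the aliased package ID.
        # Assumption: only the given RID key is looked up; otherwise the pack ID is used.
        package_id = pack.get("alias-to", {}).get(rid, pack_id)
        # NuGet package IDs are case-insensitive, so compare them lowercased.
        refs.add((package_id.lower(), pack["version"]))
    return refs


def missing_packages(manifest_json: str,
                     published: Iterable[tuple[str, str]],
                     rid: str = "any") -> list[tuple[str, str]]:
    available = {(pid.lower(), ver) for pid, ver in published}
    return sorted(referenced_packages(manifest_json, rid) - available)


def can_publish_manifest(manifest_json: str,
                         published: Iterable[tuple[str, str]],
                         rid: str = "any") -> bool:
    # Gate: allow the manifest only when nothing it references is missing.
    return not missing_packages(manifest_json, published, rid)
```

Fed the two packs quoted above and a feed that only contains Microsoft.iOS.Sdk 16.2.1024, missing_packages reports ("microsoft.ios.sdk", "16.2.19"), which matches the failure in the log.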

I agree that workloads are a special case here because of the way advertising manifests are automatically updated during restore.

Maybe we can skip a manifest during that automatic update until it is at least X hours old?
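
As a rough illustration of that idea, the filter below applies an advertising-manifest update only once the new manifest version has been public for a minimum age; younger candidates are skipped and picked up by a later update. The function name, the 6-hour default standing in for "X hours", and the source of the publish timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

# Hypothetical stand-in for "at least X hours old".
MIN_AGE = timedelta(hours=6)


def updates_to_apply(
    candidates: Iterable[tuple[str, str, datetime]],
    now: datetime,
    min_age: timedelta = MIN_AGE,
) -> list[tuple[str, str]]:
    """candidates: (manifest id, version, publish time) for newer manifests.

    Returns only the (id, version) pairs old enough to apply now.
    """
    return [
        (manifest_id, version)
        for manifest_id, version, published_at in candidates
        if now - published_at >= min_age
    ]
```

The trade-off is that every legitimate update is delayed by min_age, and the check only helps if the referenced packs finish validation within that window.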

As @baronfel mentioned, the availability order of packages on NuGet.org is not guaranteed, even if packages are pushed in some clever (e.g. reverse-dependency) order. This is because our asynchronous validation pipeline (malware scanning, signing, and more) can take a different amount of time for each package. Imagine a package in the middle of the dependency graph that takes a bit longer to malware-scan, or that is owned by another team that runs its push at a slightly different time (this happens more than you'd think). It's a messy problem.

It's not currently feasible to fix the availability order unless the actor pushing the packages polls for availability before continuing with further pushes. Done naively, this would have horrible throughput (average validation time multiplied by the total number of packages to push, i.e. many hours for big .NET releases). It would be possible to do a topological sort to identify sets of packages that can be pushed together, but this is all a hack/workaround for a feature gap on NuGet.org.
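
The topological-sort workaround can be sketched with Kahn's algorithm grouped into levels: each layer holds packages whose dependencies all sit in earlier layers, so a publisher could push a whole layer, poll until it is available, and then move on. The dependency map (each package mapped to the packages it needs, e.g. a manifest mapped to the packs it references) is an assumed input; this is an illustration, not how the .NET release pipeline works today.

```python
from collections import defaultdict


def push_layers(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group packages into layers; each layer depends only on earlier layers."""
    # Include packages that only ever appear as dependencies.
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    remaining = {n: set(deps.get(n, ())) for n in nodes}
    dependents: dict[str, set[str]] = defaultdict(set)
    for n, ds in remaining.items():
        for d in ds:
            dependents[d].add(n)

    layer = sorted(n for n, ds in remaining.items() if not ds)
    layers: list[list[str]] = []
    done = 0
    while layer:
        layers.append(layer)
        done += len(layer)
        nxt = set()
        for n in layer:
            # n is now "available": release packages that were waiting on it.
            for m in dependents[n]:
                remaining[m].discard(n)
                if not remaining[m]:
                    nxt.add(m)
        layer = sorted(nxt)

    if done != len(nodes):
        raise ValueError("dependency cycle detected")
    return layers
```

With layering, the number of push-then-poll rounds equals the number of layers (the length of the longest dependency chain) rather than the number of packages.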

This general feature request is already tracked in https://github.com/NuGet/NuGetGallery/issues/3931. Feel free to add comments or upvote it. I think the solution described there ("staging") is the proper fix for the problem, but it's a lot of work and needs some exploration with stakeholders to make sure staging works as needed. I can say it's not on our radar for the next 6 months given our other priorities and team capacity.

We could potentially control the order of the NuGet publish for the runtime workloads so that the packs go first and the manifests follow. We could also coordinate a delayed manifest publish. We'd still have a problem if someone picked up one of the manifests but not the others: there are half a dozen of them, so a user could end up in a partial state. Does nuget.org have the ability to publish packages without making them visible, so that we could make them all visible at once?

Responding to @marcpopMSFT's https://github.com/dotnet/sdk/issues/23820#issuecomment-1421757609: it is possible to publish packages as unlisted. You can do this by using the "unlist" gesture (nuget.exe, .NET CLI, or API endpoint) immediately after pushing, while the package is still in the validating state. The package will always still be validating at that point, since validation takes 2+ minutes and you can unlist as soon as the push request completes. Then, once you have detected that the packages should be listed (i.e. the full dependency graph is available), you can use the "relist" gesture (no CLI support, but it is supported by the API).

As far as I know, nuget.org doesn't support pushing a package in the unlisted state, but there's an API to unlist that we could call as soon as we push the manifest package, and then relist it later once all the other packages are pushed.

Responding to @akoeplinger’s https://github.com/dotnet/sdk/issues/23820#issuecomment-1424349420, yes that’s right. As mentioned to Marc above, you can unlist immediately after push so it can work today. Feel free to open an issue about enhancing the push protocol to include a “listed = false” parameter. This would remove the additional round trip, eliminate any crazy race condition caused by a slow/errored unlist or hyper fast validation, and provide parity with the UI upload flow which currently allows uploading a package as unlisted.
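
A sketch of the push, unlist, then relist sequence. It targets nuget.org's package publish endpoint (https://www.nuget.org/api/v2/package), where a DELETE of {ID}/{VERSION} unlists a package and a POST relists it, with the API key in the X-NuGet-ApiKey header. The push, send, and wait_for_dependencies callables are hypothetical hooks (send could be urllib.request.urlopen), and the race conditions mentioned above (a slow or failed unlist) are not handled.

```python
from typing import Callable, Iterable
from urllib.request import Request

# nuget.org's package publish endpoint; other feeds advertise their own
# PackagePublish/2.0.0 URL in the service index.
PUBLISH_BASE = "https://www.nuget.org/api/v2/package"


def unlist_request(package_id: str, version: str, api_key: str,
                   base: str = PUBLISH_BASE) -> Request:
    # nuget.org treats DELETE on an individual package version as "unlist".
    return Request(f"{base}/{package_id}/{version}", method="DELETE",
                   headers={"X-NuGet-ApiKey": api_key})


def relist_request(package_id: str, version: str, api_key: str,
                   base: str = PUBLISH_BASE) -> Request:
    # POST with an empty body relists a previously unlisted package version.
    return Request(f"{base}/{package_id}/{version}", data=b"", method="POST",
                   headers={"X-NuGet-ApiKey": api_key})


def staged_manifest_release(
    manifests: Iterable[tuple[str, str]],
    api_key: str,
    push: Callable[[str, str], None],
    send: Callable[[Request], object],
    wait_for_dependencies: Callable[[], None],
) -> None:
    manifests = list(manifests)
    for package_id, version in manifests:
        push(package_id, version)
        # Unlist right away, while the package is still being validated.
        send(unlist_request(package_id, version, api_key))
    # Hypothetical hook: block until every referenced pack is available.
    wait_for_dependencies()
    # Relist all manifests in quick succession so they appear close together.
    for package_id, version in manifests:
        send(relist_request(package_id, version, api_key))
```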

FYI, the best way to detect if a version is available is to use this API endpoint (a HEAD request should be fine): https://learn.microsoft.com/en-us/nuget/api/package-base-address-resource#download-package-content-nupkg. This endpoint has good performance and availability globally and properly caches 200s but not 404s.
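
A minimal availability check against that package content endpoint. It assumes nuget.org's base address (https://api.nuget.org/v3-flatcontainer/; other feeds advertise theirs in the service index as PackageBaseAddress/3.0.0) and an already-normalized version string. The ID and version are lowercased, as the endpoint's URL scheme requires.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# nuget.org's PackageBaseAddress/3.0.0 "@id".
BASE_ADDRESS = "https://api.nuget.org/v3-flatcontainer/"


def nupkg_url(package_id: str, version: str, base: str = BASE_ADDRESS) -> str:
    # {base}/{lower id}/{lower version}/{lower id}.{lower version}.nupkg
    pid, ver = package_id.lower(), version.lower()
    return f"{base.rstrip('/')}/{pid}/{ver}/{pid}.{ver}.nupkg"


def is_available(package_id: str, version: str,
                 base: str = BASE_ADDRESS, timeout: float = 10.0) -> bool:
    req = Request(nupkg_url(package_id, version, base), method="HEAD")
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except HTTPError as err:
        # 404 means "not available yet"; any other error is surfaced.
        if err.code == 404:
            return False
        raise
```

Since 404s are not cached, repeated HEAD requests in a polling loop should see the package as soon as it becomes available.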

Yeah, the underlying issue here is that workloads conceptually provide a mechanism to tie multiple packages together as a unit and our nuget publishing pipeline has no concept of that dependency at any level. To really handle this we’d need something like a staged transaction to the nuget database.