microsoft-authentication-library-for-dotnet: [Bug] Large numbers of concurrent token refresh attempts cause a cache refresh convoy resulting in chronic 429 errors

Describe the bug

In a scenario in which an application does the following:

  • Uses WithAppTokenProvider with a callback configured to fetch tokens from a managed identity endpoint
  • Issues hundreds of GetToken requests simultaneously
  • Optionally increases MaxRetries in the HttpClient pipeline used to fetch tokens

When such an application receives a 429 response from the managed identity (MI) endpoint, the result can be a storm of requests and retries that makes the 429 problem worse. In addition, given MSAL's current retry and cache-access behavior, every retry is guaranteed to be a cache miss and will keep failing for as long as the MI endpoint does not return a successful token response.

Expected behavior

Retry attempts after the token cache is successfully refreshed should succeed via a cache hit rather than through a network request to the MI endpoint or authority. Only one request should be made to the endpoint to refresh the cache for any given cache entry and all other concurrent requests should consume that single result.
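
The expected behavior above is essentially a "single-flight" cache. The following is a minimal, language-neutral sketch of that idea in Python, not MSAL code and not the fix that was shipped. The class name `SingleFlightTokenCache`, the `fetch` callback, and the fake endpoint are all hypothetical and exist only for illustration.

```python
import asyncio
import time


class SingleFlightTokenCache:
    """Hypothetical cache that allows at most one in-flight refresh per key."""

    def __init__(self, fetch):
        self._fetch = fetch      # assumed async callable: key -> (token, expires_at)
        self._tokens = {}        # key -> (token, expires_at)
        self._inflight = {}      # key -> asyncio.Task of the refresh in progress

    async def get_token(self, key):
        cached = self._tokens.get(key)
        if cached and cached[1] > time.monotonic():
            return cached[0]     # cache hit: no network request
        task = self._inflight.get(key)
        if task is None:
            # First caller to miss starts the one refresh for this key.
            task = asyncio.create_task(self._refresh(key))
            self._inflight[key] = task
        # shield() keeps one cancelled caller from cancelling the shared refresh.
        return await asyncio.shield(task)

    async def _refresh(self, key):
        try:
            token, expires_at = await self._fetch(key)
            self._tokens[key] = (token, expires_at)
            return token
        finally:
            # Success or failure, clear the slot so a later retry re-checks
            # the cache first and starts at most one new refresh.
            self._inflight.pop(key, None)


async def demo():
    calls = 0

    async def fake_endpoint(key):    # stand-in for the MI endpoint
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"token-for-{key}", time.monotonic() + 60

    cache = SingleFlightTokenCache(fake_endpoint)
    tokens = await asyncio.gather(*(cache.get_token("scope-a") for _ in range(500)))
    return calls, set(tokens)
```

Running `asyncio.run(demo())` issues 500 concurrent requests for the same key, and the fake endpoint is called once. If the shared refresh fails, every waiter sees the same exception, and a retry goes back through the cache check instead of going straight to the network.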

Actual behavior

Once the initial request fails with a retriable status code, no subsequent token request attempts to read the cache; each one results in an additional network request.

Reproduction Steps

Issue a large number of simultaneous GetToken requests with a ManagedIdentityCredential to induce a 429 response from the MI endpoint.

Environment

Customer example was in Service Fabric, but this should reproduce in any managed identity environment in which a 429 response is possible.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 27 (9 by maintainers)

Most upvoted comments

Updating just the Microsoft.Identity.Client package didn’t help.

@bgavrilMS - just to make sure - when you wrote “Can you try to upgrade to use MSAL 5.54 or higher?” - you meant 4.54, right?

@gladjohn - did the SQL folks integrate with MI via MSAL.NET? I guess it doesn’t matter: if they use Azure Identity, they still rely on MSAL for the token caching.

AFAIK, SQL Client only uses PCA from MSAL. And they seem to be using Azure.Identity for MI scenarios.

@bgavrilMS - standard .NET Core web app that uses MS SQL as storage. As for the solution you described, it depends on what you mean by ‘block similar requests to the token issuer until one succeeds’. If by block you mean blocking the tasks that carry requests to the DB, then we would be replacing one kind of problem with another that, to external code (my code), looks no different: we would still get very long request handling times or timeouts once every 24 hours (basically meaning the app is unavailable for a minute or two every 24h). If I may propose something, it would be better to start a single request for a new MI token a few minutes before the old one expires, let all standard calls (here, to the DB) keep using the old token, and have that request, once finished, silently replace the old MI token in the cache.
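
The proposal in the comment above (refresh shortly before expiry while callers keep using the old token) can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not MSAL's implementation: `ProactiveTokenCache`, `refresh_window`, the injectable `clock`, and the fake endpoint are hypothetical names.

```python
import asyncio
import time


class ProactiveTokenCache:
    """Hypothetical cache that refreshes in the background before expiry."""

    def __init__(self, fetch, refresh_window, clock=time.monotonic):
        self._fetch = fetch                  # assumed async callable -> (token, expires_at)
        self._refresh_window = refresh_window
        self._clock = clock
        self._token = None
        self._expires_at = 0.0
        self._refresh_task = None

    async def get_token(self):
        remaining = self._expires_at - self._clock()
        if self._token is None or remaining <= 0:
            # Nothing usable: callers must wait on the single shared refresh.
            return await asyncio.shield(self._start_refresh())
        if remaining <= self._refresh_window:
            self._start_refresh()            # background refresh, not awaited
        return self._token                   # old token is still valid

    def _start_refresh(self):
        if self._refresh_task is None:
            task = asyncio.create_task(self._refresh())
            # Mark any exception as retrieved so a failed background refresh
            # does not log "Task exception was never retrieved".
            task.add_done_callback(lambda t: t.cancelled() or t.exception())
            self._refresh_task = task
        return self._refresh_task

    async def _refresh(self):
        try:
            self._token, self._expires_at = await self._fetch()
            return self._token
        finally:
            self._refresh_task = None


async def demo():
    now = [0.0]                              # fake clock for the demo
    calls = 0

    async def fake_endpoint():               # stand-in for the MI endpoint
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return f"token-{calls}", now[0] + 100.0

    cache = ProactiveTokenCache(fake_endpoint, refresh_window=10.0,
                                clock=lambda: now[0])
    first = await cache.get_token()          # cold start: waits for the fetch
    now[0] = 95.0                            # 5s before expiry, inside the window
    during = await asyncio.gather(*(cache.get_token() for _ in range(200)))
    await asyncio.sleep(0.05)                # let the background refresh finish
    after = await cache.get_token()
    return first, set(during), after, calls
```

In this demo the 200 callers inside the refresh window all receive the old token immediately, a single background request is made, and later callers pick up the new token. Callers block only when no valid token exists at all, for example on a cold start or after repeated refresh failures.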