microsoft-authentication-library-for-dotnet: [Bug] Large numbers of concurrent token refresh attempts cause a cache refresh convoy resulting in chronic 429 errors
Describe the bug
In a scenario in which an application does the following:
- Utilizes a WithAppTokenProvider with a callback configured to fetch tokens from a managed identity endpoint
- Issues hundreds of GetToken requests simultaneously
- Optionally increases MaxRetries in the HttpClient pipeline used to fetch tokens
When such an application encounters a 429 response from the MI endpoint, this can result in a storm of requests and retry requests making the 429 problem worse. In addition, given the current behavior in MSAL for retries and cache access, all retries are guaranteed to result in a cache miss and will continue to fail as long as the MI endpoint does not return a successful token response.
Expected behavior
Retry attempts after the token cache is successfully refreshed should succeed via a cache hit rather than through a network request to the MI endpoint or authority. Only one request should be made to the endpoint to refresh the cache for any given cache entry and all other concurrent requests should consume that single result.
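To illustrate the expected behavior, here is a minimal sketch of a "single-flight" refresh: concurrent callers asking for the same cache entry share one in-flight request. This is not how MSAL is implemented. `SingleFlightTokenCache`, `GetOrRefreshAsync`, and the string token are hypothetical names for illustration, and the sketch ignores token expiry entirely.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of MSAL or Azure.Identity): collapses
// concurrent refreshes of the same cache key into a single endpoint call.
public sealed class SingleFlightTokenCache
{
    private readonly ConcurrentDictionary<string, Lazy<Task<string>>> _entries =
        new ConcurrentDictionary<string, Lazy<Task<string>>>();

    public Task<string> GetOrRefreshAsync(string key, Func<Task<string>> fetch)
    {
        // GetOrAdd may construct several Lazy instances under contention, but
        // only the stored one is ever evaluated, so fetch runs once per entry.
        var lazy = _entries.GetOrAdd(
            key, k => new Lazy<Task<string>>(() => RunAsync(k, fetch)));
        return lazy.Value;
    }

    private async Task<string> RunAsync(string key, Func<Task<string>> fetch)
    {
        try
        {
            return await fetch().ConfigureAwait(false);
        }
        catch
        {
            // Drop the failed entry so a later retry can call the endpoint again.
            _entries.TryRemove(key, out _);
            throw;
        }
    }
}

public static class SingleFlightDemo
{
    public static async Task Main()
    {
        int endpointCalls = 0;
        var cache = new SingleFlightTokenCache();

        // Stand-in for the managed identity endpoint call (assumption).
        async Task<string> Fetch()
        {
            Interlocked.Increment(ref endpointCalls);
            await Task.Delay(50);
            return "token";
        }

        var results = await Task.WhenAll(
            Enumerable.Range(0, 200).Select(_ => cache.GetOrRefreshAsync("scope", Fetch)));

        Console.WriteLine($"callers={results.Length}, endpoint calls={endpointCalls}");
    }
}
```

With this shape, a 429 on the shared request fails all current waiters at once, but the next retry finds either a fresh in-flight request or a completed result instead of each caller hitting the endpoint independently.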
Actual behavior
Once the initial request fails with a retriable status code, no subsequent token request attempts to read the cache; each one results in an additional network request.
Reproduction Steps
Issue a large number of simultaneous GetToken requests with a ManagedIdentityCredential to induce a 429 response from the MI endpoint.
Environment
Customer example was in Service Fabric, but this should reproduce in any managed identity environment in which a 429 response is possible.
About this issue
- State: closed
- Created a year ago
- Comments: 27 (9 by maintainers)
Updating just the Microsoft.Identity.Client package didn’t help.
@bgavrilMS - just to make sure - when you wrote “Can you try to upgrade to use MSAL 5.54 or higher?” - you meant 4.54, right?
AFAIK, SQL Client only uses PCA from MSAL. And they seem to be using Azure.Identity for MI scenarios.
@bgavrilMS - it's a standard .NET Core web app that uses MS SQL as storage. As for the solution you described, it depends on what you mean by ‘block similar requests to the token issuer until one succeeds’. If by block you mean blocking the tasks that are making requests to the DB, then we would be replacing one problem with another that looks no different to external code (my code): we would still get very long request handling times or timeouts once every 24 hours (basically meaning the app is unavailable for a minute or two every 24h). If I may propose something, it would be better to spin up a single request for a new MI token a few minutes before the old one expires and let all standard calls (here, to the DB) keep using the old token; once that request finishes, it should silently replace the old MI token in the cache.
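The proposal in the comment above could look roughly like the following. This is only a sketch under stated assumptions: `ProactiveTokenCache`, `CachedToken`, and the `refreshWindow` parameter are hypothetical names, not MSAL or Azure.Identity APIs, and the sketch does not handle a token that has already expired.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed record CachedToken(string Value, DateTimeOffset ExpiresOn);

// Hypothetical helper: callers always get the current token immediately,
// and at most one background refresh runs when expiry is near.
public sealed class ProactiveTokenCache
{
    private readonly Func<Task<CachedToken>> _fetch;
    private readonly TimeSpan _refreshWindow;
    private CachedToken _current;
    private int _refreshing; // 0 = idle, 1 = refresh in flight

    public ProactiveTokenCache(
        CachedToken initial, Func<Task<CachedToken>> fetch, TimeSpan refreshWindow)
    {
        _current = initial;
        _fetch = fetch;
        _refreshWindow = refreshWindow;
    }

    // 'now' is passed in so the behavior is easy to exercise in tests.
    public CachedToken Get(DateTimeOffset now)
    {
        var token = Volatile.Read(ref _current);
        bool nearExpiry = token.ExpiresOn - now <= _refreshWindow;

        // Only the caller that flips the flag from 0 to 1 starts a refresh.
        if (nearExpiry && Interlocked.CompareExchange(ref _refreshing, 1, 0) == 0)
        {
            _ = RefreshAsync();
        }
        return token; // never waits on the network
    }

    private async Task RefreshAsync()
    {
        try
        {
            var fresh = await _fetch().ConfigureAwait(false);
            Volatile.Write(ref _current, fresh); // silently swap in the new token
        }
        catch (Exception)
        {
            // Keep serving the old token; a later Get call triggers another attempt.
        }
        finally
        {
            Interlocked.Exchange(ref _refreshing, 0);
        }
    }
}

public static class ProactiveRefreshDemo
{
    public static async Task Main()
    {
        var now = DateTimeOffset.UtcNow;
        int endpointCalls = 0;

        var cache = new ProactiveTokenCache(
            new CachedToken("old", now.AddMinutes(2)),
            async () =>
            {
                // Stand-in for the managed identity endpoint call (assumption).
                Interlocked.Increment(ref endpointCalls);
                await Task.Delay(50);
                return new CachedToken("new", now.AddHours(24));
            },
            TimeSpan.FromMinutes(5));

        // All of these return the old token without waiting.
        for (int i = 0; i < 100; i++) cache.Get(now);

        await Task.Delay(500); // give the background refresh time to finish
        Console.WriteLine($"{cache.Get(now).Value}, endpoint calls={endpointCalls}");
    }
}
```

The trade-off is that a failed refresh is only retried when another caller arrives, so a real implementation would also need a backoff policy and a fallback for the case where the old token expires before any refresh succeeds.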