azure-sdk-for-java: [BUG] Call to Key Vault may hang forever
I use azure-security-keyvault-secrets v4.3.6 in a Java 17 application to communicate with Azure Key Vault. The application receives an external call, does some processing, part of which is a call to Key Vault, and returns the result.
The code for Key Vault communication is as simple as:
// create client bean once
public SecretClient secretsClient(...) {
    return new SecretClientBuilder()
            .vaultUrl(...)
            .credential(...)
            .addPolicy(new TimeoutPolicy(...)) // 60 sec
            .httpLogOptions(new HttpLogOptions().setLogLevel(HttpLogDetailLevel.BASIC))
            .buildClient();
}
...
// make calls for secret
public String fetch(String secretName) {
    try {
        return secretClient.getSecret(secretName).getValue();
    } catch (ResourceNotFoundException e) {
        ...
    }
}
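For completeness, the same fetch can also be expressed through the async client with an explicit bound on the wait. This is only a sketch (the vault URL and credential below are placeholders), but it matters in light of the behaviour described below:

import java.time.Duration;

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.security.keyvault.secrets.SecretAsyncClient;
import com.azure.security.keyvault.secrets.SecretClientBuilder;
import com.azure.security.keyvault.secrets.models.KeyVaultSecret;

public class BoundedSecretFetch {

    // Placeholder URL/credential; in practice the same values as the sync client above.
    private final SecretAsyncClient asyncClient = new SecretClientBuilder()
            .vaultUrl("https://<my-vault>.vault.azure.net")
            .credential(new DefaultAzureCredentialBuilder().build())
            .buildAsyncClient();

    public String fetch(String secretName) {
        // block(Duration) throws IllegalStateException when the bound is exceeded,
        // instead of parking the calling thread indefinitely.
        KeyVaultSecret secret = asyncClient.getSecret(secretName)
                .block(Duration.ofSeconds(60));
        return secret == null ? null : secret.getValue();
    }
}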
For the last two months, on different environments (stage, prod, different regions), I have periodically been hitting this issue:
- The application stops responding
- The application utilizes 100% CPU
- In the thread dump I can see that all worker threads are waiting for a Key Vault response (seemingly forever; there are no timeout exceptions):
"..." #32 prio=5 os_prio=0 cpu=22.22ms elapsed=244.85s tid=0x00007f3ce8061790 nid=0x2a waiting on condition [0x00007f3cdecfa000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
- parking to wait for <0x00000000c14c4a20> (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.2/AbstractQueuedSynchronizer.java:1047)
at java.util.concurrent.CountDownLatch.await(java.base@17.0.2/CountDownLatch.java:230)
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
at reactor.core.publisher.Mono.block(Mono.java:1707)
at com.azure.security.keyvault.secrets.SecretClient.getSecretWithResponse(SecretClient.java:228)
at com.azure.security.keyvault.secrets.SecretClient.getSecret(SecretClient.java:198)
…while the reactor threads seem to be very busy with reactor internals, which probably accounts for the 100% CPU: reactor_thread_dump.txt
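My reading of the blocked-thread stack trace above (a minimal illustration, not SDK code): the sync client ends up in Mono.block() with no duration, and such a block parks the calling thread on a latch until the Mono signals, so there is nothing on that path that could ever raise a timeout:

import java.time.Duration;

import reactor.core.publisher.Mono;

public class BlockDemo {
    public static void main(String[] args) {
        // A bounded block gives up with an IllegalStateException once the duration elapses.
        try {
            Mono.never().block(Duration.ofSeconds(2));
        } catch (IllegalStateException e) {
            System.out.println("bounded block timed out as expected");
        }

        // A parameterless block parks the thread on a CountDownLatch indefinitely,
        // which matches the WAITING state of the worker threads above.
        // Mono.never().block();   // never returns if the Mono never signals
    }
}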
Just restarting the pods didn't help. After some attempts and experiments I was able to overcome the issue with the following steps:
- In the portal, turn the Key Vault firewall off and back on
- Wait a couple of minutes
- Restart the pods
I've done this several times, so even if the steps seem messy, this really does work for me.
There are probably some 'accessibility issues' with Key Vault, but the client library seems to handle them in a less-than-ideal way:
- The timeout did not fail the response
- After the 'accessibility issues' with Key Vault are gone, a pod restart is still needed to recover the worker threads' state
Any ideas or suggestions about what could be wrong, or how to avoid this issue in the future, are very welcome.
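One mitigation I can think of, sketched below (not verified against this bug): run the Key Vault call on a dedicated executor and bound the wait from the request thread. It does not fix the hang itself, and a stuck task still occupies an executor thread, but at least request threads stop piling up behind it.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.azure.security.keyvault.secrets.SecretClient;

public class GuardedSecretFetch {

    // Pool size is an arbitrary placeholder; the client is the bean built earlier.
    private final ExecutorService keyVaultPool = Executors.newFixedThreadPool(4);
    private final SecretClient secretClient;

    public GuardedSecretFetch(SecretClient secretClient) {
        this.secretClient = secretClient;
    }

    public String fetch(String secretName) {
        Future<String> pending = keyVaultPool.submit(
                () -> secretClient.getSecret(secretName).getValue());
        try {
            // The request thread waits at most 60 seconds and then gives up.
            return pending.get(60, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // best effort; a thread parked in block() may stay stuck
            throw new IllegalStateException("Key Vault call exceeded 60s", e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Key Vault call failed", e);
        }
    }
}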
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 7
- Comments: 27 (15 by maintainers)
@werder06
My team sees the same issue as you, except in our case it happens when deleting secrets from parallel threads. The deletion seems to hang forever, and the app consumes 100% CPU.
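To make the scenario more concrete, the deletes are issued roughly like this (a simplified sketch rather than exact code; the pool size and secret names are placeholders):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.azure.security.keyvault.secrets.SecretClient;

public class ParallelDeletes {

    // Several worker threads each start (and wait on) a delete poller.
    static void deleteAll(SecretClient secretClient, List<String> secretNames) {
        ExecutorService pool = Executors.newFixedThreadPool(5); // placeholder size
        for (String name : secretNames) {
            pool.submit(() -> {
                // beginDeleteSecret returns a SyncPoller; waitForCompletion()
                // polls until the secret is deleted (or, in our case, hangs).
                secretClient.beginDeleteSecret(name).waitForCompletion();
            });
        }
        pool.shutdown();
    }
}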
Could you please answer a few questions to help us understand the next troubleshooting steps:
- Do you use SecretAsyncClient instead of SecretClient?

Our stack traces:
In the thread dump we're seeing 5 "boundedElastic-N" threads with keyvault.secrets.SecretClient.beginDeleteSecret in the stack trace. 4 consume little CPU, and 1 consumes a lot. Their stack traces are a little different.

4 threads are WAITING and look like this:
And one very CPU-active thread is RUNNABLE and looks like this:
We've merged a PR with a fix for this issue when it involves sharing an AccessTokenCache. Also, after reviewing your logs one more time, I think that was the issue in your case, @gukoff. Sorry if I misled you early on when I mentioned PollerFlux. The fix should be released in a couple of weeks.
Actually, in your particular case, @gukoff, I think it's related to PollerFlux. Like you said, it likely comes down to having multiple threads attempting to do the same thing while calling Mono::repeatWhenEmpty.

What is also interesting: we have 4 active CPU-consuming threads in total:
They all, and only they, have this line in their stack traces:
Could it indicate infinite looping through monos?
I see a few candidates for the infinite loop. Basically, whatever calls Mono::repeatWhenEmpty:
- AccessTokenCache.getToken will infinitely try to get the token from the credential.
- PowerShell.readData will infinitely wait for script output.
- PollerFlux will infinitely try to complete the activation operation as long as it returns Mono.empty().

The first one, getting the access token, looks like the most probable cause to me. Everybody in this thread seems to use ClientSecretCredential, and in our stack dump there is one WAITING thread "Thread-23" hanging at com.microsoft.aad.msal4j.AuthenticationResultSupplier.get(AuthenticationResultSupplier.java:18).

If there was a problem where ClientSecretCredential::getToken repeatedly returned Mono.empty(), it would explain the infinite CPU churn of FluxRepeatWhen. I suppose @vcolin7 came to a similar conclusion, looking at the first comments in this thread 😃 I wonder what could cause such a behavior?

Full stack trace of one of the threads: https://gist.github.com/gukoff/ccec9bc92d769bd1ee0cc483db1746e4
Full stack trace of the hanging identity thread: https://gist.github.com/gukoff/fbc94cfd3cb2fc7988453f474d095b77
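To illustrate the suspected mechanism in isolation (only the operator, not SDK code; the empty source stands in for a credential that keeps yielding nothing): when the decorated source always completes empty and the repeat companion re-subscribes immediately, the chain spins on scheduler threads without ever emitting or failing, and a caller blocked on it without a duration would wait forever.

import java.time.Duration;

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;

public class RepeatWhenEmptySpin {
    public static void main(String[] args) {
        // Stand-in for a call that keeps completing without a value.
        Mono<String> alwaysEmpty = Mono.<String>empty()
                .subscribeOn(Schedulers.boundedElastic());

        // Echoing the companion back means "re-subscribe immediately", so the chain
        // keeps looping on worker threads and burning CPU without making progress.
        Mono<String> spinning = alwaysEmpty.repeatWhenEmpty(companion -> companion);

        try {
            // Bounded here only so the demo terminates; the bounded block cancels the
            // loop on timeout. SecretClient blocks with no duration, so it never does.
            spinning.block(Duration.ofSeconds(2));
        } catch (IllegalStateException e) {
            System.out.println("gave up after 2s; until then the repeat loop was spinning");
        }
    }
}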
Sure thing @joshfree, will look into this first thing tomorrow 😃