azure-sdk-for-java: [BUG] Call to key vault may hang forever

I use azure-security-keyvault-secrets v4.3.6 in a Java 17 application to communicate with Azure Key Vault. The application receives an external call, does some processing, part of which is a call to Key Vault, and returns the result.

The code for Key Vault communication is as simple as:

// create the client bean once
public SecretClient secretsClient(...) {
    return new SecretClientBuilder()
            .vaultUrl(...)
            .credential(...)
            .addPolicy(new TimeoutPolicy(...)) // 60 sec
            .httpLogOptions(new HttpLogOptions().setLogLevel(HttpLogDetailLevel.BASIC))
            .buildClient();
}

...
// make calls for a secret
public String fetch(String secretName) {
    try {
        return secretClient.getSecret(secretName).getValue();
    } catch (ResourceNotFoundException e) {
        ...
    }
}

For the last two months, on different environments (stage, prod, different regions), I've periodically been hitting this issue:

  • The application stops responding
  • The application uses 100% CPU
  • In the thread dump I can see that all worker threads are waiting for a Key Vault response (seemingly forever; there are no timeout exceptions):
"..." #32 prio=5 os_prio=0 cpu=22.22ms elapsed=244.85s tid=0x00007f3ce8061790 nid=0x2a waiting on condition  [0x00007f3cdecfa000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.2/Native Method)
	- parking to wait for  <0x00000000c14c4a20> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.2/LockSupport.java:211)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.2/AbstractQueuedSynchronizer.java:715)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.2/AbstractQueuedSynchronizer.java:1047)
	at java.util.concurrent.CountDownLatch.await(java.base@17.0.2/CountDownLatch.java:230)
	at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
	at reactor.core.publisher.Mono.block(Mono.java:1707)
	at com.azure.security.keyvault.secrets.SecretClient.getSecretWithResponse(SecretClient.java:228)
	at com.azure.security.keyvault.secrets.SecretClient.getSecret(SecretClient.java:198)

…while the reactor threads seem to be very busy with Reactor internals, which probably accounts for the 100% CPU: reactor_thread_dump.txt

Simply restarting the pods didn't help. After some attempts and experiments I was able to work around the issue with the following steps:

  • In the portal, turn the firewall for the key vault off and then back on
  • Wait a couple of minutes
  • Restart the pods

I've done this several times, so even though the steps seem messy, they really do work for me.

There are probably some ‘accessibility issues’ with the key vault, but it seems the client library doesn't handle them in the best way possible:

  • The timeout didn't fail the response
  • Even after the ‘accessibility issues’ with the key vault are gone, a pod restart is still needed to reset the worker threads' state

Any ideas or suggestions about what could be wrong, or how to avoid this issue in the future, are very welcome.
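
For now, one stopgap I'm experimenting with is running the blocking call on a separate executor and bounding how long the caller waits. This doesn't free the thread that is stuck inside the SDK; it only lets the request fail fast. A minimal sketch (the wrapper class and executor are my own, not part of the SDK):

import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.azure.security.keyvault.secrets.SecretClient;

// Hypothetical wrapper: bounds how long the caller waits for a secret.
// The underlying worker thread may still stay parked inside the SDK call.
public final class GuardedSecretFetcher {

    private final SecretClient secretClient;
    private final ExecutorService executor = Executors.newCachedThreadPool();

    public GuardedSecretFetcher(SecretClient secretClient) {
        this.secretClient = secretClient;
    }

    public String fetch(String secretName, Duration timeout) throws TimeoutException {
        Future<String> future = executor.submit(
                () -> secretClient.getSecret(secretName).getValue());
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching secret", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Secret fetch failed", e.getCause());
        } catch (TimeoutException e) {
            future.cancel(true); // best effort; the hung thread may remain parked
            throw e;
        }
    }
}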

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 7
  • Comments: 27 (15 by maintainers)

Most upvoted comments

@werder06

My team sees the same issue as you, except we hit it when deleting secrets from parallel threads. The delete seems to hang forever, and the app consumes 100% CPU.

Could you please answer a few questions to help us understand the next troubleshooting steps:

  • Did you manage to resolve this bug?
  • Do you use opentelemetry in the app?
  • Have you tried to use SecretAsyncClient instead of SecretClient? (a rough sketch follows the snippet below)
  • Have you tried adding HttpClient-level timeouts as follows:
var httpClientOptions = new HttpClientOptions();
httpClientOptions.setConnectTimeout(Duration.ofSeconds(1));
httpClientOptions.setReadTimeout(Duration.ofSeconds(1));
httpClientOptions.setResponseTimeout(Duration.ofSeconds(1));
httpClientOptions.setWriteTimeout(Duration.ofSeconds(1));
httpClientOptions.setConnectionIdleTimeout(Duration.ofSeconds(1));

var client = new SecretClientBuilder()
    .vaultUrl(VaultUrl)
    .addPolicy(new TimeoutPolicy(Duration.ofSeconds(100)))
    .credential(credential)
    .clientOptions(httpClientOptions)
    .buildClient();
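
Regarding the SecretAsyncClient question above, here is roughly what we had in mind (a sketch only; the vault URL and credential are placeholders). The Reactor-level timeout fails the pipeline even if the underlying call never completes:

import java.time.Duration;

import com.azure.core.credential.TokenCredential;
import com.azure.security.keyvault.secrets.SecretAsyncClient;
import com.azure.security.keyvault.secrets.SecretClientBuilder;

public class AsyncSecretFetch {
    public static String fetch(String vaultUrl, TokenCredential credential, String secretName) {
        SecretAsyncClient asyncClient = new SecretClientBuilder()
                .vaultUrl(vaultUrl)
                .credential(credential)
                .buildAsyncClient();

        // timeout() errors the Mono with a TimeoutException if no value arrives in time,
        // instead of parking the calling thread indefinitely.
        return asyncClient.getSecret(secretName)
                .map(secret -> secret.getValue())
                .timeout(Duration.ofSeconds(10))
                .block();
    }
}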

Our stack traces

In the thread dump we're seeing 5 “boundedElastic-N” threads with keyvault.secrets.SecretClient.beginDeleteSecret in the stack trace. Four consume little CPU and one consumes a lot; their stack traces differ slightly.

Four threads are WAITING and look like this:

"boundedElastic-12" #105 daemon prio=5 os_prio=0 cpu=164.26ms elapsed=4943.77s allocated=34273K defined_classes=75 tid=0x00007fa8cf06e800 nid=0x7c waiting on condition  [0x00007fa82b897000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
	- parking to wait for  <0x00000000e051c530> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.16/Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@11.0.16/Unknown Source)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@11.0.16/Unknown Source)
	at java.util.concurrent.CountDownLatch.await(java.base@11.0.16/Unknown Source)
	at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
	at reactor.core.publisher.Mono.block(Mono.java:1706)
	at com.azure.core.util.polling.PollerFlux.lambda$new$1(PollerFlux.java:289)
	at com.azure.core.util.polling.PollerFlux$$Lambda$2388/0x0000000841318c40.apply(Unknown Source)
	at com.azure.core.util.polling.DefaultSyncPoller.<init>(DefaultSyncPoller.java:71)
	at com.azure.core.util.polling.PollerFlux.getSyncPoller(PollerFlux.java:467)
	at com.azure.security.keyvault.secrets.SecretClient.beginDeleteSecret(SecretClient.java:337)

And one very CPU-active thread is RUNNABLE and looks like this:

"boundedElastic-11" #104 daemon prio=5 os_prio=0 cpu=2387754.05ms elapsed=4943.77s allocated=353G defined_classes=108 tid=0x00007fa8ce631000 nid=0x7b runnable  [0x00007fa82b996000]
   java.lang.Thread.State: RUNNABLE
	at reactor.core.publisher.FluxLiftFuseable.subscribeOrReturn(FluxLiftFuseable.java:58)
	at reactor.core.publisher.Flux.subscribe(Flux.java:8454)
	at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.drain(FluxConcatMap.java:449)
	at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.innerComplete(FluxConcatMap.java:296)
	at reactor.core.publisher.FluxConcatMap$ConcatMapInner.onComplete(FluxConcatMap.java:885)
	at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2400)
	at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.request(Operators.java:2158)
	at reactor.core.publisher.FluxConcatMap$ConcatMapImmediate.request(FluxConcatMap.java:330)
	at reactor.core.publisher.Operators$DeferredSubscription.request(Operators.java:1680)
	at reactor.core.publisher.FluxRepeatWhen$RepeatWhenMainSubscriber.onComplete(FluxRepeatWhen.java:164)
	at io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber$$Lambda$1617/0x0000000840fb3440.run(Unknown Source)
	at io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber.withActiveSpan(TracingSubscriber.java:79)
	at io.opentelemetry.javaagent.shaded.instrumentation.reactor.TracingSubscriber.onComplete(TracingSubscriber.java:68)
	at reactor.core.publisher.Operators.complete(Operators.java:137)
	at reactor.core.publisher.MonoEmpty.subscribe(MonoEmpty.java:46)
	at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
	at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
	at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
	at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
	at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
	at reactor.core.publisher.Mono.subscribe(Mono.java:4399)
	at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:255)
	at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
	at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
	at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52)
	at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:64)
	at reactor.core.publisher.MonoDeferContextual.subscribe(MonoDeferContextual.java:55)
	at reactor.core.publisher.Mono.subscribe(Mono.java:4399)
	at reactor.core.publisher.Mono.block(Mono.java:1705)
	at com.azure.core.util.polling.PollerFlux.lambda$new$1(PollerFlux.java:289)
	at com.azure.core.util.polling.PollerFlux$$Lambda$2388/0x0000000841318c40.apply(Unknown Source)
	at com.azure.core.util.polling.DefaultSyncPoller.<init>(DefaultSyncPoller.java:71)
	at com.azure.core.util.polling.PollerFlux.getSyncPoller(PollerFlux.java:467)
	at com.azure.security.keyvault.secrets.SecretClient.beginDeleteSecret(SecretClient.java:337)

We've merged a PR with a fix for this issue when it involves sharing an AccessTokenCache. Also, after reviewing your logs one more time, I think that was the issue in your case, @gukoff. Sorry if I misled you earlier when I mentioned PollerFlux.

The fix should be released in a couple of weeks.

Actually, in your particular case @gukoff, I think it's related to PollerFlux. Like you said, it likely comes down to having multiple threads attempting to do the same thing while calling Mono::repeatWhenEmpty.

What is also interesting is that we have 4 active CPU-consuming threads in total:

"boundedElastic-11" #104 daemon prio=5 os_prio=0 cpu=2387754.05ms elapsed=4943.77s
"reactor-http-epoll-1" #69 daemon prio=5 os_prio=0 cpu=2389104.61ms elapsed=4956.52s
"reactor-http-epoll-2" #70 daemon prio=5 os_prio=0 cpu=2388567.48ms elapsed=4956.52s
"reactor-http-epoll-4" #72 daemon prio=5 os_prio=0 cpu=2388859.10ms elapsed=4956.52s

All of them, and only they, have this line in their stack traces:

FluxRepeatWhen$RepeatWhenMainSubscriber.onComplete(FluxRepeatWhen.java:164)

Could it indicate infinite looping through monos?


I see a few candidates for the infinite loop. Basically, anything that calls Mono::repeatWhenEmpty:

  • getting an access token for Azure: AccessTokenCache.getToken will keep retrying the credential indefinitely.
  • getting PowerShell output with PowerShell.readData: it will wait indefinitely for script output.
  • subscribing to PollerFlux: it will keep retrying the activation operation for as long as it returns Mono.empty().

The first one, getting the access token, looks like the most probable cause to me. Everybody in this thread seems to use ClientSecretCredential, and in our stack dump there is one WAITING thread “Thread-23” hanging at “com.microsoft.aad.msal4j.AuthenticationResultSupplier.get(AuthenticationResultSupplier.java:18)”.

If there was a problem where ClientSecretCredential::getToken repeatedly returned Mono.empty(), it would explain the infinite CPU churn of FluxRepeatWhen. I suppose @vcolin7 came to a similar conclusion, looking at the first comments in this thread 😃 I wonder what could cause such a behavior?..
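
To illustrate the hypothesis with a toy Reactor snippet (this is not the SDK's actual code): a source that keeps completing empty makes repeatWhenEmpty resubscribe in a tight loop on a single thread.

import java.util.concurrent.atomic.AtomicLong;

import reactor.core.publisher.Mono;

public class RepeatWhenEmptyDemo {
    public static void main(String[] args) {
        AtomicLong subscriptions = new AtomicLong();

        // Stand-in for a credential that keeps completing without emitting a token.
        Mono<String> alwaysEmpty = Mono.defer(() -> {
            subscriptions.incrementAndGet();
            return Mono.empty();
        });

        // The companion re-emits on every empty completion, so the source is
        // resubscribed in a tight loop; take(100_000) only caps the demo, the real
        // pattern would spin at 100% CPU for as long as the source stays empty.
        String result = alwaysEmpty
                .repeatWhenEmpty(companion -> companion.take(100_000))
                .block();

        System.out.println("result=" + result + ", subscriptions=" + subscriptions.get());
    }
}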


Full stack trace of one of the threads: https://gist.github.com/gukoff/ccec9bc92d769bd1ee0cc483db1746e4
Full stack trace of the hanging identity thread: https://gist.github.com/gukoff/fbc94cfd3cb2fc7988453f474d095b77

Sure thing @joshfree, will look into this first thing tomorrow 😃