azure-sdk-for-java: [BUG] Parallel upload of files can manifest with "hang forever" when a timeout happens
Describe the bug
UPDATE: after many iterations on this issue, @rickle-msft helped us determine that the cause was indeed that the timeout parameter of the uploadFromFile() call was too low. So the workaround is simply to not set a timeout, unless that is not a viable choice for some reason.
Meanwhile, the original bug report is narrowed in scope to “a timeout condition can manifest as hanging forever”. It is harder to hit this condition in the attached test, but I have found it is possible to get there with a timeout of 100 seconds.
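For anyone hitting the same thing, here is a minimal sketch of the workaround, assuming the six-argument uploadFromFile() overload and placeholder names (this is not code from the attached project):

```java
import com.azure.storage.file.datalake.DataLakeFileClient;

class UploadWithoutClientTimeout {
    // Sketch only: "fileClient" and "localPath" stand in for the real client and source file.
    static void upload(DataLakeFileClient fileClient, String localPath) {
        fileClient.uploadFromFile(
            localPath,
            null,   // ParallelTransferOptions (library defaults)
            null,   // PathHttpHeaders
            null,   // metadata
            null,   // DataLakeRequestConditions
            null);  // timeout: null means the call is not timed out on the client side
    }
}
```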
---- ORIGINAL REPORT ------
We are using the ADL2 client library to upload files to Azure storage. This seems to work when a single upload is run, but becomes unstable and often fails when parallel uploads are run. The attached test project:
- Creates a local 100MB temp file full of random bytes
- Uses ExecutorService to upload 8 blobs in parallel using this temp file as the source, by calling DataLakeFileClient.uploadFromFile().
In addition, leading up to the failure (or even in runs where the failure does not happen), there are large (30-second) gaps with no upload traffic at all. This appears to be some kind of fail/timeout/retry logic that is malfunctioning.
Exception or Stack Trace
Before the ultimate error, several messages are printed to the logger of the following form:
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x14842126, L:/192.168.58.132:64979 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x34702275, L:/192.168.58.132:64987 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x9ad3ee04, L:/192.168.58.132:64988 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x37569a46, L:/192.168.58.132:64985 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x374a1899, L:/192.168.58.132:64982 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xb26c03cf, L:/192.168.58.132:64986 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x64318283, L:/192.168.58.132:64984 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:34 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x4d37338f, L:/192.168.58.132:64983 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x34ee4913, L:/192.168.58.132:64977 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x58d7f006, L:/192.168.58.132:64981 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xc4af639d, L:/192.168.58.132:64976 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xc1930ecc, L:/192.168.58.132:64978 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x4a6ce04a, L:/192.168.58.132:64996 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x0a867a1f, L:/192.168.58.132:64995 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xe5d2ff75, L:/192.168.58.132:64997 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:44 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x7efe1ce9, L:/192.168.58.132:64980 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:02:54 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x9212a7ef, L:/192.168.58.132:65006 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:04 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x670ef84b, L:/192.168.58.132:65020 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:04 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x5945061c, L:/192.168.58.132:65016 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:15 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xc469aad6, L:/192.168.58.132:65025 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:15 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x5730f5fa, L:/192.168.58.132:65030 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:15 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x294da54d, L:/192.168.58.132:65031 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:15 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x56fb27ba, L:/192.168.58.132:65029 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:15 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xcedd4838, L:/192.168.58.132:65032 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:27 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x8eb45776, L:/192.168.58.132:65055 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:27 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x36e78814, L:/192.168.58.132:65052 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:27 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x598a9b79, L:/192.168.58.132:65056 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:41 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xb89445f0, L:/192.168.58.132:65066 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:03:59 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0x1140e3a0, L:/192.168.58.132:65084 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
Mar 06, 2022 4:05:05 PM io.netty.handler.ssl.SslHandler$9 run
WARNING: [id: 0xebd6a60b, L:/192.168.58.132:65113 - R:rpttest.dfs.core.windows.net/52.184.168.102:443] Last write attempt timed out; force-closing the connection.
The ultimate error and stack trace follow. Oddly, I have no idea where the 10000ms comes from: I don’t set that timeout and don’t see any place to set it, and the delays observed are much bigger than 10 seconds.
Exception in thread "main" java.util.concurrent.ExecutionException: reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 10000ms in 'source(MonoDefer)' (and no fallback has been configured)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at net.redpoint.test.Test1.main(Test1.java:82)
Caused by: reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 10000ms in 'source(MonoDefer)' (and no fallback has been configured)
at reactor.core.Exceptions.propagate(Exceptions.java:392)
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:137)
at reactor.core.publisher.Mono.block(Mono.java:1731)
at com.azure.storage.common.implementation.StorageImplUtils.blockWithOptionalTimeout(StorageImplUtils.java:130)
at com.azure.storage.file.datalake.DataLakeFileClient.uploadFromFile(DataLakeFileClient.java:405)
at net.redpoint.test.Test1.lambda$main$0(Test1.java:77)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
at java.util.concurrent.FutureTask.run(FutureTask.java)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.Exception: #block terminated with an error
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:139)
... 9 more
Caused by: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 10000ms in 'source(MonoDefer)' (and no fallback has been configured)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.handleTimeout(FluxTimeout.java:295)
at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.doTimeout(FluxTimeout.java:280)
at reactor.core.publisher.FluxTimeout$TimeoutTimeoutSubscriber.onNext(FluxTimeout.java:419)
at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:79)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.propagateDelay(MonoDelay.java:271)
at reactor.core.publisher.MonoDelay$MonoDelayRunnable.run(MonoDelay.java:286)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
at java.util.concurrent.FutureTask.run(FutureTask.java)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
... 3 more
To Reproduce
Steps to reproduce the behavior:
Build and run the attached Maven project. You’ll need to fill in the account, access-key, and container. Note we are uploading across a WAN, but the available bandwidth is usually at least 400 Mbps.
If you use Windows Task Manager and watch network bandwidth, you will see large gaps with no upload traffic.
In my experience, reducing the executor to a single thread avoids the issue. But then there is no parallelism. And we don’t always have control over when files are uploaded by which caller. I would prefer not serializing all uploads with a mutex.
Code Snippet
See the complete attached project. The critical bit is DataLakeFileClient.uploadFromFile(); a minimal sketch of the pattern follows.
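This is only an illustration of the parallel uploadFromFile() calls, not the attached project verbatim; the account, key, file-system, and path values are placeholders:

```java
import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelUploadRepro {
    public static void main(String[] args) throws Exception {
        // Placeholders -- fill these in before running.
        String account = "<account>";
        String key = "<access-key>";
        String fileSystem = "<container>";
        String localPath = "<path-to-100MB-temp-file>";

        DataLakeServiceClient service = new DataLakeServiceClientBuilder()
            .endpoint("https://" + account + ".dfs.core.windows.net")
            .credential(new StorageSharedKeyCredential(account, key))
            .buildClient();
        DataLakeFileSystemClient fs = service.getFileSystemClient(fileSystem);

        // Upload the same local file to 8 different remote files in parallel.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final String remoteName = "parallel-upload-test-" + i;
            futures.add(pool.submit(() -> {
                DataLakeFileClient file = fs.getFileClient(remoteName);
                // The Duration here is the timeout that, when too low for the WAN,
                // surfaces as the TimeoutException / apparent hang described above.
                // Passing null instead is the workaround from the UPDATE.
                file.uploadFromFile(localPath, null, null, null, null,
                    Duration.ofSeconds(100));
                return null;
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // rethrows the ExecutionException shown in the stack trace
        }
        pool.shutdown();
    }
}
```

Switching Executors.newFixedThreadPool(8) to newSingleThreadExecutor() corresponds to the single-thread behavior mentioned below.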
Expected behavior
Files upload without error, and without gaps and delays.
Screenshots
n/a
Setup (please complete the following information):
- OS: Windows 10 pro
- IDE: IntelliJ
- Library/Libraries: See attached project pom.xml
- Java version: OpenJDK 1.8.0_211
- App Server/Environment: n/a
- Frameworks: n/a
- Maven 3.6.1
- Network: uploading from our own data center, WAN bandwidth is 500 Mbps peak.
I do not suspect a dependency issue, although it is possible that the netty version that is selected is not correct. See the pom.xml; it is 4.1.72.Final. If I allow the azure-sdk BOM to select its own netty version, I get https://github.com/Azure/azure-sdk-for-java/issues/27472
Additional context
Reducing the number of threads in the ExecutorService to one works around the problem and avoids the warnings and gaps, but it is slower and not really a viable fix. We really don’t want to serialize all uploads (that will reduce responsiveness for small uploads). And what happens if multiple uploads run from separate processes? Will it still fail?
About this issue
- Original URL
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 30 (11 by maintainers)
At this point, if I were to speculate wildly on the root cause, it would be one of these:
We should start a “guess the root cause” betting pool 😃
john