reactor-netty: First WebClient request always fails after some idle time
Expected behavior
WebClient HTTP request succeeds after 5 minutes of idle time.
Actual behavior
WebClient HTTP request fails after 5 minutes of idle time. If a read timeout is set, I get a ReadTimeoutException. If not, I get the following exception:
java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[na:1.8.0_181]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) ~[na:1.8.0_181]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_181]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_181]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_181]
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[netty-buffer-4.1.36.Final.jar:4.1.36.Final]
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125) ~[netty-buffer-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[netty-common-4.1.36.Final.jar:4.1.36.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.36.Final.jar:4.1.36.Final]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
Steps to reproduce
1. Start the application: https://github.com/martin-tarjanyi/webclient-bug/blob/master/src/main/java/com/example/webclientbug/WebclientBugApplication.java
2. Call http://localhost:8080/get - the request succeeds
3. Wait ~5 minutes
4. Call http://localhost:8080/get - the request fails
Reactor Netty version
0.8.9-RELEASE
JVM version (e.g. java -version)
1.8.0_181
OS version (e.g. uname -a)
Windows 10
Related issue
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 21 (10 by maintainers)
Hey all, just dropping some details about my experience with this. There are a handful of related issues in the reactor-netty repo that I dug up over time; I'll try to go through them and link this comment for visibility.
Issue Details
Specifically, the problem I was running into was that the application (which uses reactor-netty) was being routed through an Azure load balancer. The Azure load balancer has some questionable defaults where it:
1. drops connections after a 4-minute idle timeout, and
2. does not send a TCP RST (close notification) to either side when it does so.
My main issue is with #2. Azure's documentation describes how to enable the TCP reset setting, but it seems this is not the default (legacy reasons? the feature apparently wasn't available until about 2018).
When this flag is disabled the behavior is as follows: the load balancer silently drops (blackholes) the idle connection without notifying either side, so the client only discovers the connection is dead when its next request fails or times out.
When this flag is enabled the behavior is as follows: the load balancer sends a TCP RST when the idle timeout is reached, so the client knows the pooled connection is stale and can discard it.
So really for us the problematic part was that Azure never sent a close notification, so the client never knew the connection was stale. (The blackhole behavior, rather than closing immediately, wasn't great either; we had application-level retries that covered this, but it added a delay to API calls.)
Resolution
For our version of this issue, it could be resolved in a couple of different ways, but with caveats:
Use Reactor-Netty connection max idle time
You can configure a max idle time that is lower than the lowest max idle time of the remote network appliances. Here are some common ones we've seen: the Azure load balancer defaults to a 4-minute idle timeout, and other cloud load balancers and network appliances have their own (sometimes configurable) values, so check your specific infrastructure.
You can configure the Reactor Netty ConnectionProvider with the desired max idle time (adapt the other settings as you want); see the sketch at the end of this subsection.
Caveats
As you can see, the idle timeout varies depending on the provider, and the above only covers common cloud infrastructure. So it's hard to have a good default for this without just making it overly aggressive, i.e. setting it to < 60 seconds. Setting it to be aggressive means you're establishing new connections potentially more often than needed, i.e. more SSL handshakes etc. See the next section about using TCP keep-alive.
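The sketch referenced above: a minimal example, assuming a Reactor Netty version where the ConnectionProvider builder API is available (0.9.x/1.x); the pool name and timeout values are illustrative.

```java
import java.time.Duration;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class PooledHttpClientConfig {

    // Evict pooled connections before the most aggressive infrastructure idle
    // timeout we expect to hit (220s here, below Azure LB's 4-minute default).
    public static HttpClient createHttpClient() {
        ConnectionProvider provider = ConnectionProvider.builder("webclient-pool")
                .maxIdleTime(Duration.ofSeconds(220))  // drop idle connections early
                .maxLifeTime(Duration.ofMinutes(30))   // optional: also cap total connection lifetime
                .build();
        return HttpClient.create(provider);
    }
}
```

maxLifeTime is optional here; maxIdleTime is the setting that addresses the idle-timeout problem described above.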
Use TCP keep-alive + idle time options
Alternatively (or additionally) you can use TCP keep-alive and set TCP_KEEPIDLE to something aggressive enough that a keep-alive probe is sent before the idle timeout of your infrastructure kicks in. I.e. in the Azure case, if we broadcast a keep-alive after less than 4 minutes of idle time, this would prevent Azure from closing the connection. There is an example of these settings in the Reactor Netty documentation; a sketch along those lines follows the caveats below. Note it may make sense to call out in the documentation that this isn't supported on Windows (beyond just needing Java 11) - I can submit a PR for this (edit: see https://github.com/reactor/reactor-netty/pull/1981).
Caveats
The keep-alive idle/interval/count options are only supported on macOS and Linux, and only with Java 11 or later.
For more information, see the JDK change that added this. If you look at the resolution, it was only added for macOS and Linux; it is still not supported on Windows as of Java 17.
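As an illustration of those settings, here is a sketch roughly along the lines of the documentation example, assuming Reactor Netty 1.x with the NIO transport on Java 11+ (Linux/macOS); the probe timings are illustrative and should sit below your infrastructure's idle timeout.

```java
import io.netty.channel.ChannelOption;
import io.netty.channel.socket.nio.NioChannelOption;
import jdk.net.ExtendedSocketOptions;
import reactor.netty.http.client.HttpClient;

public class KeepAliveHttpClientConfig {

    // Enable TCP keep-alive and tune the probe timings so the first probe goes
    // out well before the load balancer's idle timeout (e.g. Azure's 4 minutes).
    public static HttpClient createHttpClient() {
        return HttpClient.create()
                .option(ChannelOption.SO_KEEPALIVE, true)
                // start sending keep-alive probes after 3 minutes of idle time
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 180)
                // then probe every 60 seconds
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 60)
                // give up after 8 unanswered probes
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 8);
    }
}
```

On Linux with Netty's native epoll transport, the equivalent EpollChannelOption constants (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) can be used instead.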
You could enable TCP keep-alive generically (which is supported on all platforms, even on Java 8) and then configure TCP_KEEPIDLE (the idle time before keep-alive probes are issued) at the operating-system level, but this is cumbersome and error-prone. The OS default is typically 2 hours of idle time before TCP keep-alive messages start, which is not aggressive enough to prevent infrastructure like the Azure load balancer from severing the connection.
Produce application level keep-alive messages
If you have a dummy endpoint you could just hit it periodically, but you would have to do it for all connections in the pool; see the sketch below. I didn't really explore this much, as it seems a bit like re-inventing the wheel, i.e. reproducing the concept of TCP keep-alive.
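For illustration only, a minimal sketch of that approach; the /health endpoint and the webClient instance are hypothetical, and each ping only keeps warm whichever pooled connection it happens to use, not the whole pool.

```java
import java.time.Duration;

import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class KeepAlivePinger {

    // Periodically hit a dummy endpoint so connections do not sit idle.
    // Caveat: each ping only exercises one pooled connection at a time.
    public static void start(WebClient webClient) {
        Flux.interval(Duration.ofMinutes(3))
                .flatMap(tick -> webClient.get()
                        .uri("/health")                    // hypothetical dummy endpoint
                        .retrieve()
                        .toBodilessEntity()
                        .onErrorResume(e -> Mono.empty())) // ignore ping failures
                .subscribe();
    }
}
```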
What we did
We ended up just using the max idle setting at the Reactor-Netty pool level and defaulting it to < 4 minutes (220 seconds) to cover our most common lowest denominator (Azure load balancers configured this way). Again, note that if the remote side actually sends a close notification this is generally not a problem. However, there is probably some rarer race condition where the remote side hits the idle timeout right as the client side tries to use the connection. You can find issue descriptions like this; it requires more exact timing to reproduce and so would be rarer, but it is probably good to prevent it by tossing these connections out pre-emptively.
If possible, though, I would also suggest changing the Azure load balancer setting to enable the TCP RST behavior so proper close notifications come through. In my case this was not infrastructure I controlled, and changing that setting would have been larger in scope and higher risk.
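For reference, wiring such a pool into WebClient might look roughly like the following sketch; ReactorClientHttpConnector comes from Spring WebFlux, and the pool name and base URL parameter are illustrative.

```java
import java.time.Duration;

import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

public class WebClientConfig {

    // Pool that discards connections after 220s of idle time, i.e. before an
    // Azure load balancer's 4-minute idle timeout can silently blackhole them.
    public static WebClient createWebClient(String baseUrl) {
        ConnectionProvider provider = ConnectionProvider.builder("azure-safe-pool")
                .maxIdleTime(Duration.ofSeconds(220))
                .build();
        HttpClient httpClient = HttpClient.create(provider);
        return WebClient.builder()
                .baseUrl(baseUrl)
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}
```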
I’m a developer facing this issue in 2023. I can see how load balancers not sending RST packets can create unpredictable behaviour. As of today Azure does offer a setting to enable sending RST packets, but one would still have to rely on keep-alive probes to extend/reset the load balancer's max idle time. But I’m wondering if Netty could close the channel and open a new one instead of closing and returning an exception… It does look like components that act as passthroughs do not send RST packets, and I would argue that 99.9% of the developers who use Netty are not aware that they could end up with this sort of unpredictable behaviour in the future if they plug in a load balancer. It would be more resilient behaviour from Netty if it could close and open a new connection in this specific case.
Hi guys,
I am also struggling with this, and I read through the comments but could not see the solution. Did anyone manage to explicitly close the WebClient connection before it is forcibly closed by the remote server?