reactor-netty: First WebClient request always fails after some idle time

Expected behavior

WebClient HTTP request succeeds after 5 minutes of idle time.

Actual behavior

WebClient HTTP request fails after 5 minutes of idle time. If a read timeout is set, I get a ReadTimeoutException. If it is not set, I get the following exception:

java.io.IOException: An existing connection was forcibly closed by the remote host
	at sun.nio.ch.SocketDispatcher.read0(Native Method) ~[na:1.8.0_181]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43) ~[na:1.8.0_181]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_181]
	at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_181]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_181]
	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) ~[netty-buffer-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125) ~[netty-buffer-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:617) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:534) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[netty-transport-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[netty-common-4.1.36.Final.jar:4.1.36.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.36.Final.jar:4.1.36.Final]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]

Steps to reproduce

  1. Start the application: https://github.com/martin-tarjanyi/webclient-bug/blob/master/src/main/java/com/example/webclientbug/WebclientBugApplication.java
  2. Call http://localhost:8080/get - request succeeds
  3. Wait ~5 minutes
  4. Call http://localhost:8080/get - request fails
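
For context, a minimal sketch of the kind of controller involved (an illustration only, not the code from the linked project; the class name and target URL are assumptions):

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@RestController
class GetController {

    // Default WebClient backed by Reactor Netty's shared connection pool
    private final WebClient webClient = WebClient.create("https://example.org");

    @GetMapping("/get")
    Mono<String> get() {
        // The first call after ~5 minutes of idle time fails with the exception above
        return webClient.get().retrieve().bodyToMono(String.class);
    }
}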

Reactor Netty version

0.8.9-RELEASE

JVM version (e.g. java -version)

1.8.0_181

OS version (e.g. uname -a)

Windows 10

Related issue

https://stackoverflow.com/questions/55621800/springboot-webclient-throws-an-existing-connection-was-forcibly-closed-by-the-r

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21 (10 by maintainers)

Most upvoted comments

Hey all, just dropping some details about my experience with this. There are a handful of related issues in the reactor-netty repo that I dug up over time; I think I will go through them and link this comment for visibility.

Issue Details

Specifically, the problem I was running into was that the application (which uses reactor-netty) was being routed through an Azure load balancer. The Azure load balancer has some questionable defaults where it:

  1. Sets the max idle time to 4 minutes (alright that’s fine)
  2. By default does not send a TCP RST (close notification) when the idle time is hit

My main issue is with #2. You can read more about enabling this setting in the Azure documentation here, but it seems it is not the default (legacy reasons? the feature apparently wasn’t available until around 2018).

When this flag is disabled the behavior is as follows:

  1. The connection sits idle for 4 minutes and Azure closes it. It does not notify the other side of the connection, so the client application (using reactor-netty) is unaware that the close took place.
  2. On the next request Reactor Netty tries to use this stale connection. In our case, what happened at this point is that we saw a bunch of TCP retransmissions (I think Azure is blackholing the connection rather than sending an immediate close, which is also problematic). This caused the connection to hang for 5-20 seconds before giving up and throwing an exception (also closing/discarding the connection on the client side).

When this flag is enabled the behavior is as follows:

  1. The connection sits idle for 4 minutes and Azure closes it. Both sides of the connection are notified of the close. Reactor Netty sees this close and presumably discards the connection from the pool.
  2. On the next request Reactor Netty asks for a connection from the pool. At this point either there’s a pre-existing live connection in the pool or a new one is created. Either way, it operates as expected without issue.

So really for us the problematic part was Azure never sending a close notification, so the client never knew the connection was stale (the blackhole behavior rather than an immediate close wasn’t great either; we had application-level retries that covered this, but it still introduced a delay in API calls).

Resolution

For our version of this issue it could be resolved in a couple of different ways, but with caveats:

Use Reactor Netty connection max idle time

You can configure a max idle time that is lower than the lowest max idle time of the remote network appliances. Here are some common ones we’ve seen:

  • Azure load balancer default idle timeout (configurable) = 4 minutes
  • AWS application load balancer default timeout (configurable) = 1 minute (we’ve typically seen this increased though)
  • AWS network load balancer default timeout (not configurable) = 350 seconds
  • AWS Global Accelerator default timeout (not configurable) = 340 seconds

You can configure the Reactor Netty ConnectionProvider with the desired max idle time (adapting the other settings as you want), e.g.:

import java.time.Duration;
import reactor.netty.resources.ConnectionProvider;

ConnectionProvider connectionProvider = ConnectionProvider.builder("my-http-client")
                .maxIdleTime(Duration.ofSeconds(220)) // below the lowest idle timeout in the network path (e.g. Azure's 4 minutes)
                // rest of settings
                .build();
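
For reference, a minimal sketch of wiring this provider into a Spring WebClient (assuming Spring Framework 5.1+, which provides the ReactorClientHttpConnector(HttpClient) constructor):

import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

// Build an HttpClient on top of the tuned connection pool and hand it to WebClient
HttpClient httpClient = HttpClient.create(connectionProvider);
WebClient webClient = WebClient.builder()
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();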

Caveats

As you can see, the idle timeout varies depending on the provider, and this only lists common cloud infrastructure. So it’s hard to pick a good default without making it overly aggressive, i.e. setting it to < 60 seconds. Setting it aggressively means you’re potentially establishing new connections more often than needed, i.e. more SSL handshakes etc. See the next section about using TCP keep-alive.

Use TCP keep-alive + idle time options

Alternatively (or additionally) you can use TCP keep-alive and set TCP_KEEPIDLE to something aggressive enough that a TCP keep-alive message is sent before the idle timeout of your infrastructure occurs. I.e. in the Azure case, if we broadcast a keep-alive after < 4 minutes of idle time, this prevents Azure from closing the connection. There is an example of these settings in the Reactor Netty documentation here; a rough sketch is also shown below. Note it may make sense to call out in the documentation that this isn’t supported on Windows (beyond just needing Java 11) - I can submit a PR for this (edit: see https://github.com/reactor/reactor-netty/pull/1981).
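
A rough sketch of those settings (assuming a recent Reactor Netty 1.x where HttpClient.option(...) is exposed directly, the NIO transport on Java 11+, and illustrative timer values; on 0.8.x the options would go through tcpConfiguration(...)):

import io.netty.channel.ChannelOption;
import io.netty.channel.socket.nio.NioChannelOption;
import jdk.net.ExtendedSocketOptions;
import reactor.netty.http.client.HttpClient;

HttpClient httpClient = HttpClient.create()
                .option(ChannelOption.SO_KEEPALIVE, true)                                // enable TCP keep-alive
                // The options below require Java 11+ NIO (or the Epoll equivalents) and only work on Linux/macOS
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 120)    // idle seconds before the first probe (< the 4-minute LB timeout)
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 30) // seconds between probes
                .option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 8);    // unanswered probes before the connection is dropped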

Caveats

Configuring these TCP keep-alive options is only supported on:

  1. Java 11+ (technically I think this got backported to newer versions of Java 8, but this shouldn’t be relied on)
  2. OSX or Linux only, NOT Windows

For more information see the JDK change that added this here. If you look at the resolution this was only added for OSX and Linux. It is still not supported for Windows as of Java 17.

You could enable TCP keep-alive generically (which is supported on all platforms, even in Java 8) and then configure TCP_KEEPIDLE (the idle time before a keep-alive message is issued) at the operating system level, but this is cumbersome/error-prone. The OS default is typically 2 hours of idle time before TCP keep-alive messages start, which is not aggressive enough to prevent infrastructure like the Azure load balancer from severing the connection.

Produce application-level keep-alive messages

If you have a dummy endpoint you could just hit it periodically, but you would have to do it for all connections in the pool (see the sketch below). I didn’t really explore this much; it seems a bit like re-inventing the wheel, i.e. reproducing the concept of TCP keep-alive.
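
For illustration only, a sketch of such a ping loop (the /health endpoint is hypothetical, and webClient is the client built above; each tick exercises only whichever connection the pool hands out, so this does not guarantee every pooled connection stays warm):

import java.time.Duration;
import reactor.core.publisher.Flux;

// Periodically hit a dummy endpoint so a connection does not sit idle past the LB timeout.
Flux.interval(Duration.ofMinutes(3))
        .flatMap(tick -> webClient.get()
                .uri("/health")            // hypothetical no-op endpoint
                .retrieve()
                .toBodilessEntity())
        .subscribe();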

What we did

We ended up just using the max idle setting at the Reactor Netty pool level and defaulting it to < 4 minutes (220 seconds) to cover our most common lowest denominator (Azure load balancers configured this way). Again, note that if the remote side actually sends a close notification, this generally is not a problem. However, there is probably some rarer race condition where the remote side hits the idle timeout right as the client side tries to use the connection. You can find some issue descriptions like this; it requires more exact timing and would be rarer to reproduce, though it is probably still good to prevent by tossing these connections out pre-emptively.

If possible, though, I would also suggest changing the Azure load balancer setting to enable the TCP RST behavior so proper close notifications come through. In my case this was not infrastructure I controlled, and changing that setting would have been larger in scope/higher risk.

I’m a developer facing this issue in 2023. I can see how load balancers not sending RST packets can create unpredictable behaviour. As of today Azure does offer a setting to enable sending RST packets, but one would still have to rely on keep-alive probes to extend/reset the load balancer’s max idle time. But I’m wondering whether netty could close the stale channel and open a new one instead of closing it and returning an exception… It does look like components that act as passthroughs do not send RST packets, and I would argue that 99.9% of developers who use netty are not aware that they could end up with this sort of unpredictable behaviour in the future if they plug in a load balancer. It would be more resilient behaviour from netty if it could close and open a new connection in this specific case.

Hi Guys

I am also struggling with this, and I read through the comments but could not see the solution. Did anyone manage to explicitly close the WebClient connection before it is forcibly closed by the remote server?