reactor-netty: syscall:read(..) failed: Connection reset by peer

Actual behavior

    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.SerializationFeature;
    import org.springframework.context.annotation.Bean;
    import org.springframework.http.MediaType;
    import org.springframework.http.codec.json.Jackson2JsonDecoder;
    import org.springframework.http.codec.json.Jackson2JsonEncoder;
    import org.springframework.http.converter.json.Jackson2ObjectMapperBuilder;
    import org.springframework.web.reactive.function.client.ExchangeStrategies;
    import org.springframework.web.reactive.function.client.WebClient;

    // ReutersSetting, ReutersEndPoints and HEADER_APP_ID are application-specific.
    @Bean
    public WebClient webClient(ReutersSetting reutersSetting, ExchangeStrategies exchangeStrategies) {
        return WebClient.builder()
                .baseUrl(ReutersEndPoints.HOST)
                .defaultHeader(HEADER_APP_ID, reutersSetting.getApplicationId())
                .exchangeStrategies(exchangeStrategies)
                .build();
    }

    @Bean
    public ExchangeStrategies exchangeStrategies() {
        ObjectMapper mapper = objectMapper();
        return ExchangeStrategies
                .builder()
                .codecs(configurer -> {
                    // Use the customized ObjectMapper for both request and response bodies
                    configurer.defaultCodecs().jackson2JsonEncoder(new Jackson2JsonEncoder(mapper, MediaType.APPLICATION_JSON));
                    configurer.defaultCodecs().jackson2JsonDecoder(new Jackson2JsonDecoder(mapper, MediaType.APPLICATION_JSON));
                }).build();
    }

    public ObjectMapper objectMapper() {
        return Jackson2ObjectMapperBuilder
                .json()
                .failOnUnknownProperties(false)
                .featuresToEnable(SerializationFeature.WRAP_ROOT_VALUE)
                .featuresToEnable(DeserializationFeature.UNWRAP_ROOT_VALUE)
                .build();
    }

A WebClient initialized via the configuration above occasionally throws an exception such as the one below.

2019-01-07 11:33:22.188 ERROR [-,,,] 92270 --- [reactor-http-epoll-4] r.n.resources.PooledConnectionProvider   : [id: 0x6f488001, L:/xx.xx.xx.xx:53500 - R:api.trkd.thomsonreuters.com/xx.xx.xx.xx:443] Pooled connection observed an error

io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
        at io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)

The WebClient is called regularly from a scheduled task.

Steps to reproduce

Happens randomly.

Reactor Netty version

reactor-netty:0.8.3.RELEASE

JVM version (e.g. java -version)

openjdk version "11" 2018-09-25
OpenJDK Runtime Environment 18.9 (build 11+28)
OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)

OS version (e.g. uname -a)

Linux xxxx 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 12
  • Comments: 36 (9 by maintainers)

Most upvoted comments

Most of the issues here were caused by a proxy/server that closes connections after some timeout. Because of this we are going to expose a configuration property for switching the pool's lease strategy from FIFO to LIFO (FIFO is the default) #962

Using the LIFO leasing strategy together with a max idle timeout gives you the behaviour below (a configuration sketch follows this list):

  • When a connection is acquired, it will be the most recently used one.
    • If the max idle timeout is reached, this connection will be closed, and since it was the most recently used, all the rest (those that are not active) in the pool will also be closed because of the max idle timeout. A new connection will be created and used for the request.
    • If the connection is closed by the remote peer between acquire and actual usage, Connection reset by peer will be received and we will retry the request. As this connection was the most recently used and it was closed by the remote peer, all the rest (those that are not active) in the pool will also be closed, so again a new connection will be used for the second attempt.
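
A minimal sketch of that combination, assuming a reactor-netty version (0.9+) whose ConnectionProvider builder exposes maxIdleTime and lifo(). The pool name, connection count and timeout below are placeholder values; maxIdleTime should be tuned below the idle timeout of whatever proxy/load balancer sits in between:

import java.time.Duration;

import org.springframework.context.annotation.Bean;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;

import reactor.netty.http.client.HttpClient;
import reactor.netty.resources.ConnectionProvider;

@Bean
public WebClient lifoWebClient() {
    // Placeholder sizing: close idle connections before the remote side
    // (or an intermediary) can reset them, and lease the most recently
    // used connection first so stale ones drain out of the pool.
    ConnectionProvider provider = ConnectionProvider.builder("lifo-pool") // hypothetical pool name
            .maxConnections(50)
            .maxIdleTime(Duration.ofSeconds(20))
            .lifo()
            .build();

    return WebClient.builder()
            .clientConnector(new ReactorClientHttpConnector(HttpClient.create(provider)))
            .build();
}

Reportedly the lease strategy can also be switched globally via the reactor.netty.pool.leasingStrategy system property; check the docs for your version.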

I have the same issue in my Spring Boot project:

ERROR [reactor-http-epoll-1] [reactor.core.publisher.Operators] Operator called default onErrorDropped

io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
        at io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)

If I can help you with some additional debug info, feel free to ask. Waiting for a fix.

It was an ERROR on the client side. I understood the problem: I use Swarm to orchestrate my dockerized applications, and Swarm uses a load balancer (LB) to implement its routing mesh. The LB closes inactive connections. Enabling keep-alive for the channels was not enough, because it only kicked in after the connection had been closed. Implementing keep-alive at the application level for idle connections solved the problem.
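
For illustration only (not the commenter's actual code): application-level keep-alive can be as simple as a scheduled no-op request that keeps pooled connections from ever looking idle to the load balancer. The endpoint, interval and class name below are assumptions, and the application needs @EnableScheduling for the task to run:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;

@Component
public class WebClientKeepAlive {

    private final WebClient webClient;

    public WebClientKeepAlive(WebClient webClient) {
        this.webClient = webClient;
    }

    // Fire more often than the LB's idle timeout (assumed ~60s here).
    @Scheduled(fixedRate = 30_000)
    public void ping() {
        webClient.get()
                .uri("/health") // hypothetical lightweight endpoint
                .retrieve()
                .toBodilessEntity()
                .subscribe(ok -> { }, err -> { /* ignore; the next ping retries */ });
    }
}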

We still have a lot of problems due to an invalid haproxy configuration… This post helped us a lot to clear up most of the issues. Unfortunately we still need to deal with problems on the health check. I hope it helps.

I've set ChannelOption.SO_KEEPALIVE to false to make it go away, as follows:

import io.netty.channel.ChannelOption;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.http.client.reactive.ClientHttpConnector;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;
import reactor.netty.tcp.TcpClient;

@Bean
public WebClient webClient(final ClientHttpConnector clientHttpConnector) {
    return WebClient.builder()
        .clientConnector(clientHttpConnector)
        .build();
}

@Bean
public ClientHttpConnector clientHttpConnector(@Value("${webclient.enable-keep-alive}") final boolean keepAlive,
                                               @Value("${webclient.read-timeout-in-seconds}") final int readTimeout,
                                               @Value("${webclient.write-timeout-in-seconds}") final int writeTimeout) {
    return new ReactorClientHttpConnector(HttpClient.from(TcpClient.create()
        // TCP-level keep-alive, driven by the webclient.enable-keep-alive property
        .option(ChannelOption.SO_KEEPALIVE, keepAlive)
        // Fail fast instead of hanging on a connection the remote side dropped
        .doOnConnected(connection -> connection
            .addHandlerLast(new ReadTimeoutHandler(readTimeout))
            .addHandlerLast(new WriteTimeoutHandler(writeTimeout)))));
}

Please be aware that you are disabling keep-alive connections, and that may have an impact on latency. Also check whether there is anything between your client and the server you are trying to hit; I found that any proxy or load balancer in between can complicate your life.