lettuce-core: Lettuce cannot recover from connection problems

Bug Report

Current Behavior

While troubleshooting production issues with Lettuce and Redis Cluster, we discovered that Pub/Sub subscriptions are not re-established after certain network problems.

Lettuce does not send any keep-alive packets on TCP connections dedicated to Pub/Sub subscriptions. Without keep-alives, in the rare case of a sudden connection loss to a Redis node, Lettuce cannot detect that the connection is no longer working. With the default OS configuration it waits for hours until the OS closes the connection. In the meantime, all messages published to the channel are lost.

Input Code

The minimal example from the Lettuce docs is enough to reproduce the issue.

        RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2, node3));

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(15))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(topologyRefreshOptions)
                .build());

        StatefulRedisPubSubConnection<String, String> connection = clusterClient.connectPubSub();
        connection.addListener(new RedisPubSubListener<String, String>() { ... } );

        RedisPubSubCommands<String, String> sync = connection.sync();
        sync.subscribe("broadcast");

To reproduce the issue:

  • Start Redis Cluster.
  • Connect to the cluster and subscribe to the channel using the code above.
  • Find out which server the client is connected to, using tcpdump or redis-cli PUBSUB CHANNELS *.
  • Block all network traffic on that server using iptables (killing the Redis process is not enough: the OS will send FIN packets, and Lettuce will detect the problem and recover the subscription).
  • Redis Cluster will recover by promoting one of the replicas to master.
  • Lettuce will not detect that the connection is no longer working, and it won’t receive messages published to the channel. The OS will close the unused connection after a couple of hours, and only then might Lettuce be able to fix the problem.

We’ve been able to reproduce the issue with Redis Standalone as well:

  • Connect to Pub/Sub using Lettuce.
  • Kill traffic on the master using iptables. Restart the VM running Redis and restore traffic.
  • Lettuce does not detect the problem and keeps listening on a dead connection.

Expected behavior/code

Lettuce should detect a broken connection so that it can re-establish Pub/Sub subscriptions.

Environment

  • Lettuce version(s): 5.3.4.RELEASE
  • Redis version: 5.0.5

Possible Solution

We ran similar tests using the redis-cli client. The official client sends keep-alive packets every 15 seconds and is able to detect connection loss.

It would be best if Lettuce could send keep-alive packets on a Pub/Sub connection to detect network problems. That should enable Lettuce to re-establish Pub/Sub subscriptions.

Workarounds

We’ve found a workaround for this problem by tweaking OS parameters (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), but we would like to avoid changing OS parameters on every machine that uses Lettuce as a Redis client.
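
Before changing anything, it can help to see what a given machine is currently configured with. Here is a minimal Java sketch, assuming a Linux host that exposes these three settings as files under /proc/sys/net/ipv4/. The class and method names (KeepAliveSysctl, parse, read) are made up for illustration and are not part of Lettuce.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

// Hypothetical helper, not part of Lettuce: prints the kernel keep-alive settings.
final class KeepAliveSysctl {

    // Assumption: a Linux host exposing the settings as files in this directory.
    private static final Path BASE = Paths.get("/proc/sys/net/ipv4");

    // Parses the content of a sysctl file, e.g. "7200\n" becomes 7200.
    static int parse(String content) {
        return Integer.parseInt(content.trim());
    }

    // Returns the value, or empty if the file is missing or unreadable (e.g. not Linux).
    static OptionalInt read(String name) {
        try {
            byte[] raw = Files.readAllBytes(BASE.resolve(name));
            return OptionalInt.of(parse(new String(raw, StandardCharsets.UTF_8)));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String[] names = {"tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"};
        for (String name : names) {
            OptionalInt value = read(name);
            System.out.println(name + " = "
                    + (value.isPresent() ? String.valueOf(value.getAsInt()) : "unavailable"));
        }
    }
}
```

Keep in mind that these kernel settings only affect sockets that have SO_KEEPALIVE enabled, so changing them alone may not be enough.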

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 20 (3 by maintainers)

Most upvoted comments

Thank you for the hint.

I’ve managed to fix the problem by adding netty-transport-native-epoll to the classpath and configuring Netty:

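import io.lettuce.core.ClientOptions;
import io.lettuce.core.RedisClient;
import io.lettuce.core.SocketOptions;
import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.NettyCustomizer;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.epoll.EpollChannelOption;

// Enable SO_KEEPALIVE; the TCP_KEEP* options below only tune its timing.
SocketOptions socketOptions = SocketOptions.builder()
	.keepAlive(true)
	.build();

ClientResources clientResources = ClientResources.builder()
	.nettyCustomizer(new NettyCustomizer() {
		@Override
		public void afterBootstrapInitialized(Bootstrap bootstrap) {
			// Idle time and probe interval are in seconds; the last value is a probe count.
			bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
			bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
			bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
		}
	})
	.build();

RedisClient client = RedisClient.create(clientResources, node);
// setOptions expects ClientOptions, so the SocketOptions are wrapped here.
client.setOptions(ClientOptions.builder().socketOptions(socketOptions).build());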

I also submitted a bug to Redis: https://github.com/redis/redis/issues/7855 because we think this should be documented better. Without the code above, Pub/Sub will not work correctly after network issues. It was quite challenging to reproduce and troubleshoot this issue.
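
To put the three numbers above in perspective: a rough estimate of how long a silently dead, idle connection can go unnoticed is idle time + interval × probe count. Below is a small sketch of that arithmetic. KeepAliveBudget and detectionTime are hypothetical names, and the 7200 / 75 / 9 figures are the stock Linux kernel defaults, which may differ on your hosts.

```java
import java.time.Duration;

// Hypothetical helper: estimates how long an idle, dead connection stays undetected.
final class KeepAliveBudget {

    // First probe after idleSeconds, then probeCount probes spaced intervalSeconds apart.
    static Duration detectionTime(int idleSeconds, int intervalSeconds, int probeCount) {
        return Duration.ofSeconds(idleSeconds + (long) intervalSeconds * probeCount);
    }

    public static void main(String[] args) {
        // Values from the Netty configuration above.
        System.out.println("tuned:          " + detectionTime(15, 5, 3));
        // Assumption: stock Linux defaults (tcp_keepalive_time/intvl/probes).
        System.out.println("Linux defaults: " + detectionTime(7200, 75, 9));
    }
}
```

With 15 / 5 / 3 the estimate is about 30 seconds, versus roughly 2 hours 11 minutes with the stock defaults, which matches the "hours" observed in the report. The estimate only covers an idle connection and is not an exact bound.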

Forgot to update: we finally fixed this by adding TCP_USER_TIMEOUT as well (i.e. a socket timeout).

The final add-on code looks something like this:

ClientResources clientResources = ClientResources.builder()
  .nettyCustomizer(new NettyCustomizer() {
    @Override
    public void afterBootstrapInitialized(Bootstrap bootstrap) {
      bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
      bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
      bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
      // Socket Timeout (milliseconds)
      bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 60000);
    }
  })
  .build();
// Enable keep-alive (SO_KEEPALIVE); the TCP_KEEP* options above only tune its timing
SocketOptions socketOptions = SocketOptions.builder()
  .keepAlive(true)
  .build();
ClientOptions clientOptions = ClientOptions.builder()
  .socketOptions(socketOptions)
  .build();

We have not seen the “15 mins connection timeout issue” for over 7 days now. You can try it out as well and see if it works for you. Cheers!

@NgSekLong Which JDK version are you using? I seem to be getting an error with JDK 8.

Very interesting. I found another way to fix this problem: http://libkeepalive.sourceforge.net/

LD_PRELOAD=/the/path/libkeepalive.so \
  KEEPCNT=20 \
  KEEPIDLE=180 \
  KEEPINTVL=60 \
  java -jar /your/path/yourapp.jar &

Just wanted to note that this isn’t only a Pub/Sub issue, even in Redis:

https://github.com/redis/redis/issues/7855#issuecomment-701212833

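It seems it’s possible to make this work without the native EPOLL library, using the default NIO transport (ExtendedSocketOptions here is jdk.net.ExtendedSocketOptions, whose keep-alive constants are available from JDK 11 onward):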

bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPIDLE), 15);
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPINTERVAL), 5);
bootstrap.option(NioChannelOption.of(ExtendedSocketOptions.TCP_KEEPCOUNT), 3);
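
Whether those options are actually available depends on the JDK and the OS. Here is a minimal sketch for checking this on a plain java.net.Socket, with no Netty or Lettuce involved, assuming JDK 11 or newer. KeepAliveProbe and tune are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

// Hypothetical check: can this JDK/OS combination tune TCP keep-alive per socket?
final class KeepAliveProbe {

    // Enables SO_KEEPALIVE, then applies the extended options if the platform
    // supports all three. Returns true only when the extended options were set.
    static boolean tune(Socket socket, int idleSeconds, int intervalSeconds, int probeCount)
            throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
        Set<SocketOption<?>> supported = socket.supportedOptions();
        if (!supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false;
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSeconds);
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probeCount);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // An unconnected socket is enough to probe option support.
        try (Socket socket = new Socket()) {
            boolean tuned = tune(socket, 15, 5, 3);
            System.out.println("extended keep-alive options applied: " + tuned);
        }
    }
}
```

If this reports false, the NIO variant above will probably not help on that platform, and the epoll transport or OS-level settings remain the alternatives mentioned earlier in the thread.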