go-redis: Retries may be done using broken connections from the pool

Expected Behavior

When restarting the Redis server, due to the default of MaxRetries: 3, I’d expect go-redis to attempt to reconnect to the server and retry the query.

Current Behavior

If there are enough old connections in the connection pool, all retry attempts are done on broken connections and an error is returned without even attempting to reconnect.

Possible Solution

Some options:

  1. Perform a ping on the connection before reusing it (adds latency if the connection is fine)
  2. Don’t count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
  3. Clear the pool on errors (might be overly pessimistic)
  4. Always use a fresh connection for retries (see the sketch below)
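
To make the last option concrete, here is an illustrative sketch. The pool interface and doWithFreshRetries function are hypothetical stand-ins for go-redis internals, not the actual API; they only show the shape of the idea: the first attempt may come from the pool, but every retry dials a brand-new connection (connection cleanup is omitted for brevity).

package sketch

import "net"

// pool is a hypothetical stand-in for go-redis's internal connection pool,
// used only to illustrate the idea; it is not the real API.
type pool interface {
	Get() (net.Conn, error)  // may return an idle (possibly broken) connection
	Dial() (net.Conn, error) // always opens a brand-new connection
}

// doWithFreshRetries sketches option 4: the first attempt may use a pooled
// connection, but every retry bypasses the pool and uses a fresh dial.
func doWithFreshRetries(p pool, maxRetries int, op func(net.Conn) error) error {
	conn, err := p.Get()
	if err == nil {
		if err = op(conn); err == nil {
			return nil
		}
	}
	for i := 0; i < maxRetries; i++ {
		fresh, dialErr := p.Dial()
		if dialErr != nil {
			err = dialErr
			continue
		}
		if err = op(fresh); err == nil {
			return nil
		}
	}
	return err
}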

Steps to Reproduce

The following code reproduces the issue. Instead of restarting the server, it sets a dialer that simply closes each connection after 5 seconds.

package main

import (
	"context"
	"log"
	"net"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Redis client with a dialer that kills connections after 5 seconds (to simulate a server restart).
	client := redis.NewClient(&redis.Options{
		Dialer: func(ctx context.Context, network, addr string) (net.Conn, error) {
			log.Print("Dialer called")
			conn, err := net.Dial(network, addr)
			if err == nil {
				go func() {
					time.Sleep(5 * time.Second)
					conn.Close()
				}()
			}
			return conn, err
		},
	})

	// Function to ping the server and log errors if they occur.
	ping := func() {
		_, err := client.Ping(ctx).Result()
		if err != nil {
			log.Print(err)
		}
	}

	// Perform some pings to fill the connection pool.
	for i := 0; i < 10; i++ {
		go ping()
	}

	// Wait for connections to die.
	time.Sleep(10 * time.Second)

	// Perform another ping and log pool stats before and after.
	log.Printf("%#v", client.PoolStats())
	ping()
	log.Printf("%#v", client.PoolStats())
}

Example output (note that the dialer is not called for the last ping and the hit count in the pool stats increases by 4, i.e. the initial attempt and all 3 retries were done using stale connections from the pool):

2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:41:00 set tcp 127.0.0.1:45526: use of closed network connection
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x4, Misses:0xa, Timeouts:0x0, TotalConns:0x6, IdleConns:0x6, StaleConns:0x0}

You get similar behavior when you restart Redis at the right time instead of using this dialer. In this case, the error will be EOF instead.

When adding MaxRetries: 20 to the redis.Options, the last 3 lines of the output look like this instead (note that there were 10 hits and then 1 miss that called the dialer):

2021/04/23 13:48:46 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:48:49 Dialer called
2021/04/23 13:48:49 &redis.PoolStats{Hits:0xa, Misses:0xb, Timeouts:0x0, TotalConns:0x1, IdleConns:0x1, StaleConns:0x0}
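
This suggests a possible workaround (an inference from the output above, not an officially documented recommendation): make MaxRetries at least as large as the pool, so a fully stale pool can be drained and at least one attempt falls through to a fresh dial. Retry backoff still applies between attempts, so failover is not instantaneous. A minimal sketch, reusing the imports from the reproduction above (the function name is just for illustration):

// Workaround sketch: a fully stale pool holds at most PoolSize broken
// connections, so the initial attempt plus PoolSize retries leaves at least
// one attempt that has to dial a fresh connection.
func newResilientClient() *redis.Client {
	const poolSize = 10
	return redis.NewClient(&redis.Options{
		Addr:       "localhost:6379",
		PoolSize:   poolSize,
		MaxRetries: poolSize, // enough retries to drain a fully stale pool
	})
}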

Context (Environment)

I want my application to gracefully handle Redis restarts and had hoped that this was handled automatically by go-redis due to the default of MaxRetries: 3.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 9
  • Comments: 18 (12 by maintainers)

Most upvoted comments

Maybe I don’t get the purpose of the retry functionality in go-redis. I’d expect it to handle situations like a Redis restart where all existing connections fail simultaneously, so that I don’t have to handle this in my application code. The current implementation leads to errors returned from go-redis in situations where I would not expect them.

In practice even that is an exaggeration, because there is a small delay before the command is retried, so it is unlikely that the command will be retried using bad connections.

That’s only the case if you perform lots of concurrent actions on the pool. In situations where only a few goroutines interact with Redis, you are quite likely to receive an error from go-redis even though an attempt on a fresh connection would have worked perfectly fine. Reducing the pool size is not an option if the number of concurrent Redis queries changes over time.

And the last suggested solution?

Always use a fresh connection for retries

If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
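
While that is being discussed, one application-level stopgap (a sketch of my own, not a go-redis API) is an outer retry wrapper that repeats a command only on connection-level failures such as io.EOF or a *net.OpError, relying on the pool discarding broken connections so that a later attempt eventually dials fresh. It assumes the errors, io and net packages in addition to the go-redis client:

// retryOnConnError is a hypothetical helper, not part of go-redis: it reruns
// op when the error looks like a broken connection (EOF or a net.OpError such
// as "use of closed network connection"), assuming the pool discards bad
// connections so a later attempt eventually gets a fresh dial.
func retryOnConnError(attempts int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		var opErr *net.OpError
		if !errors.Is(err, io.EOF) && !errors.As(err, &opErr) {
			return err // not a connection-level failure, don't keep retrying
		}
	}
	return err
}

Used, for example, as retryOnConnError(3, func() error { return client.Ping(ctx).Err() }).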