go-redis: Retries may be done using broken connections from the pool
Expected Behavior
When the Redis server is restarted, I’d expect go-redis, given the default of MaxRetries: 3, to reconnect to the server and retry the query.
Current Behavior
If there are enough old connections in the connection pool, all retry attempts are done on broken connections and an error is returned without even attempting to reconnect.
Possible Solution
Some options (a rough application-level workaround is sketched after this list):
- Perform a ping on the connection before reusing it (adds latency if the connection is fine)
- Don’t count retries on connections from the pool towards the retry limit (might lead to a huge number of retries for a single query)
- Clear the pool on errors (might be overly pessimistic)
- Always use a fresh connection for retries
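For illustration, here is a rough application-level workaround along the lines of the icinga-go-library commit listed under “Commits related to this issue” below: retry a query once more often than there are connections in the pool, so that the final attempt very likely has to dial a fresh connection. The helper retryQuery is my own sketch and not part of go-redis, and it ignores that go-redis’s internal retries may themselves consume several pooled connections per attempt:

package main

import (
	"context"
	"log"

	"github.com/go-redis/redis/v8"
)

// retryQuery is a hypothetical helper, not part of go-redis: it retries fn
// once more often than there are connections in the pool, so that even if
// every idle pooled connection is broken, the last attempt very likely
// dials a fresh connection.
func retryQuery(ctx context.Context, client *redis.Client, fn func(context.Context) error) error {
	attempts := client.Options().PoolSize + 1
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	ctx := context.Background()
	client := redis.NewClient(&redis.Options{}) // Addr defaults to localhost:6379

	if err := retryQuery(ctx, client, func(ctx context.Context) error {
		return client.Ping(ctx).Err()
	}); err != nil {
		log.Print(err)
	}
}

This is a blunt instrument, but it guarantees progress past a pool full of dead sockets without touching go-redis internals.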
Steps to Reproduce
The following code reproduces the issue. Instead of restarting the server, it sets a dialer that simply closes each connection after 5 seconds.
package main

import (
	"context"
	"log"
	"net"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()

	// Redis client with a dialer that kills connections after 5 seconds (to simulate a server restart).
	client := redis.NewClient(&redis.Options{
		Dialer: func(ctx context.Context, network, addr string) (net.Conn, error) {
			log.Print("Dialer called")
			conn, err := net.Dial(network, addr)
			if err == nil {
				go func() {
					time.Sleep(5 * time.Second)
					conn.Close()
				}()
			}
			return conn, err
		},
	})

	// Function to ping the server and log errors if they occur.
	ping := func() {
		_, err := client.Ping(ctx).Result()
		if err != nil {
			log.Print(err)
		}
	}

	// Perform some pings to fill the connection pool.
	for i := 0; i < 10; i++ {
		go ping()
	}

	// Wait for connections to die.
	time.Sleep(10 * time.Second)

	// Perform another ping and log pool stats before and after.
	log.Printf("%#v", client.PoolStats())
	ping()
	log.Printf("%#v", client.PoolStats())
}
Example output (note that the dialer is not called for the last ping and the hit count in the pool stats increases by 4, i.e. the initial attempt and all 3 retries were done using stale connections from the pool):
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:40:50 Dialer called
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:41:00 set tcp 127.0.0.1:45526: use of closed network connection
2021/04/23 13:41:00 &redis.PoolStats{Hits:0x4, Misses:0xa, Timeouts:0x0, TotalConns:0x6, IdleConns:0x6, StaleConns:0x0}
You get similar behavior when you restart Redis at the right time instead of using this dialer. In this case, the error will be EOF instead.
When adding MaxRetries: 20 to the redis.Options (sketched after the output below), the last 3 lines of the output look like this instead (note that there were 10 hits and then 1 miss that called the dialer):
2021/04/23 13:48:46 &redis.PoolStats{Hits:0x0, Misses:0xa, Timeouts:0x0, TotalConns:0xa, IdleConns:0xa, StaleConns:0x0}
2021/04/23 13:48:49 Dialer called
2021/04/23 13:48:49 &redis.PoolStats{Hits:0xa, Misses:0xb, Timeouts:0x0, TotalConns:0x1, IdleConns:0x1, StaleConns:0x0}
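For reference, a sketch of that configuration change, assuming dialer is the connection-killing Dialer function from the reproduction above:

client := redis.NewClient(&redis.Options{
	Dialer: dialer, // the connection-killing dialer from the reproduction above
	// More retries than there are connections in the pool, so at least one
	// attempt falls through to the dialer instead of a stale pooled connection.
	MaxRetries: 20,
})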
Context (Environment)
I want my application to gracefully handle Redis restarts and had hoped that go-redis would handle this automatically due to the default of MaxRetries: 3.
About this issue
- State: open
- Created 3 years ago
- Reactions: 9
- Comments: 18 (12 by maintainers)
Commits related to this issue
- Introduce internal/pool.ConnReuseStrategy refs #1737 — committed to Al2Klimov/redis by Al2Klimov 3 years ago
- config.Redis#NewClient(): work-around go-redis/redis#1737 ... by re-trying once more often than there are connections in the pool. — committed to Icinga/icinga-go-library by Al2Klimov 3 years ago
Maybe I don’t get the purpose of the retry functionality in go-redis. I’d expect it to handle situations like a Redis restart where all existing connections fail simultaneously, so that I don’t have to handle this in my application code. The current implementation leads to errors returned from go-redis in situations where I would not expect them.
That’s only the case if you perform lots of concurrent operations on the pool. In situations where only a few goroutines interact with Redis, it is quite likely that go-redis returns an error even though an attempt on a fresh connection would have worked perfectly fine. Reducing the pool size is not an option if the number of concurrent Redis queries changes over time.
And the last suggested solution?
If and only if something went wrong and we have to retry, do it with a new connection to avoid the risk of this issue?
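To make that concrete, here is a purely illustrative sketch of such a reuse policy. The names are hypothetical and this is not the actual go-redis pool code; the ConnReuseStrategy commit listed above goes in this direction:

package pool

// connSource is a hypothetical description of where an attempt gets its
// connection from; it is not part of go-redis.
type connSource int

const (
	fromPoolOrDial connSource = iota // idle pool first, dial on a pool miss
	dialOnly                         // skip the idle pool, always dial fresh
)

// sourceForAttempt forces retries onto freshly dialed connections: the
// first attempt may reuse a pooled connection, but once something has
// gone wrong, risking another stale pooled connection is not worth it.
func sourceForAttempt(attempt int) connSource {
	if attempt == 0 {
		return fromPoolOrDial
	}
	return dialOnly
}

Compared to pinging before every reuse, the happy path stays free of extra round trips, and a server restart costs at most one failed attempt before a fresh connection is dialed.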