quarkus: Redis cache: Failed to connect to all nodes of the cluster
Describe the bug
The Redis cache implementation fails to work in cluster mode. Under load, it throws an error: Failed to connect to all nodes of the cluster. The error originates in the Vert.x Redis client here: https://github.com/vert-x3/vertx-redis-client/blob/4.4.6/src/main/java/io/vertx/redis/client/impl/RedisClusterClient.java#L202
Expected behavior
Cluster mode works.
Actual behavior
Cluster mode throws an exception without stacktrace.
ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-2) HTTP Request to /cat-fact failed, error id: 1cd9b164-ba80-4118-b675-ff5cfdfd93eb-869: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster
Setting quarkus.redis.max-pool-size to a higher value seems to postpone the error, but it will still eventually fail.
How to Reproduce?
- Clone repo: https://github.com/bartm-dvb/quarkus-redis-bug
- Start Redis in cluster mode with docker-compose up
- Start the application with ./mvnw quarkus:dev
- Start load test with jMeter, configuration file is in the repository.
After about 30 seconds of load testing, you should see
ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-2) HTTP Request to /cat-fact failed, error id: 1cd9b164-ba80-4118-b675-ff5cfdfd93eb-869: io.vertx.core.impl.NoStackTraceThrowable: Failed to connect to all nodes of the cluster
Output of uname -a or ver
No response
Output of java -version
openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment Temurin-17.0.7+7 (build 17.0.7+7)
OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)
Quarkus version or git rev
3.5.1
Build tool (ie. output of mvnw --version or gradlew --version)
Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)
Additional information
No response
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Reactions: 1
- Comments: 19 (15 by maintainers)
Yes, the Vert.x Redis client intentionally doesn’t implement reconnect on error (see https://vertx.io/docs/vertx-redis-client/java/#_implementing_reconnect_on_error). We should probably implement something like that in Quarkus. Please file a feature request.
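For anyone following along, here is a minimal sketch of what "reconnect on error" with exponential backoff could look like. It uses only the JDK: `ReconnectSketch`, `connectWithRetry`, `backoffMillis` and `MAX_RETRIES` are hypothetical names for illustration, not Vert.x or Quarkus APIs, and the blocking `Thread.sleep` stands in for what would have to be a non-blocking timer on a Vert.x event loop.

```java
import java.util.concurrent.Callable;

public class ReconnectSketch {
    // Hypothetical retry limit, chosen only for this illustration.
    static final int MAX_RETRIES = 5;

    // Exponential backoff: 10 ms * 2^retry, capped at 10240 ms (retry >= 10).
    static long backoffMillis(int retry) {
        return 10L << Math.min(retry, 10);
    }

    // Calls connect until it succeeds or MAX_RETRIES retries have failed,
    // sleeping for the backoff interval between attempts.
    static <T> T connectWithRetry(Callable<T> connect) throws Exception {
        Exception last = null;
        for (int retry = 0; retry <= MAX_RETRIES; retry++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (retry < MAX_RETRIES) {
                    // Blocking sleep keeps the sketch simple; do not block an event loop like this.
                    Thread.sleep(backoffMillis(retry));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulated connection factory that fails twice, then succeeds.
        String result = connectWithRetry(() -> {
            if (++attempts[0] < 3) {
                throw new IllegalStateException("connection refused");
            }
            return "connected";
        });
        System.out.println(result + " after " + attempts[0] + " attempts"); // prints "connected after 3 attempts"
    }
}
```

The cap on the backoff keeps the wait bounded while the cluster is down, and the retry limit ensures the caller eventually sees the underlying error instead of retrying forever.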
I think there is an issue with reconnects here; let me know if I should file a new issue:
// manual reproduction that consistently reproduces the failure to reconnect:
2023-11-17 22:27:36,551 ERROR [io.qua.ver.htt.run.QuarkusErrorHandler] (vert.x-eventloop-thread-1) HTTP Request to /cat-fact failed, error id: 736e1d37-7d84-43fb-8e37-2d4d34f4eda6-11: io.vertx.core.impl.NoStackTraceThrowable: Cannot connect to any of the provided endpoints
My theory is that when all endpoints of the cluster are down, the slots / endpoints being saved are incorrect and getSlots is always called with index >= endpoints.size.
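To make the theory concrete, here is a simplified model of an "iterate over endpoints, stop when the index runs past the list" pattern. This is not the Vert.x client code: `EndpointFallbackSketch`, `connect` and the `reachable` predicate are hypothetical, and the sketch only assumes that the stop condition is what produces the message in the log above.

```java
import java.util.List;
import java.util.function.Predicate;

public class EndpointFallbackSketch {
    // Tries endpoints.get(index) and falls through to the next endpoint on failure.
    // Once index >= endpoints.size() there is nothing left to try.
    static String connect(List<String> endpoints, int index, Predicate<String> reachable) {
        if (index >= endpoints.size()) {
            throw new IllegalStateException("Cannot connect to any of the provided endpoints");
        }
        String endpoint = endpoints.get(index);
        return reachable.test(endpoint) ? endpoint : connect(endpoints, index + 1, reachable);
    }

    public static void main(String[] args) {
        // Healthy case: the first node is down, the second one answers.
        List<String> endpoints = List.of("redis://node-1:6379", "redis://node-2:6379");
        System.out.println(connect(endpoints, 0, e -> e.contains("node-2"))); // prints "redis://node-2:6379"

        // Suspected failure mode: if the saved endpoint list ends up empty,
        // the stop condition fires even though every node is reachable again.
        try {
            connect(List.of(), 0, e -> true);
        } catch (IllegalStateException ex) {
            System.out.println(ex.getMessage()); // prints "Cannot connect to any of the provided endpoints"
        }
    }
}
```

If the real client behaves like this model, recovering would require refreshing the saved endpoints from the original configuration rather than from the last (failed) slot lookup.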
PRs to Vert.x:
PR to Quarkus:
I can’t see anything else we could do here.
We got a report for something similar when there is a DNS issue. It looks like the connections are not released after an error. I believe the issue is not in Quarkus but in the Vert.x Redis client. @Ladicek should know more, as he recently looked at this code.
I think that investigating what happens when a failure occurs in the Vert.x Redis client code would be a great first step. There should be exceptionHandlers and I suspect that they are not releasing the connection.
The Vert.x redis client code is in https://github.com/vert-x3/vertx-redis-client/tree/4.4. Select the 4.4 branch - it’s the one used in Quarkus (a forward port should be possible once we find the issue). Build it using
mvn clean install -DskipTests. Then override the version in your project, just add the dependency: