nats-server: Leafnodes getting disconnected on high traffic JS cluster with WriteDeadline exceeded error
Context and setup
- I have a 3-node hub cluster which connects to 4 different leaf nodes; each silo contains a single node with JetStream enabled.
- To avoid data flowing from one silo to another silo via the hub, I have created a bridge account with an ACL (a rough sketch of this setup is shown after this list; referenced from https://www.youtube.com/watch?v=0MkS_S7lyHk).
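For context, here is a minimal sketch of what such a bridge-account setup on the hub might look like. This is an assumption on my part (the real configs were posted in the earlier issue): the account name NEST_BRIDGE and leafnode port 4333 come from the logs below, while the user credentials and the "silo.>" subject space are purely illustrative.

```
# hub accounts config -- a minimal sketch, not the actual config from this setup
# (JetStream and cluster configuration omitted)
accounts {
  NEST_BRIDGE: {
    users: [ { user: bridge, password: bridge } ]
    # export only the subjects the silos are allowed to publish toward the hub
    exports: [
      { stream: "silo.>" }
    ]
  }
  HUB: {
    users: [ { user: hub, password: hub } ]
    # one-directional import: silo -> hub only
    imports: [
      { stream: { account: NEST_BRIDGE, subject: "silo.>" } }
    ]
  }
}

leafnodes {
  port: 4333
}
```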
Observation
- Leaf node connections get disconnected with a WriteDeadline exceeded error.
- Have increased the deadline from the default to 60s (see the snippet after this list), but we still observe this when the traffic flowing through is really high.
- Pull consumers connecting to streams on this hub cluster time out while consuming, and acks pile up.
- If the leaf nodes are directly connected without the cross-account setup, this is not observed.
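For reference, the deadline mentioned above is the top-level `write_deadline` server option; a minimal sketch of the change on the hub nodes:

```
# nats-server.conf on the hub nodes: raise the write deadline from its
# default, matching the "WriteDeadline of 1m0s exceeded" lines below
write_deadline: "60s"
```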
Here are the logs:
[9316] 2022/05/02 15:18:26.664422 [INF] 123.108.38.132:4333 - lid:2693 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
[9316] 2022/05/02 15:18:27.678241 [INF] 123.108.38.133:4333 - lid:2695 - Leafnode connection created for account: NEST_BRIDGE
[9316] 2022/05/02 15:19:16.773775 [INF] 123.108.38.135:4333 - lid:2694 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 23192 chunks of 1519850264 total bytes.
[9316] 2022/05/02 15:19:16.774039 [INF] 123.108.38.135:4333 - lid:2694 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 33680 chunks of 2207071394 total bytes.
[9316] 2022/05/02 15:19:16.774075 [INF] 123.108.38.135:4333 - lid:2694 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
[9316] 2022/05/02 15:19:17.783783 [INF] 123.108.38.134:4333 - lid:2696 - Leafnode connection created for account: NEST_BRIDGE
[9316] 2022/05/02 15:21:01.854899 [INF] 123.108.38.133:4333 - lid:2695 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 24624 chunks of 1613745557 total bytes.
[9316] 2022/05/02 15:21:01.855290 [INF] 123.108.38.133:4333 - lid:2695 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 42582 chunks of 2790522385 total bytes.
[9316] 2022/05/02 15:21:01.855304 [INF] 123.108.38.133:4333 - lid:2695 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
[9316] 2022/05/02 15:21:02.870848 [INF] 123.108.38.132:4333 - lid:2697 - Leafnode connection created for account: NEST_BRIDGE
[9316] 2022/05/02 15:21:13.107117 [INF] 123.108.38.134:4333 - lid:2696 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 17525 chunks of 1148463352 total bytes.
[9316] 2022/05/02 15:21:13.107466 [INF] 123.108.38.134:4333 - lid:2696 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 36860 chunks of 2415455897 total bytes.
[9316] 2022/05/02 15:21:13.107479 [INF] 123.108.38.134:4333 - lid:2696 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
Since this doesn't seem to happen without the cross-account setup, it doesn't look like a network limitation or anything of that sort. So is there a way to debug this further? Without the cross-account setup we would end up sending multiple GBs of data across the silos; we did the cross-account import precisely so that traffic flows from silo to hub and not the other way around.
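For reference, one way to collect more data while traffic is high is to enable the HTTP monitoring port on the hub nodes and poll the /leafz and /varz endpoints, which expose per-leafnode traffic counters and a server-wide slow_consumers count respectively (a minimal sketch, assuming the monitoring port is not already enabled):

```
# nats-server.conf -- expose the monitoring endpoint, then poll e.g.
# http://<hub-node>:8222/leafz and http://<hub-node>:8222/varz
http_port: 8222
```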
I have posted the full leaf node config and accounts config in the previous issue I created.
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 18 (10 by maintainers)
We have released 2.8.2, so please upgrade to that as well. We will take a deeper look at the issues above, and thanks for the updates!
@vividvilla The “good” news is that there is no deadlock, but the servers are really busy doing this kind of operation (excerpts omitted), which means processing service imports across different accounts and either decoding or encoding client JSON information, or having to set headers on the messages… so nothing “wrong” per se; I guess this is just becoming a bottleneck.
At this point, let’s see if @derekcollison or @matthiashanel (currently out sick) can have a look at the topology and see whether it is required to get what you want to achieve. If so, then we will have to see how that processing can be improved. If not, maybe some changes to the topology will avoid this performance issue…