nats-server: Leafnodes getting disconnected on high traffic JS cluster with WriteDeadline exceeded error

Context and setup

  • Have a 3 node hub cluster which connects to 4 different leaf nodes where each silo contains single node JS enabled.
  • To avoid flowing data from one silo to another silo node via hub I have created a bridge account with ACL (referred from: https://www.youtube.com/watch?v=0MkS_S7lyHk.

Observation

  • Leaf node connections gets disconnected with WriteDeadline exceeded error.
  • Have increased the deadline from default to 60s but still observing this when traffic flowing through is really high.
  • Pull consumers connecting to streams on this hub cluster timeout while consuming and acks piles up.
  • If leaf nodes are directly connected without cross account setup then this is not observed.

Here is the logs.

[9316] 2022/05/02 15:18:26.664422 [INF] 123.108.38.132:4333 - lid:2693 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
[9316] 2022/05/02 15:18:27.678241 [INF] 123.108.38.133:4333 - lid:2695 - Leafnode connection created for account: NEST_BRIDGE
[9316] 2022/05/02 15:19:16.773775 [INF] 123.108.38.135:4333 - lid:2694 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 23192 chunks of 1519850264 total bytes.
[9316] 2022/05/02 15:19:16.774039 [INF] 123.108.38.135:4333 - lid:2694 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 33680 chunks of 2207071394 total bytes.
[9316] 2022/05/02 15:19:16.774075 [INF] 123.108.38.135:4333 - lid:2694 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
[9316] 2022/05/02 15:19:17.783783 [INF] 123.108.38.134:4333 - lid:2696 - Leafnode connection created for account: NEST_BRIDGE
[9316] 2022/05/02 15:21:01.854899 [INF] 123.108.38.133:4333 - lid:2695 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 24624 chunks of 1613745557 total bytes.
[9316] 2022/05/02 15:21:01.855290 [INF] 123.108.38.133:4333 - lid:2695 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 42582 chunks of 2790522385 total bytes.
[9316] 2022/05/02 15:21:01.855304 [INF] 123.108.38.133:4333 - lid:2695 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE
[9316] 2022/05/02 15:21:02.870848 [INF] 123.108.38.132:4333 - lid:2697 - Leafnode connection created for account: NEST_BRIDGE
[9316] 2022/05/02 15:21:13.107117 [INF] 123.108.38.134:4333 - lid:2696 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 17525 chunks of 1148463352 total bytes.
[9316] 2022/05/02 15:21:13.107466 [INF] 123.108.38.134:4333 - lid:2696 - Slow Consumer Detected: WriteDeadline of 1m0s exceeded with 36860 chunks of 2415455897 total bytes.
[9316] 2022/05/02 15:21:13.107479 [INF] 123.108.38.134:4333 - lid:2696 - Leafnode connection closed: Slow Consumer (Write Deadline) account: NEST_BRIDGE

Since without cross account this doesn’t seems to be happening it doesn’t seems like a network limitation or some sort. So is there way to debug this further? without cross account setup we will end up sending multiple GBs of data across the silos and we did this cross account import to restrict the traffic flow from silo to hub and not the other way around.

I have posted full leaf node config and accounts config in previous issue I have created.

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 18 (10 by maintainers)

Most upvoted comments

We have released 2.8.2, so please upgrade to that as well. We will take a deeper look at the issues above, and thanks for the updates!

@vividvilla The “good” news, is that there is no deadlock, but servers are really busy doing this kind of operations:

goroutine 129130 [runnable]:
bytes.(*Buffer).WriteString(0xc0024a9140, {0xc001b36690, 0x2f})
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/bytes/buffer.go:180 +0xfa
net/http.Header.writeSubset(0xa0fe80, {0xb0f280, 0xc0024a9140}, 0x11, 0x0)
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/net/http/header.go:199 +0x484
net/http.Header.write(...)
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/net/http/header.go:87
net/http.Header.Write(...)
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/net/http/header.go:83
github.com/nats-io/nats-server/v2/server.(*client).setHeader(0xc00112c000, {0xa4432d, 0x11}, {0xc001b36690, 0x2f}, {0xc0008ddbc8, 0x2c0, 0x438})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3795 +0x225
github.com/nats-io/nats-server/v2/server.(*client).processServiceImport(0xc00112c000, 0xc0001a0ea0, 0xc00022a240, {0xc0008ddbc8, 0x2c0, 0x438})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3965 +0xa26
github.com/nats-io/nats-server/v2/server.(*Account).addServiceImportSub.func1(0xc0010c0770, 0x40cc74, 0x7f16e79f4f18, {0xc0000dc305, 0xc0010c07a8}, {0x697851, 0x101010000980220}, {0xc0008ddbc8, 0x2c0, 0x438})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1935 +0x32
github.com/nats-io/nats-server/v2/server.(*client).deliverMsg(0xc00112c000, 0xc00002f200, 0x5, {0xc0006236a0, 0x41482f, 0x10}, {0x0, 0x7f16c10c2fff, 0x0}, {0xc00112d480, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3194 +0xc4d
github.com/nats-io/nats-server/v2/server.(*client).processMsgResults(0xc00112c000, 0xc00022a240, 0xc000d9be30, {0xc0008ddbc8, 0x2c0, 0x438}, {0x0, 0x0, 0x0}, {0xc0006236a0, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4189 +0xa75
github.com/nats-io/nats-server/v2/server.(*client).processServiceImport(0xc00112c000, 0xc0001a0bd0, 0xc00022a240, {0xc0008ddbc8, 0x2c0, 0x438})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4005 +0xed1
github.com/nats-io/nats-server/v2/server.(*Account).addServiceImportSub.func1(0x0, 0x939557, 0xc0000d221e, {0x19, 0x1}, {0x0, 0x2b2}, {0xc0008ddbc8, 0x2c0, 0x438})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1935 +0x32
github.com/nats-io/nats-server/v2/server.(*client).deliverMsg(0xc00112c000, 0xc00002ee40, 0x40, {0xc00171d740, 0xc00046e4c0, 0x40}, {0x0, 0x995d40, 0x0}, {0xc00112d480, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3194 +0xc4d
github.com/nats-io/nats-server/v2/server.(*client).processMsgResults(0xc00112c000, 0x0, 0xc000d9be30, {0xc0008ddbc8, 0x2c0, 0x438}, {0x0, 0x0, 0x0}, {0xc0008ddbb5, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4189 +0xa75
github.com/nats-io/nats-server/v2/server.(*client).processInboundLeafMsg(0xc00112c000, {0xc0008ddbc8, 0xc0004a68c0, 0x438})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/leafnode.go:2231 +0x3a5
github.com/nats-io/nats-server/v2/server.(*client).processInboundMsg(0xc00112c000, {0xc0008ddbc8, 0x11, 0xc8d})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3516 +0x51
github.com/nats-io/nats-server/v2/server.(*client).parse(0xc00112c000, {0xc0008dc000, 0x2000, 0x4000})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:497 +0x246a
github.com/nats-io/nats-server/v2/server.(*client).readLoop(0xc00112c000, {0x0, 0x0, 0x0})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1229 +0xe1f
github.com/nats-io/nats-server/v2/server.(*Server).createLeafNode.func1()
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/leafnode.go:932 +0x29
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3013 +0x87

or

goroutine 129789 [runnable]:
encoding/json.Marshal({0x9a2700, 0xc0012cc000})
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/encoding/json/encode.go:158 +0x127
github.com/nats-io/nats-server/v2/server.(*client).processServiceImport(0xc00022f300, 0xc0001a0bd0, 0xc00022a240, {0xc0008211a4, 0x225, 0x2e5c})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3964 +0x9b9
github.com/nats-io/nats-server/v2/server.(*Account).addServiceImportSub.func1(0xc00109d230, 0x0, 0xc00109d240, {0x41642f, 0x764a09}, {0x403651, 0xc00109d248}, {0xc0008211a4, 0x225, 0x2e5c})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1935 +0x32
github.com/nats-io/nats-server/v2/server.(*client).deliverMsg(0xc00022f300, 0xc00002ee40, 0xa80, {0xc00171d7c0, 0xffffffffffffffff, 0x40}, {0x0, 0x41642f, 0x0}, {0xc000230781, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3194 +0xc4d
github.com/nats-io/nats-server/v2/server.(*client).processMsgResults(0xc00022f300, 0x0, 0xc000d9be30, {0xc0008211a4, 0x225, 0x2e5c}, {0x0, 0x0, 0x0}, {0xc000821194, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4189 +0xa75
github.com/nats-io/nats-server/v2/server.(*client).processInboundLeafMsg(0xc00022f300, {0xc0008211a4, 0xc0006a2c40, 0x2e5c})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/leafnode.go:2231 +0x3a5
github.com/nats-io/nats-server/v2/server.(*client).processInboundMsg(0xc00022f300, {0xc0008211a4, 0xe, 0x2e6c})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3516 +0x51
github.com/nats-io/nats-server/v2/server.(*client).parse(0xc00022f300, {0xc000820000, 0x13c9, 0x1000})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:497 +0x246a
github.com/nats-io/nats-server/v2/server.(*client).readLoop(0xc00022f300, {0x0, 0x0, 0x0})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1229 +0xe1f
github.com/nats-io/nats-server/v2/server.(*Server).createLeafNode.func1()
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/leafnode.go:932 +0x29
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3013 +0x87

or that…

goroutine 127118 [runnable]:
sync.(*Pool).pin(0xe897e0)
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/sync/pool.go:195 +0x65
sync.(*Pool).Get(0xe897e0)
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/sync/pool.go:128 +0x25
encoding/json.newEncodeState()
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/encoding/json/encode.go:305 +0x25
encoding/json.Marshal({0x9a2700, 0xc00118a3c0})
	/home/travis/.gimme/versions/go1.17.9.linux.amd64/src/encoding/json/encode.go:159 +0x27
github.com/nats-io/nats-server/v2/server.(*client).processServiceImport(0xc000230c80, 0xc0001a0ea0, 0xc00022a240, {0xc001476c8c, 0x264, 0x7374})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3964 +0x9b9
github.com/nats-io/nats-server/v2/server.(*Account).addServiceImportSub.func1(0xc0010c0770, 0x40cc74, 0x7f16e79f4f18, {0xc0000dc305, 0xc0010c07a8}, {0x697851, 0x101010000980220}, {0xc001476c8c, 0x264, 0x7374})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1935 +0x32
github.com/nats-io/nats-server/v2/server.(*client).deliverMsg(0xc000230c80, 0xc00002f200, 0x5, {0xc001aa8bc0, 0x17, 0x10}, {0x0, 0x0, 0x0}, {0xc000232100, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3194 +0xc4d
github.com/nats-io/nats-server/v2/server.(*client).processMsgResults(0xc000230c80, 0x400, 0xc000d9be30, {0xc001476c8c, 0x264, 0x7374}, {0x0, 0x0, 0x0}, {0xc001aa8bc0, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4189 +0xa75
github.com/nats-io/nats-server/v2/server.(*client).processServiceImport(0xc000230c80, 0xc0001a0bd0, 0xc00022a240, {0xc001476c8c, 0x264, 0x7374})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4005 +0xed1
github.com/nats-io/nats-server/v2/server.(*Account).addServiceImportSub.func1(0xc0010c1248, 0x939557, 0xc001c46a5e, {0x19, 0x22}, {0xc00049e300, 0xc0010c1248}, {0xc001476c8c, 0x264, 0x7374})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/accounts.go:1935 +0x32
github.com/nats-io/nats-server/v2/server.(*client).deliverMsg(0xc000230c80, 0xc00002ee40, 0x14, {0xc0014def40, 0xffffffffffffffff, 0x40}, {0x0, 0x995d40, 0x0}, {0xc000232100, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3194 +0xc4d
github.com/nats-io/nats-server/v2/server.(*client).processMsgResults(0xc000230c80, 0x0, 0xc000d9be30, {0xc001476c8c, 0x264, 0x7374}, {0x0, 0x0, 0x0}, {0xc001476c79, ...}, ...)
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:4189 +0xa75
github.com/nats-io/nats-server/v2/server.(*client).processInboundLeafMsg(0xc000230c80, {0xc001476c8c, 0xc0001de2c0, 0x7374})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/leafnode.go:2231 +0x3a5
github.com/nats-io/nats-server/v2/server.(*client).processInboundMsg(0xc000230c80, {0xc001476c8c, 0x11, 0x4e3331})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:3516 +0x51
github.com/nats-io/nats-server/v2/server.(*client).parse(0xc000230c80, {0xc001476000, 0x1f29, 0x1000})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/parser.go:497 +0x246a
github.com/nats-io/nats-server/v2/server.(*client).readLoop(0xc000230c80, {0x0, 0x0, 0x0})
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/client.go:1229 +0xe1f
github.com/nats-io/nats-server/v2/server.(*Server).createLeafNode.func1()
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/leafnode.go:932 +0x29
created by github.com/nats-io/nats-server/v2/server.(*Server).startGoRoutine
	/home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:3013 +0x87

Which means processing service imports across different accounts and either decoding or encoding client JSON information or having to set headers to the messages… so nothing “wrong” per-se, just I guess this is becoming a bottle neck.

At this point, let’s see if @derekcollison or @matthiashanel (out sick currently) can have a look at the topology and see if that is required to get what you want to achieve. If so, then we will have to see how that processing can be improved. If not, maybe some changes in the topology will not lead to this performance issue…