nats-server: Leaf node got stuck after downgrading from 2.9.16 to 2.9.14

I rolled back nats-server on my servers (1 hub, 18 leaf nodes) from 2.9.16 to 2.9.14 and noticed the following: the hub and 10 of the leaf nodes worked, but 8 leaf nodes returned {"status":"error","error":"failed to be ready for connections after 1ms: server"} in response to /healthz. At that moment the log contained messages like these:

[107693] 2023/04/24 23:50:51.856361 [INF] Starting nats-server
[107693] 2023/04/24 23:50:51.856424 [INF]   Version:  2.9.14
[107693] 2023/04/24 23:50:51.856435 [INF]   Git:      [74ae59a]
[107693] 2023/04/24 23:50:51.856437 [INF]   Cluster:  10.193.104.173_4222
[107693] 2023/04/24 23:50:51.856439 [INF]   Name:     10.193.104.173_4222
[107693] 2023/04/24 23:50:51.856443 [INF]   Node:     h07ZsChV
[107693] 2023/04/24 23:50:51.856445 [INF]   ID:       NDB2U5ITIBPIJKXWETVDXYWA2I4SBKSNLPNL62WYRYMNUPF6WUACLB3K
[107693] 2023/04/24 23:50:51.856531 [INF] Using configuration file: /opt/nats-server/conf/nats.conf
[107693] 2023/04/24 23:50:51.857180 [INF] Starting http monitor on 0.0.0.0:8222
[107693] 2023/04/24 23:50:51.857232 [INF] Starting JetStream
[107693] 2023/04/24 23:50:51.857351 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[107693] 2023/04/24 23:50:51.857355 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[107693] 2023/04/24 23:50:51.857358 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[107693] 2023/04/24 23:50:51.857360 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[107693] 2023/04/24 23:50:51.857361 [INF] 
[107693] 2023/04/24 23:50:51.857363 [INF]          https://docs.nats.io/jetstream
[107693] 2023/04/24 23:50:51.857365 [INF] 
[107693] 2023/04/24 23:50:51.857366 [INF] ---------------- JETSTREAM ----------------
[107693] 2023/04/24 23:50:51.857370 [INF]   Max Memory:      47.18 GB
[107693] 2023/04/24 23:50:51.857372 [INF]   Max Storage:     186.26 GB
[107693] 2023/04/24 23:50:51.857374 [INF]   Store Directory: "/storage/nats-server/data/jetstream"
[107693] 2023/04/24 23:50:51.857375 [INF]   Domain:          leaf10
[107693] 2023/04/24 23:50:51.857377 [INF] -------------------------------------------
[107693] 2023/04/24 23:50:51.857445 [INF] Standalone server started in clustered mode do not support extending domains
[107693] 2023/04/24 23:50:51.857450 [INF] Manually disable standalone mode by setting the JetStream Option "extension_hint: will_extend"
[107693] 2023/04/24 23:50:51.965891 [INF]   Starting restore for stream 'USERS > CART'
[107693] 2023/04/24 23:50:51.966585 [INF]   Restored 2 messages for stream 'USERS > CART'
[107693] 2023/04/24 23:50:51.966647 [INF]   Starting restore for stream 'USERS > EXPORT_L10'
[107693] 2023/04/24 23:50:52.444906 [INF]   Restored 2,660,964 messages for stream 'USERS > EXPORT_L10'
[107693] 2023/04/24 23:50:52.445017 [INF]   Starting restore for stream 'USERS > IMPORT'
[107693] 2023/04/24 23:51:00.792338 [INF]   Restored 33,094,465 messages for stream 'USERS > IMPORT'
[107693] 2023/04/24 23:51:00.792451 [INF]   Starting restore for stream 'USERS > LOGS'
[107693] 2023/04/24 23:51:04.801854 [INF]   Restored 12,677,807 messages for stream 'USERS > LOGS'
[107693] 2023/04/24 23:51:04.801952 [INF]   Starting restore for stream 'USERS > REFBOOKS'
[107693] 2023/04/24 23:51:04.808080 [INF]   Restored 3,787 messages for stream 'USERS > REFBOOKS'
[107693] 2023/04/24 23:51:04.808159 [INF]   Starting restore for stream 'USERS > STORE'
[107693] 2023/04/24 23:51:05.006718 [INF]   Restored 1,509,464 messages for stream 'USERS > STORE'
[107693] 2023/04/24 23:51:05.006859 [INF]   Starting restore for stream 'USERS > STOREHOUSE'
[107693] 2023/04/24 23:51:05.007118 [INF]   Restored 0 messages for stream 'USERS > STOREHOUSE'
[107693] 2023/04/24 23:51:05.007172 [INF]   Starting restore for stream 'USERS > WAREHOUSE'
[107693] 2023/04/24 23:51:05.051818 [INF]   Restored 678,236 messages for stream 'USERS > WAREHOUSE'
[107693] 2023/04/24 23:51:05.053164 [INF]   Recovering 2091 consumers for stream - 'USERS > IMPORT'

I waited 12 hours, but the situation did not change. I then stopped nats-server on the leaf node, turned on trace logging, started it again and got the following messages in the log:

[74324] 2023/04/25 12:26:30.823054 [INF] Starting nats-server
[74324] 2023/04/25 12:26:30.823114 [INF]   Version:  2.9.14
[74324] 2023/04/25 12:26:30.823117 [INF]   Git:      [74ae59a]
[74324] 2023/04/25 12:26:30.823119 [INF]   Cluster:  10.193.104.173_4222
[74324] 2023/04/25 12:26:30.823125 [INF]   Name:     10.193.104.173_4222
[74324] 2023/04/25 12:26:30.823129 [INF]   Node:     h07ZsChV
[74324] 2023/04/25 12:26:30.823133 [INF]   ID:       NCNNWD2Y2WRGRX2HXMVLEVSGUTGTGQFWQWIFOKQLBAF5OIFTTTL6EN7K
[74324] 2023/04/25 12:26:30.823186 [INF] Using configuration file: /opt/nats-server/conf/nats.conf
[74324] 2023/04/25 12:26:30.823662 [INF] Starting http monitor on 0.0.0.0:8222
[74324] 2023/04/25 12:26:30.823737 [INF] Starting JetStream
[74324] 2023/04/25 12:26:30.823901 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[74324] 2023/04/25 12:26:30.823905 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[74324] 2023/04/25 12:26:30.823908 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[74324] 2023/04/25 12:26:30.823910 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[74324] 2023/04/25 12:26:30.823912 [INF] 
[74324] 2023/04/25 12:26:30.823914 [INF]          https://docs.nats.io/jetstream
[74324] 2023/04/25 12:26:30.823915 [INF] 
[74324] 2023/04/25 12:26:30.823917 [INF] ---------------- JETSTREAM ----------------
[74324] 2023/04/25 12:26:30.823921 [INF]   Max Memory:      47.18 GB
[74324] 2023/04/25 12:26:30.823923 [INF]   Max Storage:     186.26 GB
[74324] 2023/04/25 12:26:30.823925 [INF]   Store Directory: "/storage/nats-server/data/jetstream"
[74324] 2023/04/25 12:26:30.823927 [INF]   Domain:          leaf10
[74324] 2023/04/25 12:26:30.823929 [INF] -------------------------------------------
[74324] 2023/04/25 12:26:30.824039 [INF] Standalone server started in clustered mode do not support extending domains
[74324] 2023/04/25 12:26:30.824045 [INF] Manually disable standalone mode by setting the JetStream Option "extension_hint: will_extend"
[74324] 2023/04/25 12:26:30.903656 [INF]   Starting restore for stream 'USERS > CART'
[74324] 2023/04/25 12:26:30.904254 [INF]   Restored 2 messages for stream 'USERS > CART'
[74324] 2023/04/25 12:26:30.904367 [INF]   Starting restore for stream 'USERS > EXPORT_L10'
[74324] 2023/04/25 12:26:31.272141 [INF]   Restored 2,279,592 messages for stream 'USERS > EXPORT_L10'
[74324] 2023/04/25 12:26:31.272255 [INF]   Starting restore for stream 'USERS > IMPORT'
[74324] 2023/04/25 12:26:36.170175 [INF]   Restored 29,896,437 messages for stream 'USERS > IMPORT'
[74324] 2023/04/25 12:26:36.170339 [INF]   Starting restore for stream 'USERS > LOGS'
[74324] 2023/04/25 12:26:37.552399 [INF]   Restored 7,718,886 messages for stream 'USERS > LOGS'
[74324] 2023/04/25 12:26:37.552514 [INF]   Starting restore for stream 'USERS > REFBOOKS'
[74324] 2023/04/25 12:26:37.552982 [INF]   Restored 3,410 messages for stream 'USERS > REFBOOKS'
[74324] 2023/04/25 12:26:37.553056 [INF]   Starting restore for stream 'USERS > STORE'
[74324] 2023/04/25 12:26:37.732422 [INF]   Restored 1,403,162 messages for stream 'USERS > STORE'
[74324] 2023/04/25 12:26:37.732546 [INF]   Starting restore for stream 'USERS > STOREHOUSE'
[74324] 2023/04/25 12:26:37.732862 [INF]   Restored 0 messages for stream 'USERS > STOREHOUSE'
[74324] 2023/04/25 12:26:37.732958 [INF]   Starting restore for stream 'USERS > WAREHOUSE'
[74324] 2023/04/25 12:26:37.757799 [INF]   Restored 626,225 messages for stream 'USERS > WAREHOUSE'
[74324] 2023/04/25 12:26:37.759187 [INF]   Recovering 2091 consumers for stream - 'USERS > IMPORT'
[74324] 2023/04/25 12:26:38.237341 [WRN] Healthcheck failed: "failed to be ready for connections after 1ms: server"
[74324] 2023/04/25 12:26:39.249236 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_644521681302097737297600  45]
[74324] 2023/04/25 12:26:39.260640 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_650001681302704707175300  46]
[74324] 2023/04/25 12:26:39.270687 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_661961681302713507233800  47]
[74324] 2023/04/25 12:26:39.278153 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_684841681303311840164200  48]
[74324] 2023/04/25 12:26:39.285803 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_705201681305473333515300  49]
[74324] 2023/04/25 12:26:39.289626 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101000_IMPORT  50]
[74324] 2023/04/25 12:26:39.291589 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101721_IMPORT  51]
[74324] 2023/04/25 12:26:39.293930 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101722_IMPORT  52]
[74324] 2023/04/25 12:26:39.718037 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_102961_IMPORT  53]
[74324] 2023/04/25 12:26:40.163375 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_103070_IMPORT  54]
[74324] 2023/04/25 12:26:40.167427 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_103132_IMPORT  55]
<skip>
[74324] 2023/04/25 12:26:41.170817 [TRC] JETSTREAM - <-> [DELSUB 1]
[74324] 2023/04/25 12:26:41.181263 [TRC] JETSTREAM - <-> [DELSUB 2]
<skip>
[74324] 2023/04/25 13:22:35.236510 [TRC] JETSTREAM - <-> [DELSUB 1340]
[74324] 2023/04/25 13:22:35.416691 [TRC] JETSTREAM - <-> [DELSUB 841]
[74324] 2023/04/25 13:22:35.421053 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_422482_IMPORT  977]
[74324] 2023/04/25 13:22:35.426930 [TRC] JETSTREAM - <-> [DELSUB 842]
[74324] 2023/04/25 13:22:38.272068 [WRN] Healthcheck failed: "failed to be ready for connections after 1ms: server"
[74324] 2023/04/25 13:22:40.237434 [TRC] JETSTREAM - <-> [DELSUB 1341]
[74324] 2023/04/25 13:22:40.247839 [TRC] JETSTREAM - <-> [DELSUB 1342]

I waited another 12 hours. Similar messages kept arriving, but the leaf node was still unavailable. I then started 2.9.16, and the server became available within a minute.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (11 by maintainers)

Most upvoted comments

Have you tried adding the following to the leaf node configs, under the jetstream block?

extension_hint: will_extend
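A minimal sketch of where the suggested option would go, assuming the leaf's configuration file (/opt/nats-server/conf/nats.conf, the path from the startup log) already contains a jetstream block. Only the domain value (leaf10, taken from this leaf's startup log) and the suggested option are shown. The rest of the original config was not posted, so it is elided here rather than guessed.

```
jetstream {
  # ... any existing JetStream settings left unchanged ...

  # Domain reported in this leaf's startup log.
  domain: leaf10

  # Option named in the startup log hint: manually disables standalone
  # mode, which the log says does not support extending domains.
  extension_hint: will_extend
}
```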