nats-server: Leaf node got stuck after downgrading from 2.9.16 to 2.9.14
I rolled back nats-server on my servers (1 hub and 18 leaf nodes) from 2.9.16 to 2.9.14 and noticed the following: the hub and 10 of the leaf nodes worked, but the other 8 leaf nodes returned {"status":"error","error":"failed to be ready for connections after 1ms: server"} in response to /healthz. At that moment the log contained the following messages:
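To see which of the 19 servers are affected after a rollback, one option is to poll `/healthz` on each of them and decode the JSON body. The Go sketch below is only an illustration: the host names are hypothetical, port 8222 is taken from the "Starting http monitor" line in the log, and it relies only on the `status` and `error` fields visible in the response quoted above, treating anything other than `"status":"ok"` as unhealthy.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// healthz mirrors the JSON body returned by the /healthz monitoring endpoint,
// e.g. {"status":"error","error":"failed to be ready for connections after 1ms: server"}.
type healthz struct {
	Status string `json:"status"`
	Error  string `json:"error,omitempty"`
}

// parseHealthz decodes a /healthz body and reports whether status is "ok",
// along with the error text (empty when healthy).
func parseHealthz(body []byte) (bool, string, error) {
	var h healthz
	if err := json.Unmarshal(body, &h); err != nil {
		return false, "", err
	}
	return h.Status == "ok", h.Error, nil
}

// checkNode fetches <baseURL>/healthz and returns nil only if the node reports "ok".
// The body is decoded regardless of the HTTP status code.
func checkNode(client *http.Client, baseURL string) error {
	resp, err := client.Get(baseURL + "/healthz")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	ok, msg, err := parseHealthz(body)
	if err != nil {
		return fmt.Errorf("bad healthz body %q: %w", body, err)
	}
	if !ok {
		return fmt.Errorf("unhealthy: %s", msg)
	}
	return nil
}

func main() {
	// Hypothetical monitor URLs; replace with the real hub and leaf node addresses.
	nodes := []string{
		"http://hub.example:8222",
		"http://leaf10.example:8222",
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, n := range nodes {
		if err := checkNode(client, n); err != nil {
			fmt.Printf("%s: %v\n", n, err)
		} else {
			fmt.Printf("%s: ok\n", n)
		}
	}
}
```

Running this after each version change gives a quick list of the nodes that are stuck, instead of querying them one by one.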
[107693] 2023/04/24 23:50:51.856361 [INF] Starting nats-server
[107693] 2023/04/24 23:50:51.856424 [INF] Version: 2.9.14
[107693] 2023/04/24 23:50:51.856435 [INF] Git: [74ae59a]
[107693] 2023/04/24 23:50:51.856437 [INF] Cluster: 10.193.104.173_4222
[107693] 2023/04/24 23:50:51.856439 [INF] Name: 10.193.104.173_4222
[107693] 2023/04/24 23:50:51.856443 [INF] Node: h07ZsChV
[107693] 2023/04/24 23:50:51.856445 [INF] ID: NDB2U5ITIBPIJKXWETVDXYWA2I4SBKSNLPNL62WYRYMNUPF6WUACLB3K
[107693] 2023/04/24 23:50:51.856531 [INF] Using configuration file: /opt/nats-server/conf/nats.conf
[107693] 2023/04/24 23:50:51.857180 [INF] Starting http monitor on 0.0.0.0:8222
[107693] 2023/04/24 23:50:51.857232 [INF] Starting JetStream
[107693] 2023/04/24 23:50:51.857351 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[107693] 2023/04/24 23:50:51.857355 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[107693] 2023/04/24 23:50:51.857358 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[107693] 2023/04/24 23:50:51.857360 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[107693] 2023/04/24 23:50:51.857361 [INF]
[107693] 2023/04/24 23:50:51.857363 [INF] https://docs.nats.io/jetstream
[107693] 2023/04/24 23:50:51.857365 [INF]
[107693] 2023/04/24 23:50:51.857366 [INF] ---------------- JETSTREAM ----------------
[107693] 2023/04/24 23:50:51.857370 [INF] Max Memory: 47.18 GB
[107693] 2023/04/24 23:50:51.857372 [INF] Max Storage: 186.26 GB
[107693] 2023/04/24 23:50:51.857374 [INF] Store Directory: "/storage/nats-server/data/jetstream"
[107693] 2023/04/24 23:50:51.857375 [INF] Domain: leaf10
[107693] 2023/04/24 23:50:51.857377 [INF] -------------------------------------------
[107693] 2023/04/24 23:50:51.857445 [INF] Standalone server started in clustered mode do not support extending domains
[107693] 2023/04/24 23:50:51.857450 [INF] Manually disable standalone mode by setting the JetStream Option "extension_hint: will_extend"
[107693] 2023/04/24 23:50:51.965891 [INF] Starting restore for stream 'USERS > CART'
[107693] 2023/04/24 23:50:51.966585 [INF] Restored 2 messages for stream 'USERS > CART'
[107693] 2023/04/24 23:50:51.966647 [INF] Starting restore for stream 'USERS > EXPORT_L10'
[107693] 2023/04/24 23:50:52.444906 [INF] Restored 2,660,964 messages for stream 'USERS > EXPORT_L10'
[107693] 2023/04/24 23:50:52.445017 [INF] Starting restore for stream 'USERS > IMPORT'
[107693] 2023/04/24 23:51:00.792338 [INF] Restored 33,094,465 messages for stream 'USERS > IMPORT'
[107693] 2023/04/24 23:51:00.792451 [INF] Starting restore for stream 'USERS > LOGS'
[107693] 2023/04/24 23:51:04.801854 [INF] Restored 12,677,807 messages for stream 'USERS > LOGS'
[107693] 2023/04/24 23:51:04.801952 [INF] Starting restore for stream 'USERS > REFBOOKS'
[107693] 2023/04/24 23:51:04.808080 [INF] Restored 3,787 messages for stream 'USERS > REFBOOKS'
[107693] 2023/04/24 23:51:04.808159 [INF] Starting restore for stream 'USERS > STORE'
[107693] 2023/04/24 23:51:05.006718 [INF] Restored 1,509,464 messages for stream 'USERS > STORE'
[107693] 2023/04/24 23:51:05.006859 [INF] Starting restore for stream 'USERS > STOREHOUSE'
[107693] 2023/04/24 23:51:05.007118 [INF] Restored 0 messages for stream 'USERS > STOREHOUSE'
[107693] 2023/04/24 23:51:05.007172 [INF] Starting restore for stream 'USERS > WAREHOUSE'
[107693] 2023/04/24 23:51:05.051818 [INF] Restored 678,236 messages for stream 'USERS > WAREHOUSE'
[107693] 2023/04/24 23:51:05.053164 [INF] Recovering 2091 consumers for stream - 'USERS > IMPORT'
I waited 12 hours, but the situation did not change. I then stopped nats-server on the leaf node, enabled tracing, started it again, and got the following messages in the log:
[74324] 2023/04/25 12:26:30.823054 [INF] Starting nats-server
[74324] 2023/04/25 12:26:30.823114 [INF] Version: 2.9.14
[74324] 2023/04/25 12:26:30.823117 [INF] Git: [74ae59a]
[74324] 2023/04/25 12:26:30.823119 [INF] Cluster: 10.193.104.173_4222
[74324] 2023/04/25 12:26:30.823125 [INF] Name: 10.193.104.173_4222
[74324] 2023/04/25 12:26:30.823129 [INF] Node: h07ZsChV
[74324] 2023/04/25 12:26:30.823133 [INF] ID: NCNNWD2Y2WRGRX2HXMVLEVSGUTGTGQFWQWIFOKQLBAF5OIFTTTL6EN7K
[74324] 2023/04/25 12:26:30.823186 [INF] Using configuration file: /opt/nats-server/conf/nats.conf
[74324] 2023/04/25 12:26:30.823662 [INF] Starting http monitor on 0.0.0.0:8222
[74324] 2023/04/25 12:26:30.823737 [INF] Starting JetStream
[74324] 2023/04/25 12:26:30.823901 [INF]     _ ___ _____ ___ _____ ___ ___   _   __  __
[74324] 2023/04/25 12:26:30.823905 [INF]  _ | | __|_   _/ __|_   _| _ \ __| /_\ |  \/  |
[74324] 2023/04/25 12:26:30.823908 [INF] | || | _|  | | \__ \ | | |   / _| / _ \| |\/| |
[74324] 2023/04/25 12:26:30.823910 [INF]  \__/|___| |_| |___/ |_| |_|_\___/_/ \_\_|  |_|
[74324] 2023/04/25 12:26:30.823912 [INF]
[74324] 2023/04/25 12:26:30.823914 [INF] https://docs.nats.io/jetstream
[74324] 2023/04/25 12:26:30.823915 [INF]
[74324] 2023/04/25 12:26:30.823917 [INF] ---------------- JETSTREAM ----------------
[74324] 2023/04/25 12:26:30.823921 [INF] Max Memory: 47.18 GB
[74324] 2023/04/25 12:26:30.823923 [INF] Max Storage: 186.26 GB
[74324] 2023/04/25 12:26:30.823925 [INF] Store Directory: "/storage/nats-server/data/jetstream"
[74324] 2023/04/25 12:26:30.823927 [INF] Domain: leaf10
[74324] 2023/04/25 12:26:30.823929 [INF] -------------------------------------------
[74324] 2023/04/25 12:26:30.824039 [INF] Standalone server started in clustered mode do not support extending domains
[74324] 2023/04/25 12:26:30.824045 [INF] Manually disable standalone mode by setting the JetStream Option "extension_hint: will_extend"
[74324] 2023/04/25 12:26:30.903656 [INF] Starting restore for stream 'USERS > CART'
[74324] 2023/04/25 12:26:30.904254 [INF] Restored 2 messages for stream 'USERS > CART'
[74324] 2023/04/25 12:26:30.904367 [INF] Starting restore for stream 'USERS > EXPORT_L10'
[74324] 2023/04/25 12:26:31.272141 [INF] Restored 2,279,592 messages for stream 'USERS > EXPORT_L10'
[74324] 2023/04/25 12:26:31.272255 [INF] Starting restore for stream 'USERS > IMPORT'
[74324] 2023/04/25 12:26:36.170175 [INF] Restored 29,896,437 messages for stream 'USERS > IMPORT'
[74324] 2023/04/25 12:26:36.170339 [INF] Starting restore for stream 'USERS > LOGS'
[74324] 2023/04/25 12:26:37.552399 [INF] Restored 7,718,886 messages for stream 'USERS > LOGS'
[74324] 2023/04/25 12:26:37.552514 [INF] Starting restore for stream 'USERS > REFBOOKS'
[74324] 2023/04/25 12:26:37.552982 [INF] Restored 3,410 messages for stream 'USERS > REFBOOKS'
[74324] 2023/04/25 12:26:37.553056 [INF] Starting restore for stream 'USERS > STORE'
[74324] 2023/04/25 12:26:37.732422 [INF] Restored 1,403,162 messages for stream 'USERS > STORE'
[74324] 2023/04/25 12:26:37.732546 [INF] Starting restore for stream 'USERS > STOREHOUSE'
[74324] 2023/04/25 12:26:37.732862 [INF] Restored 0 messages for stream 'USERS > STOREHOUSE'
[74324] 2023/04/25 12:26:37.732958 [INF] Starting restore for stream 'USERS > WAREHOUSE'
[74324] 2023/04/25 12:26:37.757799 [INF] Restored 626,225 messages for stream 'USERS > WAREHOUSE'
[74324] 2023/04/25 12:26:37.759187 [INF] Recovering 2091 consumers for stream - 'USERS > IMPORT'
[74324] 2023/04/25 12:26:38.237341 [WRN] Healthcheck failed: "failed to be ready for connections after 1ms: server"
[74324] 2023/04/25 12:26:39.249236 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_644521681302097737297600 45]
[74324] 2023/04/25 12:26:39.260640 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_650001681302704707175300 46]
[74324] 2023/04/25 12:26:39.270687 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_661961681302713507233800 47]
[74324] 2023/04/25 12:26:39.278153 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_684841681303311840164200 48]
[74324] 2023/04/25 12:26:39.285803 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.jsm_stream_pager_705201681305473333515300 49]
[74324] 2023/04/25 12:26:39.289626 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101000_IMPORT 50]
[74324] 2023/04/25 12:26:39.291589 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101721_IMPORT 51]
[74324] 2023/04/25 12:26:39.293930 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101722_IMPORT 52]
[74324] 2023/04/25 12:26:39.718037 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_102961_IMPORT 53]
[74324] 2023/04/25 12:26:40.163375 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_103070_IMPORT 54]
[74324] 2023/04/25 12:26:40.167427 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_103132_IMPORT 55]
<skip>
[74324] 2023/04/25 12:26:41.170817 [TRC] JETSTREAM - <-> [DELSUB 1]
[74324] 2023/04/25 12:26:41.181263 [TRC] JETSTREAM - <-> [DELSUB 2]
<skip>
[74324] 2023/04/25 13:22:35.236510 [TRC] JETSTREAM - <-> [DELSUB 1340]
[74324] 2023/04/25 13:22:35.416691 [TRC] JETSTREAM - <-> [DELSUB 841]
[74324] 2023/04/25 13:22:35.421053 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_422482_IMPORT 977]
[74324] 2023/04/25 13:22:35.426930 [TRC] JETSTREAM - <-> [DELSUB 842]
[74324] 2023/04/25 13:22:38.272068 [WRN] Healthcheck failed: "failed to be ready for connections after 1ms: server"
[74324] 2023/04/25 13:22:40.237434 [TRC] JETSTREAM - <-> [DELSUB 1341]
[74324] 2023/04/25 13:22:40.247839 [TRC] JETSTREAM - <-> [DELSUB 1342]
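With tracing enabled, the log shows a long run of `SUB` operations on `$JSC.CI.USERS.IMPORT.<consumer>` subjects interleaved with `DELSUB` operations, while the health check keeps failing for almost an hour. A rough Go sketch for tallying these from a saved log follows; the regular expressions assume the exact trace format shown above, and `tallyTrace` is a hypothetical helper, not a nats-server tool.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Match JetStream trace lines such as
//   "... [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101000_IMPORT 50]"
//   "... [TRC] JETSTREAM - <-> [DELSUB 1]"
var (
	subRE    = regexp.MustCompile(`\[TRC\] JETSTREAM - <<- \[SUB (\S+) (\d+)\]`)
	delsubRE = regexp.MustCompile(`\[TRC\] JETSTREAM - <-> \[DELSUB (\d+)\]`)
)

// traceStats holds the number of SUB and DELSUB trace lines and the subjects subscribed to.
type traceStats struct {
	Subs, Delsubs int
	Subjects      []string
}

// tallyTrace scans a log and counts SUB/DELSUB trace lines; other lines are ignored.
func tallyTrace(log string) traceStats {
	var st traceStats
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := subRE.FindStringSubmatch(line); m != nil {
			st.Subs++
			st.Subjects = append(st.Subjects, m[1])
		} else if delsubRE.MatchString(line) {
			st.Delsubs++
		}
	}
	return st
}

func main() {
	sample := `[74324] 2023/04/25 12:26:39.289626 [TRC] JETSTREAM - <<- [SUB $JSC.CI.USERS.IMPORT.replicator_101000_IMPORT 50]
[74324] 2023/04/25 12:26:41.170817 [TRC] JETSTREAM - <-> [DELSUB 1]
[74324] 2023/04/25 13:22:38.272068 [WRN] Healthcheck failed: "failed to be ready for connections after 1ms: server"`
	st := tallyTrace(sample)
	fmt.Printf("SUB=%d DELSUB=%d\n", st.Subs, st.Delsubs)
	for _, s := range st.Subjects {
		fmt.Println("subscribed:", s)
	}
}
```

Note that the number after `DELSUB` is a subscription ID (the log shows 1340 followed by 841), so the counts say how much subscribe/unsubscribe activity there was, not how many consumers have finished recovering.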
I waited another 12 hours. Similar messages kept arriving, but the leaf node was still unavailable. I then started 2.9.16, and the server became available within a minute.
About this issue
- State: closed
- Created a year ago
- Comments: 19 (11 by maintainers)
Have you tried adding the following to leafnode configs under the jetstream block?