nats-streaming-server: Frequent leader elections / multiple route connections created / cluster node becomes unresponsive
nats-streaming-server version 0.10.2, running in a customized Docker container on AWS ECS. We managed to reproduce the problem on a local workstation that runs Docker with Hyper-V integration.
We have stability issues running NATS Streaming in cluster mode. Our cluster consists of 3 Docker containers. As soon as a leader is re-elected, clients appear to have problems using their NATS connections. Some NATS Streaming server instances even seem to refuse connections for a limited time before either recovering or becoming unresponsive altogether. The only workaround we currently have is to shut down the affected container and create a new one whose NATS instance rejoins the existing cluster, effectively replacing the frozen instance.
We are currently using the following configurations for our cluster, consisting of nodes "a", "b" and "c"; the configuration shown matches the workstation test environment. The test setup consists of one producer publishing up to ~120 messages per 10 seconds on a single channel, and one subscriber listening on that channel at the same time.
# NATS specific configuration
port: 4222
http_port: 8222
cluster {
  listen: 0.0.0.0:6222
  routes: [
    "nats://hostip:6223",
    "nats://hostip:6224"
  ]
}
# NATS Streaming specific configuration
streaming {
  store: file
  dir: /mnt/natsdata/a/store
  sd: true
  cluster {
    node_id: "a"
    peers: ["b", "c"]
    log_path: /mnt/natsdata/a/log
  }
  store_limits: {
    max_channels: 1024
    max_msgs: 1000000
    max_bytes: 2GB
    max_age: "336h"
  }
}

# NATS specific configuration
port: 4223
http_port: 8223
cluster {
  listen: 0.0.0.0:6223
  routes: [
    "nats://hostip:6222",
    "nats://hostip:6224"
  ]
}
# NATS Streaming specific configuration
streaming {
  store: file
  dir: /mnt/natsdata/b/store
  sd: true
  cluster {
    node_id: "b"
    peers: ["a", "c"]
    log_path: /mnt/natsdata/b/log
  }
  store_limits: {
    max_channels: 1024
    max_msgs: 1000000
    max_bytes: 2GB
    max_age: "336h"
  }
}

# NATS specific configuration
port: 4224
http_port: 8224
cluster {
  listen: 0.0.0.0:6224
  routes: [
    "nats://hostip:6222",
    "nats://hostip:6223"
  ]
}
# NATS Streaming specific configuration
streaming {
  store: file
  dir: /mnt/natsdata/c/store
  sd: true
  cluster {
    node_id: "c"
    peers: ["a", "b"]
    log_path: /mnt/natsdata/c/log
  }
  store_limits: {
    max_channels: 1024
    max_msgs: 1000000
    max_bytes: 2GB
    max_age: "336h"
  }
}
We then use the above configurations to build three different Docker containers from the following Dockerfile:
FROM alpine:3.7
LABEL NATS Streaming Customized
ARG MAIN_PORT=4222
ARG CLUSTER_PORT=6222
ARG HTTP_PORT=8222
ARG CONFIG_FILE
ENV NATSS_VERSION=v0.10.2
EXPOSE ${MAIN_PORT} ${CLUSTER_PORT} ${HTTP_PORT}
RUN apk update && \
    apk add --no-cache ca-certificates openssl
RUN wget -q https://github.com/nats-io/nats-streaming-server/releases/download/${NATSS_VERSION}/nats-streaming-server-${NATSS_VERSION}-linux-amd64.zip -O /tmp/nats-streaming-server.zip && \
    unzip /tmp/nats-streaming-server.zip -d / && \
    mv /nats-streaming-server-${NATSS_VERSION}-linux-amd64 /opt/ && \
    rm /tmp/nats-streaming-server.zip
RUN apk add --no-cache bash
COPY ${CONFIG_FILE} /opt/config.cfg
CMD ["/opt/nats-streaming-server", "-sc", "/opt/config.cfg"]
Below is log output that made us wonder whether everything is working as intended. The host IP in the output was replaced with hostip. The three NATS instances connect to each other via hostip:6222, hostip:6223 and hostip:6224.
The first striking thing is the number of leader elections while running:
[1] 2018/07/18 09:29:20.330066 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:29:20.379751 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:29:21.578353 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:29:21.737896 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:29:23.454058 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:29:23.954671 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:29:23.954688 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:29:23.954693 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:29:24.757142 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:29:25.779429 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:29:26.274014 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:29:26.274028 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:29:26.274033 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:29:27.991717 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:29:30.789739 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:29:31.289910 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:29:31.289938 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:29:31.289944 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:29:31.724181 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:29:32.483502 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:29:33.064888 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:29:33.064901 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:29:33.064904 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:29:33.408147 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:29:34.332365 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:29:34.832570 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:29:34.832658 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:29:34.832679 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:29:35.234934 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:29:36.446047 [INF] STREAM: Deleting raft logs from 20057 to 20058
[1] 2018/07/18 09:29:36.446191 [INF] STREAM: Deletion took 142.6µs
...
[1] 2018/07/18 09:47:25.611127 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:47:27.091716 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:47:27.497898 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:47:27.594223 [INF] STREAM: finished leader promotion actions
...
[1] 2018/07/18 09:03:45.175753 [DBG] STREAM: [Client:781f8190-8a69-11e8-b458-f94730128e42] Added member to durable queue subscription, subject=PRODUCT.ERP, inbox=_INBOX.AFDCFN2DDM38CH7BN5OADD, queue=AQueue Consumer:MyConsumer, subid=4
[1] 2018/07/18 09:03:51.752622 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:03:51.821795 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:03:52.741442 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:03:53.088813 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:03:54.350080 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:03:55.192489 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:03:56.821066 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:03:58.945087 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:03:58.945167 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:03:59.385226 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:03:59.385303 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:03:59.385309 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:03:59.661667 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:04:00.926032 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:04:01.000600 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:04:04.225938 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:04:04.255029 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:04:05.569304 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/07/18 09:04:06.069551 [INF] STREAM: finished leader promotion actions
[1] 2018/07/18 09:04:06.069661 [ERR] STREAM: Error on leadership acquired: leadership lost while committing log
[1] 2018/07/18 09:04:06.069735 [INF] STREAM: server lost leadership, performing leader stepdown actions
[1] 2018/07/18 09:04:06.104415 [INF] STREAM: finished leader stepdown actions
[1] 2018/07/18 09:04:08.302767 [INF] STREAM: Deleting raft logs from 7843 to 7845
[1] 2018/07/18 09:04:08.302988 [INF] STREAM: Deletion took 218.2µs
The second is that after a leader election concludes, lots of route connections are created to the same cluster node.
[1] 2018/07/18 09:29:54.425968 [INF] hostip:6223 - rid:13 - Route connection created
[1] 2018/07/18 09:29:55.464313 [INF] hostip:6223 - rid:14 - Route connection created
[1] 2018/07/18 09:29:56.537748 [INF] hostip:6223 - rid:15 - Route connection created
[1] 2018/07/18 09:29:57.657936 [INF] hostip:6223 - rid:16 - Route connection created
[1] 2018/07/18 09:29:57.816937 [INF] hostip:6223 - rid:17 - Route connection created
[1] 2018/07/18 09:29:57.915439 [INF] hostip:6223 - rid:18 - Route connection created
[1] 2018/07/18 09:29:58.072570 [INF] hostip:6223 - rid:19 - Route connection created
[1] 2018/07/18 09:29:58.163819 [INF] hostip:6223 - rid:20 - Route connection created
[1] 2018/07/18 09:29:58.275567 [INF] hostip:6223 - rid:21 - Route connection created
[1] 2018/07/18 09:29:58.454568 [INF] hostip:6223 - rid:22 - Route connection created
[1] 2018/07/18 09:29:58.583304 [INF] hostip:6223 - rid:23 - Route connection created
...
[1] 2018/07/18 09:30:01.380984 [INF] hostip:6223 - rid:111 - Route connection created
[1] 2018/07/18 09:30:01.413447 [INF] hostip:6223 - rid:113 - Route connection created
[1] 2018/07/18 09:30:01.413521 [INF] hostip:6223 - rid:112 - Route connection created
[1] 2018/07/18 09:30:01.436997 [INF] hostip:6223 - rid:114 - Route connection created
[1] 2018/07/18 09:30:01.446385 [INF] hostip:6223 - rid:115 - Route connection created
[1] 2018/07/18 09:30:01.457837 [INF] hostip:6223 - rid:116 - Route connection created
[1] 2018/07/18 09:30:01.458015 [INF] hostip:6223 - rid:117 - Route connection created
[1] 2018/07/18 09:30:01.466721 [INF] hostip:6223 - rid:118 - Route connection created
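For reference, the route count per node can also be checked on the standard NATS monitoring endpoint exposed on each node's http_port, for example for node "a":
# List active route connections on node "a" (http_port 8222 in the config above);
# nodes "b" and "c" use 8223 and 8224 respectively.
curl -s http://hostip:8222/routez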
Can you tell us whether the behaviour shown in the logs is intended? If so, do you have recommendations on how to improve the setup?
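For reference, the producer and durable queue subscriber in our test behave roughly like the following Go sketch (not our actual client code; written against the stan.go client, with the subject taken from the log output above and the cluster ID, client ID, queue, durable name and URL as placeholders):
// Minimal sketch of the test clients: one publisher pushing roughly 120
// messages per 10 seconds on PRODUCT.ERP, one durable queue subscriber.
package main

import (
	"fmt"
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	sc, err := stan.Connect("test-cluster", "test-client",
		stan.NatsURL("nats://hostip:4222"))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Durable queue subscription on the same channel.
	sub, err := sc.QueueSubscribe("PRODUCT.ERP", "workers",
		func(m *stan.Msg) { log.Printf("received seq=%d", m.Sequence) },
		stan.DurableName("my-durable"))
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	// Publish roughly 120 messages every 10 seconds (~83ms apart).
	for i := 0; ; i++ {
		if err := sc.Publish("PRODUCT.ERP", []byte(fmt.Sprintf("msg-%d", i))); err != nil {
			log.Printf("publish error: %v", err)
		}
		time.Sleep(10 * time.Second / 120)
	}
}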
About this issue
- State: closed
- Created 6 years ago
- Comments: 20 (17 by maintainers)
Sorry to dig up an old issue… but I'm still able to reproduce this on 0.11.4, with small messages…
I made a reproducible setup here if anyone can take a look: https://github.com/caarlos0/nats-streaming-server-cluster/pull/1
On our production workload we are using Java clients and producers and get the same issue… several re-elections in a row most of the time…
@caarlos0 As explained on Slack, it seems that you are producing very BIG messages (100MB), and NATS in general is not really meant to be used with single messages that big (they could be chunked though). With Streaming, you add persistence to the mix, and in clustering mode each message is replicated (actually persisted both in the RAFT log for replication and in the streaming channel message log), so it is not really the best fit. You have observed slow consumers on the routes between nodes, which will cause leader elections.
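To illustrate the chunking idea mentioned above (only a sketch, not something the server or client provides out of the box): a publisher could split a large payload into sub-1MB chunks and publish each chunk separately, with the consumer reassembling them from an application-defined frame header.
package chunker

import (
	"fmt"

	stan "github.com/nats-io/stan.go"
)

// publishChunked splits a large payload into pieces well under the 1MB NATS
// payload limit and publishes them individually. The "index/total;" framing
// is purely application-defined; nothing in stan.go does this for you.
func publishChunked(sc stan.Conn, subject string, data []byte) error {
	const chunkSize = 512 * 1024 // headroom below the 1MB default limit
	total := (len(data) + chunkSize - 1) / chunkSize
	for i := 0; i < total; i++ {
		end := (i + 1) * chunkSize
		if end > len(data) {
			end = len(data)
		}
		framed := append([]byte(fmt.Sprintf("%d/%d;", i, total)), data[i*chunkSize:end]...)
		if err := sc.Publish(subject, framed); err != nil {
			return err
		}
	}
	return nil
}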
Quick question: doesn't stan.java have the ping settings? (Or maybe it's just that I'm not able to find them.) EDIT: found it, it's in nats.java it seems…
The default max payload in NATS (as you probably know) is 1MB, and in the case of NATS Streaming the usable payload is actually a bit lower due to the protobuf overhead. Again, as described before, big messages are not the best fit with NATS, especially if there are lots of matching subscriptions.
That being said, even with small messages, you have shown that there are cases where the traffic is enough to cause leader re-election, which happens when nodes are not able to contact their peers (RAFT heartbeats). The thing I don't like is that this causes the client PINGs to fail: while the leader steps down it unsubscribes from the client PING requests, and since the re-election takes more than 2 seconds and your test case uses Pings(1, 3), the application is likely to fail before the new leader is elected. You could set it to Pings(1, 10) and the applications would likely not fail with the PINGs error. This is not the root of the issue, though; the issue is that the RAFT heartbeats get delayed by the replication traffic, and unfortunately I am not sure I can solve that problem. You can, however, increase the RAFT heartbeat timeout to minimize the risk of that re-election. To increase the timeout to 2s, you would set raft_heartbeat_timeout: "2s", for instance.
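To make that concrete, here is a config-file sketch of where the timeout would go, assuming the raft_* options live inside the streaming cluster block (shown for node "a" from the configuration above); the client-side counterpart is the Pings(1, 10) connection option already mentioned.
# Sketch: raising the RAFT heartbeat timeout for node "a"
streaming {
  cluster {
    node_id: "a"
    peers: ["b", "c"]
    log_path: /mnt/natsdata/a/log
    raft_heartbeat_timeout: "2s"
    # raft_election_timeout and raft_lease_timeout are related options
  }
}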