nats-server: [JetStream] NATS chart deployment seems to break quorum "randomly" after about a week
Defect
As a user of OpenFaaS, NATS JetStream is part of our stack; OpenFaaS handles async requests via JetStream.
We have noticed that after some time has passed, roughly once every week or two, quorum appears to break, even though the pods created by the NATS StatefulSet are not being brought up or down. The PVCs are stable as well.
Load does not appear to be particularly high during this time either: roughly ~1,000 items are added to and removed from the queue in the span of about 10-15 minutes, so ~2,000 mutations overall.
Eventually, we see this log start to repeat from 1 or more of the pods:
```
[26] 2023/04/12 14:15:07.186881 [WRN] JetStream cluster consumer '$G > faas-request > faas-workers' has NO quorum, stalled.
[26] 2023/04/12 14:15:07.248099 [WRN] JetStream cluster stream '$G > faas-request' has NO quorum, stalled
```
Quorum breaks, and the stream effectively “jams”. We then go in and restart the offending pod, and things seem to come back online; no data appears to be lost.
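For anyone hitting the same thing, here is a minimal sketch (not part of the original report) of checking the RAFT state of the stream and consumer named in those warnings with nats.go before restarting anything. The connection URL / port-forward is an assumption about the environment; `faas-request` and `faas-workers` come straight from the log lines above.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

// printCluster dumps the RAFT group view (leader plus per-replica state) so a
// lagging or offline peer stands out before anything gets restarted.
func printCluster(label string, ci *nats.ClusterInfo) {
	if ci == nil {
		fmt.Printf("%s: no cluster info (not clustered?)\n", label)
		return
	}
	fmt.Printf("%s leader: %q\n", label, ci.Leader)
	for _, r := range ci.Replicas {
		fmt.Printf("  replica %s current=%v offline=%v lag=%d active=%s\n",
			r.Name, r.Current, r.Offline, r.Lag, r.Active)
	}
}

func main() {
	// Assumes local access to the cluster, e.g.
	//   kubectl port-forward svc/nats 4222:4222
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Names taken from the warnings above. If quorum is truly gone, these
	// info requests may simply time out, which is itself a useful signal.
	si, err := js.StreamInfo("faas-request")
	if err != nil {
		log.Fatalf("stream info: %v", err)
	}
	printCluster("stream 'faas-request'", si.Cluster)

	ci, err := js.ConsumerInfo("faas-request", "faas-workers")
	if err != nil {
		log.Fatalf("consumer info: %v", err)
	}
	printCluster("consumer 'faas-workers'", ci.Cluster)
}
```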
Make sure that these boxes are checked before submitting your issue – thank you!
- Included `nats-server -DV` output
- Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)
Maybe I’m not creating this issue in the correct repo, because `nats-server -DV` doesn’t appear to be the command I would expect to use here. I have server logs, but they’re in Datadog; there are ways to get them to the NATS team, though. Feel free to move this issue or guide me to what you need.
As for the MCVE, this issue is pretty hard to reliably reproduce in a timely manner, because quorum just kinda eventually breaks.
Versions of `nats-server` and affected client libraries used:
- `nats:2.9.15-alpine`
- not sure about the client libraries; OpenFaaS would be able to answer that.
OS/Container environment:
AWS EKS cluster @ k8s:1.26
Steps or code to reproduce the issue:
Expected result:
If no pods in the StatefulSet are restarted or rescheduled, I would generally not expect quorum to break.
Actual result:
- Quorum breaks with seemingly no infrastructure instability and a relatively low operation rate.
- Quorum doesn’t seem to fix itself; manual intervention is needed by terminating (what appears to be) the offending pod.
About this issue
- State: closed
- Created a year ago
- Comments: 32 (13 by maintainers)
After working on replication steps for this, I identified a “bug” in OpenFaaS (really a missing feature). Once I put a workaround in place for that issue, I have been unable to reproduce this one.
I’m closing this issue for now, because it appears the problem currently lives on the OpenFaaS side.
I also removed Istio sidecars from everything in the OpenFaaS namespace and verified that everything continues to work as expected, in order to remove additional moving parts.
@kevin-lindsay-1 - from go.mod: `github.com/nats-io/nats.go v1.24.0` - assuming you’re on the latest build of the queue worker / gateway.
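For completeness, a hypothetical pull-consume loop against that stream/consumer with nats.go v1.24.0 might look roughly like the sketch below. This is not OpenFaaS's actual queue-worker code; the empty subject simply relies on binding to the existing durable consumer, and the batch size and wait time are arbitrary.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bind to the existing durable consumer instead of creating a new one;
	// stream and consumer names are the ones from the quorum warnings.
	sub, err := js.PullSubscribe("", "faas-workers", nats.Bind("faas-request", "faas-workers"))
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
		if err != nil {
			// nats.ErrTimeout just means no pending work right now.
			if err != nats.ErrTimeout {
				log.Printf("fetch: %v", err)
			}
			continue
		}
		for _, m := range msgs {
			// ... hand the request off to the function here ...
			if err := m.Ack(); err != nil {
				log.Printf("ack: %v", err)
			}
		}
	}
}
```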