nats-server: [JetStream] NATS chart deployment seems to break quorum "randomly" after about a week
Defect
As a user of OpenFaaS, NATS JetStream is part of our stack; OpenFaaS handles async requests via JetStream.
We have noticed that after some time has passed, roughly once every week or two, quorum appears to break, even though the pods created by the NATS StatefulSet are not being brought up or down. The PVCs are stable as well.
Load does not appear to be particularly high during this time either: roughly ~1,000 items are added to and removed from the queue in the span of about 10-15 minutes, so ~2,000 mutations overall.
Eventually, we see this log start to repeat from 1 or more of the pods:
```
[26] 2023/04/12 14:15:07.186881 [WRN] JetStream cluster consumer '$G > faas-request > faas-workers' has NO quorum, stalled.
[26] 2023/04/12 14:15:07.248099 [WRN] JetStream cluster stream '$G > faas-request' has NO quorum, stalled
```
Quorum breaks, and the stream effectively “jams”. We then go in and restart the offending pod, and things seem to come back online; no data appears to be lost.
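For anyone hitting the same thing, here is a minimal sketch (not part of the original report) of checking the RAFT state of the stream and consumer named in those warnings with nats.go before restarting anything. The connection URL / port-forward is an assumption about the environment; `faas-request` and `faas-workers` come straight from the log lines above.

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

// printCluster dumps the RAFT group view (leader plus per-replica state) so a
// lagging or offline peer stands out before anything gets restarted.
func printCluster(label string, ci *nats.ClusterInfo) {
	if ci == nil {
		fmt.Printf("%s: no cluster info (not clustered?)\n", label)
		return
	}
	fmt.Printf("%s leader: %q\n", label, ci.Leader)
	for _, r := range ci.Replicas {
		fmt.Printf("  replica %s current=%v offline=%v lag=%d active=%s\n",
			r.Name, r.Current, r.Offline, r.Lag, r.Active)
	}
}

func main() {
	// Assumes local access to the cluster, e.g.
	//   kubectl port-forward svc/nats 4222:4222
	nc, err := nats.Connect("nats://127.0.0.1:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Names taken from the warnings above. If quorum is truly gone, these
	// info requests may simply time out, which is itself a useful signal.
	si, err := js.StreamInfo("faas-request")
	if err != nil {
		log.Fatalf("stream info: %v", err)
	}
	printCluster("stream 'faas-request'", si.Cluster)

	ci, err := js.ConsumerInfo("faas-request", "faas-workers")
	if err != nil {
		log.Fatalf("consumer info: %v", err)
	}
	printCluster("consumer 'faas-workers'", ci.Cluster)
}
```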
Make sure that these boxes are checked before submitting your issue – thank you!
- Included `nats-server -DV` output
- Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)
Maybe I’m not creating this issue in the correct repo, because `nats-server -DV` doesn’t appear to be the command I would expect to use here. I have server logs, but they’re in Datadog; there are ways to get them to the NATS team, though. Feel free to move this issue or guide me to what you need.
As for the MCVE, this issue is pretty hard to reliably reproduce in a timely manner, because quorum just kinda eventually breaks.
Versions of `nats-server` and affected client libraries used:
- `nats:2.9.15-alpine`
- not sure about the client libraries; OpenFaaS would be able to answer that.
OS/Container environment:
AWS EKS cluster @ k8s:1.26
Steps or code to reproduce the issue:
Expected result:
If no pods in the StatefulSet are restarted or rescheduled, I would generally not expect quorum to break.
Actual result:
- Quorum breaks with seemingly no infrastructure instability and a relatively low operation rate.
- Quorum doesn’t seem to fix itself; manual intervention is needed by terminating (what appears to be) the offending pod.
About this issue
- State: closed
- Created a year ago
- Comments: 32 (13 by maintainers)
After working on replication steps for this, I identified a “bug” in OpenFaaS (really a missing feature). Once I put a workaround in place for that issue, I have been unable to reproduce this one.
I’m closing this issue for now, because it appears the problem currently lives on the OpenFaaS side.
I also removed Istio sidecars from everything in the OpenFaaS namespace and verified that everything continues to work as expected, in order to remove additional moving parts.
@kevin-lindsay-1 - from go.mod: `github.com/nats-io/nats.go v1.24.0` - assuming you’re on the latest build of the queue worker / gateway.
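For completeness, a hypothetical pull-consume loop against that stream/consumer with nats.go v1.24.0 might look roughly like the sketch below. This is not OpenFaaS's actual queue-worker code; the empty subject simply relies on binding to the existing durable consumer, and the batch size and wait time are arbitrary.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Bind to the existing durable consumer instead of creating a new one;
	// stream and consumer names are the ones from the quorum warnings.
	sub, err := js.PullSubscribe("", "faas-workers", nats.Bind("faas-request", "faas-workers"))
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(10, nats.MaxWait(5*time.Second))
		if err != nil {
			// nats.ErrTimeout just means no pending work right now.
			if err != nats.ErrTimeout {
				log.Printf("fetch: %v", err)
			}
			continue
		}
		for _, m := range msgs {
			// ... hand the request off to the function here ...
			if err := m.Ack(); err != nil {
				log.Printf("ack: %v", err)
			}
		}
	}
}
```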