nats-server: Consumer stopped working after errPartialCache (nats-server oom-killed)
Defect
Make sure that these boxes are checked before submitting your issue – thank you!
- [x] Included `nats-server -DV` output
- [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

Versions of `nats-server` and affected client libraries used:
```
# nats-server -DV
[92] 2021/12/06 15:16:05.235349 [INF] Starting nats-server
[92] 2021/12/06 15:16:05.235397 [INF] Version: 2.6.6
[92] 2021/12/06 15:16:05.235401 [INF] Git: [878afad]
[92] 2021/12/06 15:16:05.235406 [DBG] Go build: go1.16.10
[92] 2021/12/06 15:16:05.235416 [INF] Name: NASX72BQAFBIH4QBLZ36RADTPKSO6LCKRDEAS37XRJ7SYZ53RYYOFHHS
[92] 2021/12/06 15:16:05.235436 [INF] ID: NASX72BQAFBIH4QBLZ36RADTPKSO6LCKRDEAS37XRJ7SYZ53RYYOFHHS
[92] 2021/12/06 15:16:05.235457 [DBG] Created system account: "$SYS"
```
```
Image: nats:2.6.6-alpine
Limits:
  cpu:     200m
  memory:  256Mi
Requests:
  cpu:     200m
  memory:  256Mi
```
Go client library: `github.com/nats-io/nats.go v1.13.1-0.20211018182449-f2416a8b1483`
OS/Container environment:

```
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Container runtime: cri-o://1.21.4
```
Steps or code to reproduce the issue:
- Start a NATS cluster (3 replicas) with JetStream enabled. JetStream config:

  ```
  jetstream {
    max_mem: 64Mi
    store_dir: /data
    max_file: 10Gi
  }
  ```
- Start pushing messages into the stream. Stream config:

  ```
  Configuration:
    Subjects:             widget-request-collector
    Acknowledgements:     true
    Retention:            File - WorkQueue
    Replicas:             3
    Discard Policy:       Old
    Duplicate Window:     2m0s
    Allows Msg Delete:    true
    Allows Purge:         true
    Allows Rollups:       false
    Maximum Messages:     unlimited
    Maximum Bytes:        1.9 GiB
    Maximum Age:          1d0h0m0s
    Maximum Message Size: unlimited
    Maximum Consumers:    unlimited
  ```
- Shut down one of the NATS nodes for a while, and rate-limit (or shut down) the consumer so that messages accumulate in file storage.
- Wait until the storage reaches its maximum capacity (1.9 GiB).
- Bring the NATS node back up (do not bring up the consumer).
Expected result:
The outdated node should catch up and become current.
Actual result:
The outdated node tries to become current and receives messages from the stream leader, but it hits the container memory limit and is OOM-killed. It restarts and is OOM-killed again, in a loop.
```
Cluster Information:
  Name:    nats
  Leader:  promo-widget-collector-event-nats-2
  Replica: promo-widget-collector-event-nats-1, outdated, OFFLINE, seen 2m8s ago, 13,634 operations behind
  Replica: promo-widget-collector-event-nats-0, current, seen 0.00s ago

State:
  Messages:         2,695,412
  Bytes:            1.9 GiB
  FirstSeq:         3,957,219 @ 2021-12-06T14:04:00 UTC
  LastSeq:          6,652,630 @ 2021-12-06T15:09:36 UTC
  Active Consumers: 1
```
Crashed pod info:

```
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 06 Dec 2021 14:30:26 +0000
  Finished:     Mon, 06 Dec 2021 14:31:08 +0000
Ready:          False
Restart Count:  3
```
Is it possible to configure memory limits for nats-server to prevent this excessive memory usage?
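To my understanding, the limits in the `jetstream` block cap JetStream *storage*, not the Go process's total memory, so by themselves they do not bound the memory a replica uses while catching up. A sketch of the same config using the long-form option names from the NATS server docs (`max_memory_store`/`max_file_store`; the short forms used above should be equivalent aliases):

```
jetstream {
  store_dir: /data
  max_memory_store: 64Mi  # caps memory-backed stream storage, not process RSS
  max_file_store: 10Gi    # caps file-backed stream storage on disk
}
```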
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 51 (27 by maintainers)
Commits related to this issue
- Stabilize filstore to eliminate sporadic errPartialCache errors under certain situations. Related to #2732 The filestore would release a msgBlock lock while trying to load a cache block if it thought... — committed to nats-io/nats-server by derekcollison 3 years ago
@derekcollison Done. Mail from abi@
Could I have a recommended JetStream config for production? I have around 10k events per second. How many resources do I need? I need a durable queue. How do I set up 2 or more consumers?
We have merged the PR into main and cut a nightly build for synadia/nats-server.
If you could, please test and follow up here with any issues. Happy to re-open as needed.
I have made some progress and believe I know what is going on. The issue is very subtle and requires hand-tuned timing and locking within the server to trigger, but this is encouraging to me.
I am hoping to have a PR at some point tomorrow. Wanted to give folks an update.
@derekcollison I’m referencing my JS experience. I mentioned that I’m still using STAN for portions of my software just to provide additional info, in case it matters. NATS Streaming 0.23.2 has an embedded nats-server 2.6.5, so I enabled JS on it to simplify migration from STAN while it’s still supported.
It was a JS consumer that died after the ‘partial cache’ error, so I probably have the same issue as the OP.