nats-server: Consumer stopped working after errPartialCache (nats-server oom-killed)
Defect
Make sure that these boxes are checked before submitting your issue – thank you!
- [x] Included `nats-server -DV` output
- [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

Versions of `nats-server` and affected client libraries used:
```
# nats-server -DV
[92] 2021/12/06 15:16:05.235349 [INF] Starting nats-server
[92] 2021/12/06 15:16:05.235397 [INF] Version: 2.6.6
[92] 2021/12/06 15:16:05.235401 [INF] Git: [878afad]
[92] 2021/12/06 15:16:05.235406 [DBG] Go build: go1.16.10
[92] 2021/12/06 15:16:05.235416 [INF] Name: NASX72BQAFBIH4QBLZ36RADTPKSO6LCKRDEAS37XRJ7SYZ53RYYOFHHS
[92] 2021/12/06 15:16:05.235436 [INF] ID: NASX72BQAFBIH4QBLZ36RADTPKSO6LCKRDEAS37XRJ7SYZ53RYYOFHHS
[92] 2021/12/06 15:16:05.235457 [DBG] Created system account: "$SYS"
```
```
Image: nats:2.6.6-alpine
Limits:
  cpu:     200m
  memory:  256Mi
Requests:
  cpu:     200m
  memory:  256Mi
```
Go client library: `github.com/nats-io/nats.go v1.13.1-0.20211018182449-f2416a8b1483`
OS/Container environment:

```
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Container runtime: cri-o://1.21.4
```
Steps or code to reproduce the issue:
- Start a NATS cluster (3 replicas) with JetStream enabled. JetStream config:

  ```
  jetstream {
    max_mem: 64Mi
    store_dir: /data
    max_file: 10Gi
  }
  ```
- Start pushing messages into the stream. Stream config:

  ```
  Configuration:
    Subjects:             widget-request-collector
    Acknowledgements:     true
    Retention:            File - WorkQueue
    Replicas:             3
    Discard Policy:       Old
    Duplicate Window:     2m0s
    Allows Msg Delete:    true
    Allows Purge:         true
    Allows Rollups:       false
    Maximum Messages:     unlimited
    Maximum Bytes:        1.9 GiB
    Maximum Age:          1d0h0m0s
    Maximum Message Size: unlimited
    Maximum Consumers:    unlimited
  ```
- Shut down one of the NATS nodes for a while, and rate-limit (or shut down) the consumer so that messages accumulate in file storage.
- Wait until the storage reaches its maximum capacity (1.9 GiB).
- Bring the NATS node back up (do not bring up the consumer).
Expected result:
The outdated node should catch up and become current.
Actual result:
The outdated node tries to become current and receives messages from the stream leader, but it hits the container memory limit and is OOM-killed. It restarts and is OOM-killed again, in a loop.
```
Cluster Information:
  Name:    nats
  Leader:  promo-widget-collector-event-nats-2
  Replica: promo-widget-collector-event-nats-1, outdated, OFFLINE, seen 2m8s ago, 13,634 operations behind
  Replica: promo-widget-collector-event-nats-0, current, seen 0.00s ago

State:
  Messages:         2,695,412
  Bytes:            1.9 GiB
  FirstSeq:         3,957,219 @ 2021-12-06T14:04:00 UTC
  LastSeq:          6,652,630 @ 2021-12-06T15:09:36 UTC
  Active Consumers: 1
```
Crashed pod info:

```
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 06 Dec 2021 14:30:26 +0000
  Finished:     Mon, 06 Dec 2021 14:31:08 +0000
Ready:          False
Restart Count:  3
```
Is it possible to configure memory limits for nats-server to prevent this excessive memory usage?
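To my understanding, the limits in the `jetstream` block cap JetStream *storage*, not the Go process's total memory, so by themselves they do not bound the memory a replica uses while catching up. A sketch of the same config using the long-form option names from the NATS server docs (`max_memory_store`/`max_file_store`; the short forms used above should be equivalent aliases):

```
jetstream {
  store_dir: /data
  max_memory_store: 64Mi  # caps memory-backed stream storage, not process RSS
  max_file_store: 10Gi    # caps file-backed stream storage on disk
}
```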
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 51 (27 by maintainers)
Commits related to this issue
- Stabilize filstore to eliminate sporadic errPartialCache errors under certain situations. Related to #2732 The filestore would release a msgBlock lock while trying to load a cache block if it thought... — committed to nats-io/nats-server by derekcollison 3 years ago
@derekcollison Done. Mail from abi@
Could I have a recommended JetStream config for production? I have around 10k events per second. How many resources do I need? I need a durable queue. How do I set up 2 or more consumers?
We have merged the PR into main and cut a nightly build for synadia/nats-server.
If you could, please test and follow up here with any issues. Happy to re-open as needed.
I have made some progress and believe I know what is going on. The issue is very subtle and requires hand-tuned timing and locking within the server to trigger, but this is encouraging to me.
I am hoping to have a PR at some point tomorrow. Wanted to give folks an update.
@derekcollison I’m referencing my JS experience. I mentioned that I’m still using STAN for portions of my software just to provide additional info, in case it matters. NATS Streaming 0.23.2 has an embedded nats-server 2.6.5, so I enabled JS on it to simplify migration from STAN while it’s still supported.
It was a JS consumer that died after the ‘partial cache’ error, so I probably have the same issue as the OP.