nats-server: Consumers stop receiving messages
Defect
Versions of nats-server and affected client libraries used:
NATS server version:
[83] 2021/09/04 18:51:12.239432 [INF] Starting nats-server
[83] 2021/09/04 18:51:12.239488 [INF] Version: 2.4.0
[83] 2021/09/04 18:51:12.239494 [INF] Git: [219a7c98]
[83] 2021/09/04 18:51:12.239496 [DBG] Go build: go1.16.7
[83] 2021/09/04 18:51:12.239517 [INF] Name: NBVE7O7DMRAZ63STC7Z644KHF5HJ6QQUGLZVGDIKEG32CFL2J6O2456M
[83] 2021/09/04 18:51:12.239533 [INF] ID: NBVE7O7DMRAZ63STC7Z644KHF5HJ6QQUGLZVGDIKEG32CFL2J6O2456M
[83] 2021/09/04 18:51:12.239605 [DBG] Created system account: "$SYS"
Go client version: v1.12.0
OS/Container environment:
GKE (Kubernetes). Running a NATS JetStream HA cluster, deployed via the nats Helm chart.
Steps or code to reproduce the issue:
Stream configuration:
```yaml
apiVersion: jetstream.nats.io/v1beta1
kind: Stream
metadata:
  name: agent
spec:
  name: agent
  subjects: ["data.*"]
  storage: file
  maxAge: 1h
  replicas: 3
  retention: interest
```
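For comparison, here is a roughly equivalent stream definition created directly through the Go client. This is just a sketch for context; the connection URL and error handling are assumptions, not taken from this report.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the cluster (URL is an illustrative assumption).
	nc, err := nats.Connect("nats://nats-js:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the Stream CRD above: file storage, 1h max age,
	// 3 replicas, interest-based retention.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:      "agent",
		Subjects:  []string{"data.*"},
		Storage:   nats.FileStorage,
		MaxAge:    time.Hour,
		Replicas:  3,
		Retention: nats.InterestPolicy,
	}); err != nil {
		log.Fatal(err)
	}
}
```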
There are two consumers on this stream. Each is a queue subscriber running in one of two services, each deployed with 2 pod replicas. Note that I don't care if a message is not processed, which is why AckNone is set.
```go
// 2 pods for service A.
js.QueueSubscribe(
	"data.received",
	"service1_queue",
	func(msg *nats.Msg) {},
	nats.DeliverNew(),
	nats.AckNone(),
)

// 2 pods for service B.
s.js.QueueSubscribe(
	"data.received",
	"service2_queue",
	func(msg *nats.Msg) {},
	nats.DeliverNew(),
	nats.AckNone(),
)
```
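For completeness, a minimal self-contained version of one of these subscribers, including the connection setup the snippets above omit, might look like the sketch below (the URL and the handler body are assumptions):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect and obtain a JetStream context (URL is an illustrative assumption).
	nc, err := nats.Connect("nats://nats-js:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Queue subscriber as in the issue: new messages only, no acks.
	if _, err := js.QueueSubscribe(
		"data.received",
		"service1_queue",
		func(msg *nats.Msg) {
			log.Printf("received: %s", string(msg.Data))
		},
		nats.DeliverNew(),
		nats.AckNone(),
	); err != nil {
		log.Fatal(err)
	}

	// Block forever; in the real services this runs for the lifetime of the pod.
	select {}
}
```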
Expected result:
Consumer receives messages.
Actual result:
Stream stats after a few days:
agent │ File │ 3 │ 28,258 │ 18 MiB │ 0 │ 84 │ nats-js-0, nats-js-1*, nats-js-2
Consumers stats:
service1_queue │ Push │ None │ 0.00s │ 0 │ 0 │ 0 │ 60,756 │ nats-js-0, nats-js-1*, nats-js-2
service2_queue │ Push │ None │ 0.00s │ 0 │ 0 │ 8,193 / 28% │ 60,843 │ nats-js-0, nats-js-1*, nats-js-2
- None of the NATS server pods logs any error indicating a problem.
- The unprocessed message count for the second consumer stays the same and does not decrease.
- The only fix that helped was changing the second consumer's raft leader with `nats consumer cluster step-down`, but after some time the problem comes back.
- There are active connections to the server, checked with `nats server report connections`.
About this issue
- State: closed
- Created 3 years ago
- Comments: 38 (19 by maintainers)
Commits related to this issue
- Fix for issue #2488. When we triggered a filestore msg block compact we were not properly dealing with interior deletes. Subsequent lookups past the skipped messages would cause an error and stop del... — committed to nats-io/nats-server by derekcollison 3 years ago
- Merge pull request #2505 from nats-io/issue-2488-2 [FIXED] #2488 — committed to nats-io/nats-server by derekcollison 3 years ago
In my case that was not the issue. I tested with the latest Go client v1.12.1 at both the publisher and the subscriber, the server is running v2.4.0, and the issue was still happening.
One thing I saw is that the number of messages in the stream starts to increase (right now it's just 352, but it increments from time to time) while the consumer stops consuming them, leaving the queue subscribers starving for new messages. The messages are not delivered at all.
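One way to observe this from the client side is to compare the stream's message count with the consumer's pending count. A hedged sketch, reusing the stream and consumer names from this issue (the URL is an assumption):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats-js:4222") // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Messages currently held by the stream.
	si, err := js.StreamInfo("agent")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("stream messages: %d", si.State.Msgs)

	// Messages the consumer has not delivered yet; if this keeps growing
	// while the subscribers sit idle, the consumer has stalled.
	ci, err := js.ConsumerInfo("agent", "service1_queue")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("consumer pending: %d, delivered: %d", ci.NumPending, ci.Delivered.Consumer)
}
```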
Go client v1.12.1, server config: /usr/local/etc/nats-server-post.conf
It happened tonight, but a reset occurred and the server was transferred to another virtual machine by the provider.
Then it started, but without getting messages.
Maybe it happens when the JetStream file gets corrupted.
I hope this helps.
@derekcollison sorry again, but if you have a fast consumer (remove the sleep in the subscribe handler) the error is triggered again.
Versions of nats-server and affected client libraries used:
nats-server v2.5.0, nats-client v1.12.1
Steps or code to reproduce the issue:
This time I have a fast consumer connecting to a stream which already has ~50k messages. When the consumers start (15 this time), the receive rate is high (fast consumer with no sleep, plus a lot of messages to process), but after a while the receive rate drops to zero.
To reproduce: publish some messages to the stream without consuming them, then connect the consumers, as in the sketch below.
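A rough sketch of that reproduction, assuming the same stream and subject as above (the message count, URL, payloads, and queue name are illustrative):

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats-js:4222") // URL is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// 1. Pre-load the stream with ~50k messages before any consumer is attached.
	for i := 0; i < 50000; i++ {
		if _, err := js.Publish("data.received", []byte(fmt.Sprintf("msg-%d", i))); err != nil {
			log.Fatal(err)
		}
	}

	// 2. Start 15 fast queue subscribers (no sleep in the handler),
	//    consuming the whole backlog (default deliver-all policy).
	var received int64
	for i := 0; i < 15; i++ {
		if _, err := js.QueueSubscribe(
			"data.received",
			"service1_queue", // queue group name reused from the earlier example
			func(msg *nats.Msg) { atomic.AddInt64(&received, 1) },
			nats.AckNone(),
		); err != nil {
			log.Fatal(err)
		}
	}

	// 3. Watch the receive rate; in the report it drops to zero after a while.
	for {
		time.Sleep(5 * time.Second)
		log.Printf("received so far: %d", atomic.LoadInt64(&received))
	}
}
```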
Expected result:
Constant or nearly constant receive rate
Actual result:
In this case the receivers started to receive messages again after a while, but in most of my tests it never recovered.
Is it better to create a new issue?