nats-server: JetStream could not pull messages after nats-server restart

I was testing JetStream on nats-server v2.3.2. One sender and one receiver program had been running for quite a long time.

This is what my stream looks like:

	_, err = js.AddStream(&nats.StreamConfig{
		Name:      streamName,
		Subjects:  []string{streamSubjects},
		Storage:   nats.FileStorage,
		Replicas:  3,
		Retention: nats.WorkQueuePolicy,
		Discard:   nats.DiscardNew,
		MaxMsgs:   -1,
		MaxAge:    time.Hour * 24 * 365,
	})

This is how I create the consumer:

	if _, err := js.AddConsumer(streamName, &nats.ConsumerConfig{
		Durable:       durableName,
		DeliverPolicy: nats.DeliverAllPolicy,
		AckPolicy:     nats.AckExplicitPolicy,
		ReplayPolicy:  nats.ReplayInstantPolicy,
		FilterSubject: subjectName,
		AckWait:       time.Second * 30,
		MaxDeliver:    -1,
		MaxAckPending: 1000,
	}); err != nil && !strings.Contains(err.Error(), "already in use") {
		log.Println("AddConsumer fail")
		return
	}

This is what the subscriber looks like:

	sub, err := js.PullSubscribe("ORDERS.created", durableName, nats.Bind("ORDERS", durableName))
	if err != nil {
		fmt.Println(" PullSubscribe:", err)
		return
	}
	msgs, err := sub.Fetch(1000, nats.MaxWait(10*time.Second))

When I restart my nats-server cluster nodes (upgrading to nats-server 2.3.3), the consumer can no longer pull messages, even if I restart my consumer program. The Fetch call just returns "nats: timeout", but I'm sure there are lots of messages in the work queue. Only if I delete the consumer by calling js.DeleteConsumer(streamName, durableName) and recreate it can my program resume fetching messages. In fact, every time I restart the nats-server nodes, my consumer program encounters the same problem.
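
For illustration only, the delete-and-recreate workaround described above could be sketched roughly as follows. This is not code from the report: consumerAdmin, fakeAdmin, fetchOrRecreate and the durable name "worker" are hypothetical, and the interface is a simplified stand-in for the js.DeleteConsumer and js.AddConsumer calls (the real AddConsumer takes a *nats.ConsumerConfig). Note that a timeout also occurs when the stream is simply empty, so a real program would need a better signal before recreating a consumer.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimeout stands in for the client's "nats: timeout" error.
var errTimeout = errors.New("nats: timeout")

// consumerAdmin is a hypothetical, simplified view of the two management
// calls mentioned in the report; it is not the nats.go interface.
type consumerAdmin interface {
	DeleteConsumer(stream, durable string) error
	AddConsumer(stream, durable string) error
}

// fetchOrRecreate runs fetch once. If fetch times out, it deletes and
// recreates the durable consumer, then runs fetch one more time.
func fetchOrRecreate(admin consumerAdmin, stream, durable string, fetch func() (int, error)) (int, error) {
	n, err := fetch()
	if !errors.Is(err, errTimeout) {
		return n, err // success, or an error this workaround does not address
	}
	if err := admin.DeleteConsumer(stream, durable); err != nil {
		return 0, fmt.Errorf("delete consumer: %w", err)
	}
	if err := admin.AddConsumer(stream, durable); err != nil {
		return 0, fmt.Errorf("recreate consumer: %w", err)
	}
	return fetch()
}

// fakeAdmin counts calls so the sketch can run without a server.
type fakeAdmin struct{ deleted, added int }

func (f *fakeAdmin) DeleteConsumer(stream, durable string) error { f.deleted++; return nil }
func (f *fakeAdmin) AddConsumer(stream, durable string) error    { f.added++; return nil }

func main() {
	admin := &fakeAdmin{}
	// Simulated Fetch: times out until the consumer has been recreated.
	fetch := func() (int, error) {
		if admin.added == 0 {
			return 0, errTimeout
		}
		return 1000, nil
	}
	n, err := fetchOrRecreate(admin, "ORDERS", "worker", fetch)
	fmt.Println(n, err, admin.deleted, admin.added)
}
```

This only automates the manual step; it does not explain why the consumer stops delivering after a restart.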

There is another issue: after I restart the nats-server nodes and then restart my program, it sometimes reports "PullSubscribe: nats: JetStream system temporarily unavailable".

I expect that restarting nats-server nodes should not prevent JetStream clients from fetching messages.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 50 (29 by maintainers)

Most upvoted comments

Hi, I just downloaded the nats-server main branch, built the binary, and ran the node restart test.
Go client: v1.12.0. nats-server platforms: Windows and Linux, amd64. Everything runs well. The cluster can survive random node restarts on both Windows and Linux. Thanks for your work!

A resilient system should account for bad exits such as a process kill or an unexpected power-off; a node could exit at any time. It's hard to design distributed systems. Thanks for your work!

On September 2, 2021, at 06:16, Waldemar Quevedo @.***> wrote:

if i close the cmd.exe window by clicking the close button on the top-right,

@carr123 we were not handling this event properly and so this would have become a bad exit that hit another bug from the server when restarting. The behavior when stopping the server this way has been improved in the main branch 😃

Hi, Wally downloaded it, so I closed the download link. I have now reopened it, so it should be accessible: http://101.200.84.208/natstest.zip

On Windows 10, the servers run in CMD windows, and I stop them just by pressing CTRL+C. Changing the account password means I want to pick a better password for some account and then restart all servers. All operations are normal. Restarting the servers just once may not reproduce the issue. I tested it manually many times, and the issue eventually shows up. I mean, if you want to see it, repeated manual testing will lead you to it.

@carr123 @derekcollison The bug was introduced only 3 days ago and, again, it does not match the reported experience, in which restarting the application does not help unless @carr123 deletes the JS consumer on the server. (The bug would affect only a running application that reconnects, not an application that is restarted.)