nats-server: Can't keep JetStream cluster working when executing a rolling upgrade from 2.5.0 to 2.7.4

Versions of nats-server and affected client libraries used:

  • nats-server 2.7.4
  • nats helm chart 0.14.2

OS/Container environment:

  • K8s

Steps or code to reproduce the issue:

  1. Existing NATS 2.5.0 cluster (5 instances) with JetStream enabled
  2. Configure podManagementPolicy: OrderedReady ( https://github.com/nats-io/k8s/tree/main/helm/charts/nats#breaking-change-log ) to stay compatible with the old Helm chart deployment
  3. Execute a rolling upgrade to 2.7.4. After the ordinal 4 and ordinal 3 pods are upgraded, the ordinal 2, 1, and 0 pods (the three pods still on the old version 2.5.0) panic with the following error logs and the upgrade cannot continue. The NATS cluster can't form a JetStream meta group because only two servers remain alive. (A replica-health check sketch follows the logs below.)
[1652] 2022/03/21 02:49:11.632347 [INF] Listening for route connections on 0.0.0.0:6222
[1652] 2022/03/21 02:49:11.633565 [INF] …:6222 - rid:225 - Route connection created
[1652] 2022/03/21 02:49:11.633658 [INF] …:6222 - rid:226 - Route connection created
[1652] 2022/03/21 02:49:11.633736 [INF] …:6222 - rid:227 - Route connection created
[1652] 2022/03/21 02:49:11.633760 [INF] …:6222 - rid:228 - Route connection created
panic: JetStream Cluster Unknown group entry op type! 12
goroutine 284 [running]:
github.com/nats-io/nats-server/server.(*jetStream).applyConsumerEntries(0xc0002d4000, 0xc000445c00, 0xc001ef6940, 0x0, 0x5, 0x1)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2948 +0x5e7
github.com/nats-io/nats-server/server.(*jetStream).monitorConsumer(0xc0002d4000, 0xc000445c00, 0xc000cbb0e0)
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2874 +0x65f
github.com/nats-io/nats-server/server.(*jetStream).processClusterCreateConsumer.func1()
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2691 +0x3c
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
        /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2867 +0xc5
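
For anyone planning a similar rolling upgrade, the sketch below is a minimal pre-flight check, not part of the original report: it assumes the nats.go client and a placeholder cluster URL, and verifies that every stream's cluster group has a leader and fully current replicas before the next pod is replaced.

package main

import (
        "fmt"
        "log"

        "github.com/nats-io/nats.go"
)

func main() {
        // The URL is a placeholder for your own deployment; any connection
        // with access to the streams' account will do.
        nc, err := nats.Connect("nats://nats.nats.svc:4222")
        if err != nil {
                log.Fatalf("connect: %v", err)
        }
        defer nc.Drain()

        js, err := nc.JetStream()
        if err != nil {
                log.Fatalf("jetstream context: %v", err)
        }

        healthy := true
        // Walk every stream and inspect its cluster (Raft group) state.
        for si := range js.StreamsInfo() {
                ci := si.Cluster
                if ci == nil || ci.Leader == "" {
                        fmt.Printf("stream %q has no elected leader\n", si.Config.Name)
                        healthy = false
                        continue
                }
                for _, peer := range ci.Replicas {
                        if peer.Offline || !peer.Current {
                                fmt.Printf("stream %q replica %q: offline=%v current=%v lag=%d\n",
                                        si.Config.Name, peer.Name, peer.Offline, peer.Current, peer.Lag)
                                healthy = false
                        }
                }
        }
        if healthy {
                fmt.Println("all stream groups have a leader and current replicas; safe to continue")
        }
}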

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 26 (8 by maintainers)

Most upvoted comments

@derekcollison thanks for your help! Just to lighten the mood a bit, I think it’s really great that you take the time to try and understand what is most likely something stupid we did on our end. I really like nats-server as a product, and again thanks for your assistance so far. 😃 I agree that the product is complicated, and it doesn’t help that the way the problem manifests itself is very confusing, and in several different ways. 😃

Before I get back to you regarding a meeting I just have to talk to some colleagues on Monday. I’ll get back to you ASAP, so we can try and set up a meeting. I’ll also try one more time to make a local repro of the problem.

Unfortunately, I don’t have the logs from the other servers when they panicked. It’s hard to provoke that particular manifestation of the issue, and every time the dev cluster is down I disrupt the entire team of developers that are using the cluster for their test environments. I try to avoid those disruptions as much as I can. If I had a local repro of the problem it would be much faster/easier to debug it and give you the steps to repro the problem by yourself.

I’ll try to contact you on Monday. Thanks again!

Yes, it doesn’t work. The only thing that helps is retrying the nats-server process start, and even that only works sometimes.

Another thing I noticed when running nats server report jetstream after I shut down one of the nodes: the node is not listed as offline, just as not current. Why doesn’t Raft realize that the server is offline?

Servers wait for some time to declare a server down. This allows for temporary disruptions etc.

If the server is shut down gracefully, it will signal to the others that it is going offline and that should be reported immediately.
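
For reference, those per-peer flags are also exposed through the client API. A small sketch follows, not from the thread; it assumes the nats.go client and a hypothetical stream named ORDERS, and prints each replica’s Current, Offline, and Active values that the report output is derived from.

package main

import (
        "fmt"
        "log"

        "github.com/nats-io/nats.go"
)

func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
                log.Fatalf("connect: %v", err)
        }
        defer nc.Close()

        js, err := nc.JetStream()
        if err != nil {
                log.Fatalf("jetstream context: %v", err)
        }

        si, err := js.StreamInfo("ORDERS") // hypothetical stream name
        if err != nil {
                log.Fatalf("stream info: %v", err)
        }

        fmt.Printf("leader: %s\n", si.Cluster.Leader)
        for _, p := range si.Cluster.Replicas {
                // Offline only flips once the peer is declared down or announces a
                // graceful shutdown; until then a dead peer shows current=false with
                // a growing Active (time since last seen) value.
                fmt.Printf("replica %-12s current=%-5v offline=%-5v active=%v lag=%d\n",
                        p.Name, p.Current, p.Offline, p.Active, p.Lag)
        }
}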

Another thing I saw was that the other nodes in the cluster sometimes panic after I try to restart the node I stopped myself. That is very bad, as trying to recover a semi-functional cluster can bring it down entirely! 😦

Can you share what is printed out when the servers panic? And double-checking: you are running the current server, 2.9.11, yes?

Questions I worry and wonder about:

  • Why does it sometimes work to retry starting the process?

Unclear at this time; we would need to do a formal triage over Zoom or something similar.

  • Why does it only happen after a shutdown of a node in the cluster, and not during normal operation?

Unclear, please share what is printed out when the others panic.

  • If some state is indeed corrupt, how does that happen? Note that it happens consistently in our dev cluster even when I totally repave the cluster (i.e. stop all nodes, delete data folders, restart all nodes). I’m unsure if our acceptance and production clusters experience the same problems.

This is something we have not seen, so we would want to get a support call scheduled and do a Zoom and get access to more of the internals of your system to diagnose.

  • It seems wrong that a consumer should be able to corrupt state in a way that prevents the cluster from recovering.

Agree.

  • It seems very bad that a startup of a node in a cluster is able to panic the other servers in said cluster.

Again agree. On same page, but we need much more information to help out. These are very complicated systems under the covers, even if they appear simple from the outside.

Is it possible to pay for direct support from you guys? I really feel I’m in over my head. As our production system currently relies on our NATS cluster, I’m very worried that I will make things more unstable as I try to fix this issue. I realize it could be some stupid thing I did, or we have a bad host somehow, but I don’t really know what to look for.

Yes, send me an email and I will connect you with the right folks, but let’s get a Zoom meeting scheduled regardless ASAP. (derek at synadia dot com)

We believe this has been greatly improved in the latest versions of the server, 2.9.10 / 2.9.11, and the newest operator.

Feel free to re-open however if needed.