Versions of nats-server and affected client libraries used:
- nats-server 2.7.4
- nats helm chart 0.14.2
OS/Container environment:
Steps or code to reproduce the issue:
- Existing NATS 2.5.0 cluster (5 instances) with JetStream enabled.
- Configure podManagementPolicy: OrderedReady (see https://github.com/nats-io/k8s/tree/main/helm/charts/nats#breaking-change-log) for compatibility with the old Helm chart deployment; a sketch of the values override follows the log below.
- Execute a rolling upgrade to 2.7.4. After the ordinal 4 and ordinal 3 pods are upgraded, the nats-server processes on ordinals 2, 1, and 0 (the three pods still on 2.5.0) panic with the error logs below and the upgrade cannot continue. With only two servers alive, the NATS cluster cannot form a JetStream meta group (a five-node Raft group needs at least three members for quorum).
```
[1652] 2022/03/21 02:49:11.632347 [INF] Listening for route connections on 0.0.0.0:6222
[1652] 2022/03/21 02:49:11.633565 [INF] …:6222 - rid:225 - Route connection created
[1652] 2022/03/21 02:49:11.633658 [INF] …:6222 - rid:226 - Route connection created
[1652] 2022/03/21 02:49:11.633736 [INF] …:6222 - rid:227 - Route connection created
[1652] 2022/03/21 02:49:11.633760 [INF] …:6222 - rid:228 - Route connection created

panic: JetStream Cluster Unknown group entry op type! 12

goroutine 284 [running]:
github.com/nats-io/nats-server/server.(*jetStream).applyConsumerEntries(0xc0002d4000, 0xc000445c00, 0xc001ef6940, 0x0, 0x5, 0x1)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2948 +0x5e7
github.com/nats-io/nats-server/server.(*jetStream).monitorConsumer(0xc0002d4000, 0xc000445c00, 0xc000cbb0e0)
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2874 +0x65f
github.com/nats-io/nats-server/server.(*jetStream).processClusterCreateConsumer.func1()
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/jetstream_cluster.go:2691 +0x3c
created by github.com/nats-io/nats-server/server.(*Server).startGoRoutine
    /home/travis/gopath/src/github.com/nats-io/nats-server/server/server.go:2867 +0xc5
```
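For reference, here is a minimal sketch of the Helm override used in the configuration step above, assuming the top-level `podManagementPolicy` key described in the chart's breaking-change log; the release name, repo alias, and namespace are placeholders:

```yaml
# values.yaml (sketch) -- key name taken from the chart's breaking-change log;
# verify it against the values.yaml shipped with the chart version you deploy.
podManagementPolicy: OrderedReady

# Rolled out with something like (placeholder release/repo/namespace names):
#   helm upgrade my-nats nats/nats --version 0.14.2 -n nats -f values.yaml
```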
@derekcollison thanks for your help! Just to lighten the mood a bit: I think it’s really great that you take the time to try and understand what is most likely something stupid we did on our end. I really like nats-server as a product, and again, thanks for your assistance so far. 😃 I agree that the product is complicated, and it doesn’t help that the problem manifests itself in several different and confusing ways. 😃
Before I get back to you regarding a meeting, I just have to talk to some colleagues on Monday. I’ll get back to you ASAP so we can try and set up a meeting. I’ll also try one more time to make a local repro of the problem.
Unfortunately, I don’t have the logs from the other servers when they panicked. It’s hard to provoke that particular manifestation of the issue, and every time the dev cluster is down I disrupt the entire team of developers that are using the cluster for their test environments. I try to avoid those disruptions as much as I can. If I had a local repro of the problem it would be much faster/easier to debug it and give you the steps to repro the problem by yourself.
I’ll try to contact you on Monday. Thanks again!
Servers wait for some time before declaring another server down. This allows for temporary disruptions, etc.
If a server is shut down gracefully, it signals to the others that it is going offline, and that should be reported immediately.
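In a Kubernetes deployment that graceful signal is typically wired up as a `preStop` hook that puts the server into lame duck mode before the pod is terminated. A minimal sketch, assuming a pid file at a path like the one commonly used by the chart (verify against your rendered StatefulSet):

```yaml
# Sketch of a container lifecycle hook: on pod termination, ask nats-server to
# enter lame duck mode so peers learn it is going offline instead of waiting
# for a timeout. The pid-file path below is an assumption, not a chart guarantee.
lifecycle:
  preStop:
    exec:
      command:
        - nats-server
        - -sl
        - ldm=/var/run/nats/nats.pid
```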
Can you share what is being printed out when the servers panic? And double-checking: you are running the current server, 2.9.11, yes?
Unclear at this time; we would need a formal triage session over Zoom or something similar to diagnose this properly.
Unclear, please share what is printed out when the others panic.
This is something we have not seen, so we would want to schedule a support call, do a Zoom, and get access to more of the internals of your system to diagnose it.
Agree.
Again, agreed. We are on the same page, but we need much more information to help out. These are very complicated systems under the covers, even if they appear simple from the outside.
Yes, send me an email and I will connect you with the right folks, but let’s get a Zoom meeting on the calendar regardless, ASAP (derek at synadia dot com).
We believe this has been greatly improved in the latest versions of the server (2.9.10 / 2.9.11) and the newest operator.
Feel free to re-open if needed, however.