nats-streaming-server: Segmentation violation in cluster mode and latest docker container
Hi guys!
I’ve just tested the latest version of NATS Streaming in cluster mode with 3 nodes and got a panic: runtime error: invalid memory address or nil pointer dereference.
My steps:
- Create 3 data directories (srv1, srv2, srv3).
- Put a simple nats.conf in each data directory (ports end with 1, 2 and 3 respectively):
port: 4231
http_port: 8231
store: "file"
dir: "/data"
cluster {
  host: '10.10.0.100'
  port: 5231
  bootstrap: true
  raft_logging: true
  routes = [
    nats-route://10.10.0.100:5232
    nats-route://10.10.0.100:5233
  ]
}
- Start 3 docker containers (command for the first node shown; the other two mount srv2/srv3 with their own configs):
docker container run -it --name nats1 -v ${PWD}/srv1:/data --network=host nats-streaming -c /data/nats.conf
- Do some simple tests to be sure everything is working.
- Start 3 consumers (with per-message printing commented out):
stan-sub -s nats://10.10.0.100:4231 -id 1 --qgroup qgtest -all test
stan-sub -s nats://10.10.0.100:4232 -id 2 --qgroup qgtest -all test
stan-sub -s nats://10.10.0.100:4233 -id 3 --qgroup qgtest -all test
- Start the data producer (a rough Go-client equivalent of these consumer/producer steps is sketched after this list):
stan-bench -s nats://10.10.0.100:4231 -np 1 -ns 0 -n 10000000 -ms 1000 -a test
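For reference, the same consumer/producer load can be generated directly from the Go streaming client. This is only a sketch mirroring the stan-sub/stan-bench flags above; the import path is the client library of that era, and the client ID and message count are illustrative:

```go
package main

import (
	"log"

	stan "github.com/nats-io/go-nats-streaming" // streaming client of that era (today: github.com/nats-io/stan.go)
)

func main() {
	// Cluster ID "test-cluster" matches the server logs; the URL is one of the three nodes.
	sc, err := stan.Connect("test-cluster", "repro-client", stan.NatsURL("nats://10.10.0.100:4231"))
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// Queue subscriber delivering everything available (stan-sub --qgroup qgtest -all test),
	// with per-message printing left out, as in the repro.
	if _, err := sc.QueueSubscribe("test", "qgtest", func(m *stan.Msg) {}, stan.DeliverAllAvailable()); err != nil {
		log.Fatal(err)
	}

	// Steady stream of 1000-byte messages, roughly what stan-bench -ms 1000 produces.
	payload := make([]byte, 1000)
	for i := 0; i < 10000000; i++ {
		if err := sc.Publish("test", payload); err != nil {
			log.Printf("publish error: %v", err)
			break
		}
	}
}
```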
After several minutes, the second and third containers started printing errors:
...
[1] 2018/02/14 20:32:31.100452 [ERR] STREAM: [Client:benchmark-pub-1] Error replicating message for subject "_STAN.pub.test-cluster.test": timed out enqueuing operation
[1] 2018/02/14 20:32:31.100464 [ERR] STREAM: [Client:benchmark-pub-97] Error replicating message for subject "_STAN.pub.test-cluster.test": timed out enqueuing operation
[1] 2018/02/14 20:32:31.100475 [ERR] STREAM: [Client:benchmark-pub-94] Error replicating message for subject "_STAN.pub.test-cluster.test": timed out enqueuing operation
[1] 2018/02/14 20:32:31.100485 [ERR] STREAM: [Client:benchmark-pub-97] Error replicating message for subject "_STAN.pub.test-cluster.test": timed out enqueuing operation
...
and then:
[1] 2018/02/14 20:32:27.598057 [ERR] STREAM: [Client:benchmark-pub-85] Error replicating message for subject "_STAN.pub.test-cluster.test": timed out enqueuing operation
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x87e359]
goroutine 70 [running]:
github.com/nats-io/nats-streaming-server/stores.(*FileMsgStore).removeFirstMsg(0xc42009a340, 0x0, 0xc42ec9af00)
/go/src/github.com/nats-io/nats-streaming-server/stores/filestore.go:2647 +0x69
github.com/nats-io/nats-streaming-server/stores.(*FileMsgStore).enforceLimits(0xc42009a340, 0x1e0001, 0xc4531b12c0, 0xc447f7b801)
/go/src/github.com/nats-io/nats-streaming-server/stores/filestore.go:2589 +0xa2
github.com/nats-io/nats-streaming-server/stores.(*FileMsgStore).Store(0xc42009a340, 0xc4531b12c0, 0x0, 0x0, 0x0)
/go/src/github.com/nats-io/nats-streaming-server/stores/filestore.go:2482 +0xb99
github.com/nats-io/nats-streaming-server/server.(*StanServer).Apply(0xc4201ea000, 0xc450f849a0, 0xdfe320, 0x3)
/go/src/github.com/nats-io/nats-streaming-server/server/server.go:4498 +0x176
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc44b13b1e0)
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/fsm.go:57 +0x17b
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc4200e1340)
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/fsm.go:120 +0x31e
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*Raft).(github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.runFSM)-fm()
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/api.go:506 +0x2a
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc4200e1340, 0xc420180500)
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/state.go:144 +0x66
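The addr=0x10 in the signal line is the usual signature of reading a struct field through a nil pointer: the faulting address is the field's offset from a nil base. A minimal, self-contained illustration of that pattern (deliberately not the filestore code, just hypothetical names):

```go
package main

// Hypothetical struct: on 64-bit, the third field sits at offset 16 (0x10).
type fileSlice struct {
	firstSeq uint64 // offset 0x00
	lastSeq  uint64 // offset 0x08
	msgCount int64  // offset 0x10
}

func (s *fileSlice) drop() int64 {
	// Reading s.msgCount when s is nil faults at address 0x10, producing the
	// same "nil pointer dereference ... addr=0x10" panic as in the trace above.
	return s.msgCount
}

func main() {
	var s *fileSlice // e.g. an entry that was never created or was already removed
	_ = s.drop()
}
```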
After restarting the failed containers:
$ docker container start -ai nats2
[1] 2018/02/14 20:34:30.879042 [INF] STREAM: Starting nats-streaming-server[test-cluster] version 0.8.0-beta
[1] 2018/02/14 20:34:30.879082 [INF] STREAM: ServerID: PIWf5lutLWcVp9fYVr2WFY
[1] 2018/02/14 20:34:30.879087 [INF] STREAM: Go version: go1.9.3
[1] 2018/02/14 20:34:30.879323 [INF] Starting nats-server version 1.0.4
[1] 2018/02/14 20:34:30.879554 [INF] Starting http monitor on 0.0.0.0:8232
[1] 2018/02/14 20:34:30.879671 [INF] Listening for client connections on 0.0.0.0:4232
[1] 2018/02/14 20:34:30.879685 [INF] Server is ready
[1] 2018/02/14 20:34:30.880552 [INF] Listening for route connections on 10.10.0.100:5232
[1] 2018/02/14 20:34:30.906597 [INF] STREAM: Recovering the state...
[1] 2018/02/14 20:34:31.373113 [INF] STREAM: Recovered 1 channel(s)
[1] 2018/02/14 20:34:33.298825 [INF] STREAM: raft: Initial configuration (index=1): [{Suffrage:Voter ID:ib2LAFTTccvyusMmcYgM9D Address:test-cluster.ib2LAFTTccvyusMmcYgM9D.stan}]
[1] 2018/02/14 20:34:33.298939 [INF] STREAM: raft: Node at test-cluster.ib2LAFTTccvyusMmcYgM9D.stan [Follower] entering Follower state (Leader: "")
[1] 2018/02/14 20:34:33.300337 [INF] STREAM: Message store is RAFT_FILE
[1] 2018/02/14 20:34:33.300415 [INF] STREAM: ---------- Store Limits ----------
[1] 2018/02/14 20:34:33.300425 [INF] STREAM: Channels: 100 *
[1] 2018/02/14 20:34:33.300431 [INF] STREAM: --------- Channels Limits --------
[1] 2018/02/14 20:34:33.300436 [INF] STREAM: Subscriptions: 1000 *
[1] 2018/02/14 20:34:33.300443 [INF] STREAM: Messages : 1000000 *
[1] 2018/02/14 20:34:33.300449 [INF] STREAM: Bytes : 976.56 MB *
[1] 2018/02/14 20:34:33.300456 [INF] STREAM: Age : unlimited *
[1] 2018/02/14 20:34:33.300464 [INF] STREAM: ----------------------------------
[1] 2018/02/14 20:34:35.169644 [INF] STREAM: raft: Heartbeat timeout from "" reached, starting election
[1] 2018/02/14 20:34:35.169694 [INF] STREAM: raft: Node at test-cluster.ib2LAFTTccvyusMmcYgM9D.stan [Candidate] entering Candidate state in term 3
[1] 2018/02/14 20:34:35.169961 [INF] STREAM: raft: Election won. Tally: 1
[1] 2018/02/14 20:34:35.169977 [INF] STREAM: raft: Node at test-cluster.ib2LAFTTccvyusMmcYgM9D.stan [Leader] entering Leader state
[1] 2018/02/14 20:34:35.170079 [INF] STREAM: server became leader, performing leader promotion actions
[1] 2018/02/14 20:34:38.276052 [INF] STREAM: WARNING: Reached limits for store "test" (msgs=963311/1000000 bytes=976.56 MB/976.56 MB), dropping old messages to make room for new ones
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x87e359]
goroutine 92 [running]:
github.com/nats-io/nats-streaming-server/stores.(*FileMsgStore).removeFirstMsg(0xc42023a000, 0x0, 0xc4203e1600)
/go/src/github.com/nats-io/nats-streaming-server/stores/filestore.go:2647 +0x69
github.com/nats-io/nats-streaming-server/stores.(*FileMsgStore).enforceLimits(0xc42023a000, 0x1e0001, 0xc42172aa50, 0xc422016a01)
/go/src/github.com/nats-io/nats-streaming-server/stores/filestore.go:2589 +0xa2
github.com/nats-io/nats-streaming-server/stores.(*FileMsgStore).Store(0xc42023a000, 0xc42172aa50, 0x0, 0x0, 0x0)
/go/src/github.com/nats-io/nats-streaming-server/stores/filestore.go:2482 +0xb99
github.com/nats-io/nats-streaming-server/server.(*StanServer).Apply(0xc4201ee000, 0xc4203354a0, 0xdfe320, 0xbe99453790618e89)
/go/src/github.com/nats-io/nats-streaming-server/server/server.go:4498 +0x176
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*Raft).runFSM.func1(0xc4202ec2d0)
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/fsm.go:57 +0x17b
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*Raft).runFSM(0xc4200d0000)
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/fsm.go:120 +0x31e
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*Raft).(github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.runFSM)-fm()
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/api.go:506 +0x2a
github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*raftState).goFunc.func1(0xc4200d0000, 0xc4202ec150)
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/state.go:146 +0x53
created by github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft.(*raftState).goFunc
/go/src/github.com/nats-io/nats-streaming-server/vendor/github.com/hashicorp/raft/state.go:144 +0x66
About this issue
- State: closed
- Created 6 years ago
- Comments: 21 (13 by maintainers)
Commits related to this issue
- [FIXED] Server in cluster mode may Panic due to raft timeout - We were using a timeout to enqueue raft.Apply() commands, while we should really not (this could lead to an inconsistent state betw... — committed to nats-io/nats-streaming-server by kozlovic 6 years ago
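A rough sketch of the pattern that commit describes (a hypothetical wrapper, not the actual server code): hashicorp/raft's Apply takes an enqueue timeout, and when the leader's apply queue is backed up the call can return ErrEnqueueTimeout even though the entry may still be committed and applied later, leaving the caller's view and the replicated FSM state out of sync.

```go
package replication

import "github.com/hashicorp/raft"

// replicate is a hypothetical helper showing the change the commit message
// describes: do not bound the enqueue of raft.Apply() with a timeout.
func replicate(r *raft.Raft, data []byte) error {
	// Before the fix (problematic): f := r.Apply(data, 5*time.Second) could
	// fail with raft.ErrEnqueueTimeout while the command is still replicated,
	// so the server's state and the raft log could diverge.

	// After the fix: a zero timeout means "wait until the command is enqueued".
	f := r.Apply(data, 0)
	return f.Error() // blocks until the command has been applied to the FSM (or an error occurs)
}
```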
@sega-yarkin Just an update that I am still working on this issue. I was not able to reproduce it, but I did find another possible panic, for which I have a PR.
I could add defensive code to prevent the panic that you have reported, I think, but I would prefer to be able to understand how this can happen in the first place. I believed that the fix in #477 should have solved the problem. I will continue the investigation.
Thanks for your patience!
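For what it's worth, such a defensive check might look roughly like this (illustrative names only, not the actual FileMsgStore implementation in stores/filestore.go): bail out when the record describing the first message is missing instead of dereferencing nil.

```go
package stores

// Hypothetical, simplified types for illustration.
type msgRecord struct {
	data []byte
}

type fileMsgStore struct {
	first      uint64
	totalBytes uint64
}

// removeFirstMsg (defensive variant): if the record for the first message is
// unexpectedly absent, skip the removal instead of panicking on a nil pointer.
func (ms *fileMsgStore) removeFirstMsg(rec *msgRecord) {
	if rec == nil {
		return
	}
	ms.totalBytes -= uint64(len(rec.data))
	ms.first++
}
```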
@sega-yarkin Just an update that I have made some progress and may have a PR sometime tomorrow.