etcd: 3.3.7 panic: send on closed channel

Hi, I got a panic on 3.3.7. It happened today and 10 days ago. Version:

 ./etcd --version
etcd Version: 3.3.7
Git SHA: 56536de55
Go Version: go1.9.6
Go OS/Arch: linux/amd64

cmdline:

/opt/etcd/etcd --name xxx --discovery-srv xxx --initial-advertise-peer-urls=https://xxx:2380 --initial-cluster-token xxx --initial-cluster-state new --advertise-client-urls=https://xxxx --listen-client-urls=https://xxx:2379 --listen-peer-urls=https://xxx:2380 --cert-file=/etc/certificates/etcd/server.crt --key-file=/etc/certificates/etcd/server.key --trusted-ca-file=/etc/certificates/etcd/ca.crt --client-cert-auth --peer-cert-file=/etc/certificates/etcd/cluster.crt --peer-key-file=/etc/certificates/etcd/cluster.key --peer-trusted-ca-file=/etc/certificates/etcd/ca.crt --auto-compaction-retention=2 --proxy off

stacktrace:

panic: send on closed channel
goroutine 167547782 [running]:
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).do(0xc4258a4ba0, 0xc4202b0960, 0x411648, 0x10)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:169 +0x115
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).Write(0xc4258a4ba0, 0xc420507900, 0xc424ed134a, 0x5, 0x5, 0xc4201b5650, 0x2b, 0x2b, 0xc424ed1350, 0x5, ...)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go:255 +0xe6
github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.(*serverStream).SendMsg(0xc422327d80, 0xf4d7a0, 0xc42497baa0, 0x0, 0x0)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/stream.go:611 +0x275
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*serverStreamWithCtx).SendMsg(0xc422a85e90, 0xf4d7a0, 0xc42497baa0, 0xc42000d2b8, 0xc4204b2df8)
    <autogenerated>:1 +0x50
github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc422b1dcc0, 0xf4d7a0, 0xc42497baa0, 0x160b820, 0xc425e6f4c0)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/grpc-ecosystem/go-grpc-prometheus/server_metrics.go:179 +0x4b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*leaseLeaseKeepAliveServer).Send(0xc423b2b640, 0xc42497baa0, 0xc425e6f4c0, 0xc8a648e8a853ccb)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go:3687 +0x49
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*LeaseServer).leaseKeepAlive(0xc4244d0580, 0x16110c0, 0xc423b2b640, 0xc4258a4ba0, 0xc420507900)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:138 +0x18d
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*LeaseServer).LeaseKeepAlive.func1(0xc4258a4c60, 0xc4244d0580, 0x16110c0, 0xc423b2b640)
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:89 +0x3f
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc.(*LeaseServer).LeaseKeepAlive
    /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go:88 +0x91
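
Reading the trace, the send that panics is at handler_server.go:169 in the vendored gRPC handler transport, inside the LeaseKeepAlive stream's write path: a keepalive response is sent on an internal channel that the transport's teardown path has already closed. A minimal, self-contained sketch of that failure mode (illustration only, not etcd or grpc-go code):

package main

import "time"

// Running this crashes with "panic: send on closed channel", the same class
// of bug as the trace above: a producer keeps sending on a channel that a
// teardown path closes out from under it.
func main() {
	writes := make(chan func())
	done := make(chan struct{})

	// Consumer: drains queued writes until teardown, then closes the channel
	// while the producer may still be trying to send on it.
	go func() {
		for {
			select {
			case fn := <-writes:
				fn()
			case <-done:
				close(writes) // racy teardown
				return
			}
		}
	}()

	// Simulate the stream ending mid-flight.
	go func() {
		time.Sleep(5 * time.Millisecond)
		close(done)
	}()

	// Producer: keeps queueing "keepalive responses", unaware that the
	// channel can be closed under it.
	for {
		writes <- func() {} // eventually panics: send on closed channel
	}
}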

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 31 (16 by maintainers)

Most upvoted comments

Going to review and move this forward for a 3.3 backport.

@gyuho @hexfusion

https://github.com/grpc/grpc-go/pull/2695 is merged.

We should update etcd's gRPC dependency to fix this bug after grpc 1.12 is released.
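
For anyone following along, here is a hedged sketch of the kind of guard such a transport fix typically introduces (an illustration, not the literal grpc-go change): the write path selects on a closed-signal channel instead of doing an unconditional send, so writes against a torn-down stream surface as an error rather than a panic.

package main

import (
	"errors"
	"fmt"
)

var errStreamClosed = errors.New("transport: stream closed")

// transportLike is a hypothetical stand-in for the handler-based transport.
type transportLike struct {
	writes   chan func()   // work queued for the writer goroutine
	closedCh chan struct{} // closed exactly once when the stream is torn down
}

// do queues fn for the writer, or reports a clean error if the transport has
// already been closed, instead of blindly sending (and possibly panicking).
func (t *transportLike) do(fn func()) error {
	select {
	case t.writes <- fn:
		return nil
	case <-t.closedCh:
		return errStreamClosed
	}
}

func main() {
	t := &transportLike{
		writes:   make(chan func(), 1),
		closedCh: make(chan struct{}),
	}

	fmt.Println(t.do(func() {})) // <nil>: transport still open, write queued

	close(t.closedCh)            // stream torn down
	fmt.Println(t.do(func() {})) // "transport: stream closed" instead of a panic
}

The key property is that teardown only ever closes closedCh, and producers only ever receive from it, so the close can never race with a send the way it can on the write channel itself.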

Just so you're aware, I have set up 3 nodes with a single etcd cluster running across them. I have it running in a loop that powers off and powers on one of the hosts, in the hope of recreating the issue at a smaller scale. It has currently got through ~50 loops with no issues so far, but I will leave it running.

So in terms of reproduction, I think it's just a matter of issuing a hard shutdown, but it seems the window to hit the issue is small.

@mcginne could you try to recreate this in your testing environment at a small scale, such as a single cluster on a single node? Use something to randomly perform the hard shutdown and see if you capture the panic logging. If you can get that far it would be greatly appreciated and would go a long way toward getting to the bottom of this. Meanwhile I will take a look at the panic and see if @gyuho has any ideas.
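
For reference, a rough client-side sketch of a reproduction aid (the endpoints and certificate paths below are placeholders, not taken from this report): it repeatedly opens LeaseKeepAlive streams via clientv3, which is the same server path as in the stack trace. Running several copies against the cluster while one member is hard power-cycled should approximate the setup described above.

package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Client-certificate auth, to match the --client-cert-auth server flags.
	cert, err := tls.LoadX509KeyPair("/etc/certificates/etcd/client.crt", // placeholder paths
		"/etc/certificates/etcd/client.key")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := ioutil.ReadFile("/etc/certificates/etcd/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://node1:2379", "https://node2:2379", "https://node3:2379"}, // placeholders
		DialTimeout: 5 * time.Second,
		TLS: &tls.Config{
			Certificates: []tls.Certificate{cert},
			RootCAs:      pool,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for {
		// Each iteration holds a keepalive stream open for up to 30s, then
		// drops it and starts over.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		lease, err := cli.Grant(ctx, 5)
		if err != nil {
			log.Printf("grant: %v", err)
			cancel()
			time.Sleep(time.Second)
			continue
		}
		ka, err := cli.KeepAlive(ctx, lease.ID)
		if err != nil {
			log.Printf("keepalive: %v", err)
			cancel()
			continue
		}
		for range ka {
			// Drain keepalive responses until the context times out or the
			// stream is dropped (e.g. the member we were talking to died).
		}
		cancel()
	}
}

Combined with the power-off/power-on loop described above, this keeps keepalive streams in flight at the moment a member disappears, which seems to be when the small window mentioned earlier opens.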