zrepl: FreeBSD >= 12.2, < 13.0: Go runtime deadlocks and/or panics on multicore systems

Happened 2 weeks after upgrading my main server to FreeBSD 12.2. Made a goroutine dump of the receiving side. Most interesting stack:

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa6b4f6 0xa6f205 0xa71f3e 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44							/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64							/usr/local/go/src/sync/waitgroup.go:130
#	0xa6b4f5	github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).ReadStreamedMessage+0x3b5	/home/circleci/project/rpc/dataconn/stream/stream_conn.go:109
#	0xa6f204	github.com/zrepl/zrepl/rpc/dataconn.(*Client).recv+0x204			/home/circleci/project/rpc/dataconn/dataconn_client.go:87
#	0xa71f3d	github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv.func1+0x7d		/home/circleci/project/rpc/dataconn/dataconn_client.go:162

But:

  • no zfs process on the receiving side
  • other stacks show that the connection was probably still active (heartbeat goroutine existed)
  • sadly didn’t capture sockstat output

Maybe I’m hitting the same bug that @mdtancsa had/has on CURRENT. Let’s see whether it happens again at some point. This machine has hyperthreading disabled but does the (experimental) parallel replication.


goroutine profile: total 27
1 @ 0x40c2d4 0x46aefd 0xa9ae25 0x46e681
#	0x46aefc	os/signal.signal_recv+0x9c	/usr/local/go/src/runtime/sigqueue.go:147
#	0xa9ae24	os/signal.loop+0x24		/usr/local/go/src/os/signal/signal_unix.go:23

1 @ 0x4397a5 0x4065af 0x4061eb 0xa56178 0xa939ff 0xa93366 0xae22c2 0x46e681
#	0xa56177	github.com/zrepl/zrepl/replication/driver.Do.func2+0x77		/home/circleci/project/replication/driver/replication_driver.go:280
#	0xa939fe	github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do+0x37e	/home/circleci/project/daemon/job/active.go:467
#	0xa93365	github.com/zrepl/zrepl/daemon/job.(*ActiveSide).Run+0x425	/home/circleci/project/daemon/job/active.go:424
#	0xae22c1	github.com/zrepl/zrepl/daemon.(*jobs).start.func1+0x141		/home/circleci/project/daemon/daemon.go:248

1 @ 0x4397a5 0x4065af 0x4061eb 0xa57888 0x46e681
#	0xa57887	github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func1+0x47	/home/circleci/project/replication/driver/replication_stepqueue.go:85

1 @ 0x4397a5 0x4065af 0x4061eb 0xa7ccd8 0x46e681
#	0xa7ccd7	github.com/zrepl/zrepl/rpc.NewClient.func1.1+0x37	/home/circleci/project/rpc/rpc_client.go:60

1 @ 0x4397a5 0x4065af 0x4061eb 0xae1d14 0x46e681
#	0xae1d13	github.com/zrepl/zrepl/daemon.Run.func1+0x33	/home/circleci/project/daemon/daemon.go:37

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d81e5 0x4d81c3 0x53e80f 0x55214e 0x70b2a2 0x4f6f51 0x70b4f3 0x708315 0x70e5df 0x70e5ea 0x56a922 0x4d3627 0x8fb1c9 0x8fb17a 0x8fba45 0x918f52 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54					/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44					/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d81e4	internal/poll.(*pollDesc).waitRead+0x1a4				/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d81c2	internal/poll.(*FD).Read+0x182						/usr/local/go/src/internal/poll/fd_unix.go:159
#	0x53e80e	net.(*netFD).Read+0x4e							/usr/local/go/src/net/fd_posix.go:55
#	0x55214d	net.(*conn).Read+0x8d							/usr/local/go/src/net/net.go:182
#	0x70b2a1	crypto/tls.(*atLeastReader).Read+0x61					/usr/local/go/src/crypto/tls/conn.go:779
#	0x4f6f50	bytes.(*Buffer).ReadFrom+0xb0						/usr/local/go/src/bytes/buffer.go:204
#	0x70b4f2	crypto/tls.(*Conn).readFromUntil+0xf2					/usr/local/go/src/crypto/tls/conn.go:801
#	0x708314	crypto/tls.(*Conn).readRecordOrCCS+0x114				/usr/local/go/src/crypto/tls/conn.go:608
#	0x70e5de	crypto/tls.(*Conn).readRecord+0x15e					/usr/local/go/src/crypto/tls/conn.go:576
#	0x70e5e9	crypto/tls.(*Conn).Read+0x169						/usr/local/go/src/crypto/tls/conn.go:1252
#	0x56a921	bufio.(*Reader).Read+0x221						/usr/local/go/src/bufio/bufio.go:227
#	0x4d3626	io.ReadAtLeast+0x86							/usr/local/go/src/io/io.go:314
#	0x8fb1c8	io.ReadFull+0x88							/usr/local/go/src/io/io.go:333
#	0x8fb179	golang.org/x/net/http2.readFrameHeader+0x39				/go/pkg/mod/golang.org/x/net@v0.0.0-20190613194153-d28f0bde5980/http2/frame.go:237
#	0x8fba44	golang.org/x/net/http2.(*Framer).ReadFrame+0xa4				/go/pkg/mod/golang.org/x/net@v0.0.0-20190613194153-d28f0bde5980/http2/frame.go:492
#	0x918f51	google.golang.org/grpc/internal/transport.(*http2Client).reader+0x171	/go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/http2_client.go:1218

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d81e5 0x4d81c3 0x53e80f 0x55214e 0x7c7638 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54		/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44		/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d81e4	internal/poll.(*pollDesc).waitRead+0x1a4	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d81c2	internal/poll.(*FD).Read+0x182			/usr/local/go/src/internal/poll/fd_unix.go:159
#	0x53e80e	net.(*netFD).Read+0x4e				/usr/local/go/src/net/fd_posix.go:55
#	0x55214d	net.(*conn).Read+0x8d				/usr/local/go/src/net/net.go:182
#	0x7c7637	net/http.(*connReader).backgroundRead+0x57	/usr/local/go/src/net/http/server.go:690

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d9d9c 0x4d9d7e 0x53fd85 0x55c132 0x55af65 0x7d2466 0xae23cc 0xae2362 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54					/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44					/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d9d9b	internal/poll.(*pollDesc).waitRead+0x1fb				/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d9d7d	internal/poll.(*FD).Accept+0x1dd					/usr/local/go/src/internal/poll/fd_unix.go:394
#	0x53fd84	net.(*netFD).accept+0x44						/usr/local/go/src/net/fd_unix.go:172
#	0x55c131	net.(*TCPListener).accept+0x31						/usr/local/go/src/net/tcpsock_posix.go:139
#	0x55af64	net.(*TCPListener).Accept+0x64						/usr/local/go/src/net/tcpsock.go:261
#	0x7d2465	net/http.(*Server).Serve+0x265						/usr/local/go/src/net/http/server.go:2937
#	0xae23cb	net/http.Serve+0x8b							/usr/local/go/src/net/http/server.go:2498
#	0xae2361	github.com/zrepl/zrepl/daemon.(*pprofServer).controlLoop.func1+0x21	/home/circleci/project/daemon/pprof.go:74

1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d9d9c 0x4d9d7e 0x53fd85 0x562572 0x560805 0x7d2466 0xae19c5 0x46e681
#	0x468cf4	internal/poll.runtime_pollWait+0x54				/usr/local/go/src/runtime/netpoll.go:220
#	0x4d71a4	internal/poll.(*pollDesc).wait+0x44				/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d9d9b	internal/poll.(*pollDesc).waitRead+0x1fb			/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d9d7d	internal/poll.(*FD).Accept+0x1dd				/usr/local/go/src/internal/poll/fd_unix.go:394
#	0x53fd84	net.(*netFD).accept+0x44					/usr/local/go/src/net/fd_unix.go:172
#	0x562571	net.(*UnixListener).accept+0x31					/usr/local/go/src/net/unixsock_posix.go:162
#	0x560804	net.(*UnixListener).Accept+0x64					/usr/local/go/src/net/unixsock.go:260
#	0x7d2465	net/http.(*Server).Serve+0x265					/usr/local/go/src/net/http/server.go:2937
#	0xae19c4	github.com/zrepl/zrepl/daemon.(*controlJob).Run.func5+0x44	/home/circleci/project/daemon/control.go:168

1 @ 0x4397a5 0x44982f 0x838e07 0x46e681
#	0x838e06	github.com/zrepl/zrepl/daemon/logging/trace.init.2.func1+0x3c6	/home/circleci/project/daemon/logging/trace/trace_chrometrace.go:146

1 @ 0x4397a5 0x44982f 0x90971b 0x909f53 0x92999b 0x46e681
#	0x90971a	google.golang.org/grpc/internal/transport.(*controlBuffer).get+0x11a	/go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/controlbuf.go:317
#	0x909f52	google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x1d2	/go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/controlbuf.go:435
#	0x92999a	google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x7a	/go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/http2_client.go:328

1 @ 0x4397a5 0x44982f 0x919885 0x46e681
#	0x919884	google.golang.org/grpc/internal/transport.(*http2Client).keepalive+0x2e4	/go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/http2_client.go:1299

1 @ 0x4397a5 0x44982f 0x95e319 0x46e681
#	0x95e318	google.golang.org/grpc.(*ccBalancerWrapper).watcher+0x118	/go/pkg/mod/google.golang.org/grpc@v1.17.0/balancer_conn_wrappers.go:122

1 @ 0x4397a5 0x44982f 0x9619e5 0xa7ce8d 0x46e681
#	0x9619e4	google.golang.org/grpc.(*ClientConn).WaitForStateChange+0x104	/go/pkg/mod/google.golang.org/grpc@v1.17.0/clientconn.go:417
#	0xa7ce8c	github.com/zrepl/zrepl/rpc.NewClient.func1+0x18c		/home/circleci/project/rpc/rpc_client.go:67

1 @ 0x4397a5 0x44982f 0xa49ee9 0xa46f8d 0xa90d3f 0xa90cfe 0x46e681
#	0xa49ee8	github.com/zrepl/zrepl/daemon/snapper.wait+0x1c8			/home/circleci/project/daemon/snapper/snapper.go:376
#	0xa46f8c	github.com/zrepl/zrepl/daemon/snapper.(*Snapper).Run+0x24c		/home/circleci/project/daemon/snapper/snapper.go:158
#	0xa90d3e	github.com/zrepl/zrepl/daemon/snapper.(*PeriodicOrManual).Run+0x5e	/home/circleci/project/daemon/snapper/snapper_all.go:23
#	0xa90cfd	github.com/zrepl/zrepl/daemon/job.(*modePush).RunPeriodic+0x1d		/home/circleci/project/daemon/job/active.go:132

1 @ 0x4397a5 0x44982f 0xa57c3b 0x46e681
#	0xa57c3a	github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func2+0x15a	/home/circleci/project/replication/driver/replication_stepqueue.go:92

1 @ 0x4397a5 0x44982f 0xa6864e 0x46e681
#	0xa6864d	github.com/zrepl/zrepl/rpc/dataconn/heartbeatconn.(*Conn).sendHeartbeats+0x1ad	/home/circleci/project/rpc/dataconn/heartbeatconn/heartbeatconn.go:84

1 @ 0x4397a5 0x44982f 0xa6fdfd 0xa7ae11 0xa5e0c3 0xa5a85f 0xa57554 0x82683d 0xa521ce 0xa567bc 0x46e681
#	0xa6fdfc	github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv+0x35c			/home/circleci/project/rpc/dataconn/dataconn_client.go:183
#	0xa7ae10	github.com/zrepl/zrepl/rpc.(*Client).Receive+0xd0				/home/circleci/project/rpc/rpc_client.go:108
#	0xa5e0c2	github.com/zrepl/zrepl/replication/logic.(*Step).doReplication+0x462		/home/circleci/project/replication/logic/replication_logic.go:620
#	0xa5a85e	github.com/zrepl/zrepl/replication/logic.(*Step).Step+0x3e			/home/circleci/project/replication/logic/replication_logic.go:187
#	0xa57553	github.com/zrepl/zrepl/replication/driver.(*fs).do.func5+0x1f3			/home/circleci/project/replication/driver/replication_driver.go:641
#	0x82683c	github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c			/home/circleci/project/util/chainlock/chainlock.go:41
#	0xa521cd	github.com/zrepl/zrepl/replication/driver.(*fs).do+0x94d			/home/circleci/project/replication/driver/replication_driver.go:634
#	0xa567bb	github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems.func1+0x13b	/home/circleci/project/replication/driver/replication_driver.go:433

1 @ 0x4397a5 0x44982f 0xa98809 0x46e681
#	0xa98808	github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do.func1+0x108	/home/circleci/project/daemon/job/active.go:438

1 @ 0x4397a5 0x44982f 0xadd465 0xae22c2 0x46e681
#	0xadd464	github.com/zrepl/zrepl/daemon.(*controlJob).Run+0x744	/home/circleci/project/daemon/control.go:172
#	0xae22c1	github.com/zrepl/zrepl/daemon.(*jobs).start.func1+0x141	/home/circleci/project/daemon/daemon.go:248

1 @ 0x4397a5 0x44982f 0xadf032 0xae1429 0x874fa2 0x5ce502 0x5cf11e 0x8754cd 0x8754be 0xb23bc5 0x4393a9 0x46e681
#	0xadf031	github.com/zrepl/zrepl/daemon.Run+0xb11			/home/circleci/project/daemon/daemon.go:111
#	0xae1428	github.com/zrepl/zrepl/daemon.glob..func1+0x48		/home/circleci/project/daemon/main.go:16
#	0x874fa1	github.com/zrepl/zrepl/cli.(*Subcommand).run+0xe1	/home/circleci/project/cli/cli.go:105
#	0x5ce501	github.com/spf13/cobra.(*Command).execute+0x2c1		/go/pkg/mod/github.com/spf13/cobra@v0.0.2/command.go:760
#	0x5cf11d	github.com/spf13/cobra.(*Command).ExecuteC+0x2fd	/go/pkg/mod/github.com/spf13/cobra@v0.0.2/command.go:846
#	0x8754cc	github.com/spf13/cobra.(*Command).Execute+0x2c		/go/pkg/mod/github.com/spf13/cobra@v0.0.2/command.go:794
#	0x8754bd	github.com/zrepl/zrepl/cli.Run+0x1d			/home/circleci/project/cli/cli.go:152
#	0xb23bc4	main.main+0x24						/home/circleci/project/main.go:24
#	0x4393a8	runtime.main+0x208					/usr/local/go/src/runtime/proc.go:204

1 @ 0x4397a5 0x44982f 0xae0788 0x46e681
#	0xae0787	github.com/zrepl/zrepl/daemon.(*pprofServer).controlLoop+0xe7	/home/circleci/project/daemon/pprof.go:45

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa5684a 0x82683d 0xa513f2 0xa4f7ba 0xa55108 0x82683d 0xa5564f 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44							/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64							/usr/local/go/src/sync/waitgroup.go:130
#	0xa56849	github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems.func2+0x29	/home/circleci/project/replication/driver/replication_driver.go:437
#	0x82683c	github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c			/home/circleci/project/util/chainlock/chainlock.go:41
#	0xa513f1	github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems+0x311	/home/circleci/project/replication/driver/replication_driver.go:436
#	0xa4f7b9	github.com/zrepl/zrepl/replication/driver.(*attempt).do+0x79			/home/circleci/project/replication/driver/replication_driver.go:301
#	0xa55107	github.com/zrepl/zrepl/replication/driver.Do.func1.1+0x47			/home/circleci/project/replication/driver/replication_driver.go:219
#	0x82683c	github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c			/home/circleci/project/util/chainlock/chainlock.go:41
#	0xa5564e	github.com/zrepl/zrepl/replication/driver.Do.func1+0x3ce			/home/circleci/project/replication/driver/replication_driver.go:218

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa6b4f6 0xa6f205 0xa71f3e 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44							/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64							/usr/local/go/src/sync/waitgroup.go:130
#	0xa6b4f5	github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).ReadStreamedMessage+0x3b5	/home/circleci/project/rpc/dataconn/stream/stream_conn.go:109
#	0xa6f204	github.com/zrepl/zrepl/rpc/dataconn.(*Client).recv+0x204			/home/circleci/project/rpc/dataconn/dataconn_client.go:87
#	0xa71f3d	github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv.func1+0x7d		/home/circleci/project/rpc/dataconn/dataconn_client.go:162

1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xae206d 0x46e681
#	0x46a6e4	sync.runtime_Semacquire+0x44				/usr/local/go/src/runtime/sema.go:56
#	0x4796c4	sync.(*WaitGroup).Wait+0x64				/usr/local/go/src/sync/waitgroup.go:130
#	0xae206c	github.com/zrepl/zrepl/daemon.(*jobs).wait.func1+0x2c	/home/circleci/project/daemon/daemon.go:144

1 @ 0x4397a5 0x46a998 0x46a96e 0x475edd 0xa57de8 0x46e681
#	0x46a96d	sync.runtime_notifyListWait+0xcd					/usr/local/go/src/runtime/sema.go:513
#	0x475edc	sync.(*Cond).Wait+0x9c							/usr/local/go/src/sync/cond.go:56
#	0xa57de7	github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func3+0x67	/home/circleci/project/replication/driver/replication_stepqueue.go:121

1 @ 0x4688dd 0xabc582 0xabc345 0xab8f32 0xac6925 0xac8205 0x7ceaa4 0x7d09cd 0x7d20a3 0x7cd8ad 0x46e681
#	0x4688dc	runtime/pprof.runtime_goroutineProfileWithLabels+0x5c	/usr/local/go/src/runtime/mprof.go:716
#	0xabc581	runtime/pprof.writeRuntimeProfile+0xe1			/usr/local/go/src/runtime/pprof/pprof.go:724
#	0xabc344	runtime/pprof.writeGoroutine+0xa4			/usr/local/go/src/runtime/pprof/pprof.go:684
#	0xab8f31	runtime/pprof.(*Profile).WriteTo+0x3f1			/usr/local/go/src/runtime/pprof/pprof.go:331
#	0xac6924	net/http/pprof.handler.ServeHTTP+0x384			/usr/local/go/src/net/http/pprof/pprof.go:256
#	0xac8204	net/http/pprof.Index+0x944				/usr/local/go/src/net/http/pprof/pprof.go:367
#	0x7ceaa3	net/http.HandlerFunc.ServeHTTP+0x43			/usr/local/go/src/net/http/server.go:2042
#	0x7d09cc	net/http.(*ServeMux).ServeHTTP+0x1ac			/usr/local/go/src/net/http/server.go:2417
#	0x7d20a2	net/http.serverHandler.ServeHTTP+0xa2			/usr/local/go/src/net/http/server.go:2843
#	0x7cd8ac	net/http.(*conn).serve+0x8ac				/usr/local/go/src/net/http/server.go:1925

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 62 (35 by maintainers)

Most upvoted comments

Errata updates with the fix are now available via freebsd-update with the fix.

  • FreeBSD 12.2-RELASE-p12
  • FreeBSD 12.3-RELEASE-p1

Since I upgraded to 13 a few weeks ago, I have run zrepl without errors and without a trimmed down cpuset. So, one more sign that the fix that @emaste linked to above did indeed fix this issue #411.

Given @mdtancsa 's positive experience on 12.3, I’ll close this issue. I will change the title to >= 12.2, < 13.0, and update once the errata for 12.3 comes out.

I have been running the commit for the past 5 days and no deadlocks or crashes. Over the weekend, the server in question runs quite a busy load and normally would result in at least 3-4 crashes a day. I would say this fixed the issue for me at this point. FreeBSD 12.3-STABLE #14 r371237 Tue Dec 14 11:48:29 EST 2021 Thank you @avg-I for fixing this long standing bug!