zrepl: FreeBSD >= 12.2, < 13.0: Go runtime deadlocks and/or panics on multicore systems
Happened 2 weeks after upgrading my main server to FreeBSD 12.2. Made a goroutine dump of the receiving side. Most interesting stack:
1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa6b4f6 0xa6f205 0xa71f3e 0x46e681
# 0x46a6e4 sync.runtime_Semacquire+0x44 /usr/local/go/src/runtime/sema.go:56
# 0x4796c4 sync.(*WaitGroup).Wait+0x64 /usr/local/go/src/sync/waitgroup.go:130
# 0xa6b4f5 github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).ReadStreamedMessage+0x3b5 /home/circleci/project/rpc/dataconn/stream/stream_conn.go:109
# 0xa6f204 github.com/zrepl/zrepl/rpc/dataconn.(*Client).recv+0x204 /home/circleci/project/rpc/dataconn/dataconn_client.go:87
# 0xa71f3d github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv.func1+0x7d /home/circleci/project/rpc/dataconn/dataconn_client.go:162
But:
- no
zfsprocess on the receiving side - other stacks show that the connection was probably still active (heartbeat goroutine existed)
- sadly didn’t capture
sockstatoutput
Maybe I’m hitting the same bug that @mdtancsa had/has on CURRENT. Let’s see whether it happens again at some point. This machine has hyperthreading disabled but does the (experimental) parallel replication.
goroutine profile: total 27
1 @ 0x40c2d4 0x46aefd 0xa9ae25 0x46e681
# 0x46aefc os/signal.signal_recv+0x9c /usr/local/go/src/runtime/sigqueue.go:147
# 0xa9ae24 os/signal.loop+0x24 /usr/local/go/src/os/signal/signal_unix.go:23
1 @ 0x4397a5 0x4065af 0x4061eb 0xa56178 0xa939ff 0xa93366 0xae22c2 0x46e681
# 0xa56177 github.com/zrepl/zrepl/replication/driver.Do.func2+0x77 /home/circleci/project/replication/driver/replication_driver.go:280
# 0xa939fe github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do+0x37e /home/circleci/project/daemon/job/active.go:467
# 0xa93365 github.com/zrepl/zrepl/daemon/job.(*ActiveSide).Run+0x425 /home/circleci/project/daemon/job/active.go:424
# 0xae22c1 github.com/zrepl/zrepl/daemon.(*jobs).start.func1+0x141 /home/circleci/project/daemon/daemon.go:248
1 @ 0x4397a5 0x4065af 0x4061eb 0xa57888 0x46e681
# 0xa57887 github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func1+0x47 /home/circleci/project/replication/driver/replication_stepqueue.go:85
1 @ 0x4397a5 0x4065af 0x4061eb 0xa7ccd8 0x46e681
# 0xa7ccd7 github.com/zrepl/zrepl/rpc.NewClient.func1.1+0x37 /home/circleci/project/rpc/rpc_client.go:60
1 @ 0x4397a5 0x4065af 0x4061eb 0xae1d14 0x46e681
# 0xae1d13 github.com/zrepl/zrepl/daemon.Run.func1+0x33 /home/circleci/project/daemon/daemon.go:37
1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d81e5 0x4d81c3 0x53e80f 0x55214e 0x70b2a2 0x4f6f51 0x70b4f3 0x708315 0x70e5df 0x70e5ea 0x56a922 0x4d3627 0x8fb1c9 0x8fb17a 0x8fba45 0x918f52 0x46e681
# 0x468cf4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:220
# 0x4d71a4 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4d81e4 internal/poll.(*pollDesc).waitRead+0x1a4 /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4d81c2 internal/poll.(*FD).Read+0x182 /usr/local/go/src/internal/poll/fd_unix.go:159
# 0x53e80e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_posix.go:55
# 0x55214d net.(*conn).Read+0x8d /usr/local/go/src/net/net.go:182
# 0x70b2a1 crypto/tls.(*atLeastReader).Read+0x61 /usr/local/go/src/crypto/tls/conn.go:779
# 0x4f6f50 bytes.(*Buffer).ReadFrom+0xb0 /usr/local/go/src/bytes/buffer.go:204
# 0x70b4f2 crypto/tls.(*Conn).readFromUntil+0xf2 /usr/local/go/src/crypto/tls/conn.go:801
# 0x708314 crypto/tls.(*Conn).readRecordOrCCS+0x114 /usr/local/go/src/crypto/tls/conn.go:608
# 0x70e5de crypto/tls.(*Conn).readRecord+0x15e /usr/local/go/src/crypto/tls/conn.go:576
# 0x70e5e9 crypto/tls.(*Conn).Read+0x169 /usr/local/go/src/crypto/tls/conn.go:1252
# 0x56a921 bufio.(*Reader).Read+0x221 /usr/local/go/src/bufio/bufio.go:227
# 0x4d3626 io.ReadAtLeast+0x86 /usr/local/go/src/io/io.go:314
# 0x8fb1c8 io.ReadFull+0x88 /usr/local/go/src/io/io.go:333
# 0x8fb179 golang.org/x/net/http2.readFrameHeader+0x39 /go/pkg/mod/golang.org/x/net@v0.0.0-20190613194153-d28f0bde5980/http2/frame.go:237
# 0x8fba44 golang.org/x/net/http2.(*Framer).ReadFrame+0xa4 /go/pkg/mod/golang.org/x/net@v0.0.0-20190613194153-d28f0bde5980/http2/frame.go:492
# 0x918f51 google.golang.org/grpc/internal/transport.(*http2Client).reader+0x171 /go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/http2_client.go:1218
1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d81e5 0x4d81c3 0x53e80f 0x55214e 0x7c7638 0x46e681
# 0x468cf4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:220
# 0x4d71a4 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4d81e4 internal/poll.(*pollDesc).waitRead+0x1a4 /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4d81c2 internal/poll.(*FD).Read+0x182 /usr/local/go/src/internal/poll/fd_unix.go:159
# 0x53e80e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_posix.go:55
# 0x55214d net.(*conn).Read+0x8d /usr/local/go/src/net/net.go:182
# 0x7c7637 net/http.(*connReader).backgroundRead+0x57 /usr/local/go/src/net/http/server.go:690
1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d9d9c 0x4d9d7e 0x53fd85 0x55c132 0x55af65 0x7d2466 0xae23cc 0xae2362 0x46e681
# 0x468cf4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:220
# 0x4d71a4 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4d9d9b internal/poll.(*pollDesc).waitRead+0x1fb /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4d9d7d internal/poll.(*FD).Accept+0x1dd /usr/local/go/src/internal/poll/fd_unix.go:394
# 0x53fd84 net.(*netFD).accept+0x44 /usr/local/go/src/net/fd_unix.go:172
# 0x55c131 net.(*TCPListener).accept+0x31 /usr/local/go/src/net/tcpsock_posix.go:139
# 0x55af64 net.(*TCPListener).Accept+0x64 /usr/local/go/src/net/tcpsock.go:261
# 0x7d2465 net/http.(*Server).Serve+0x265 /usr/local/go/src/net/http/server.go:2937
# 0xae23cb net/http.Serve+0x8b /usr/local/go/src/net/http/server.go:2498
# 0xae2361 github.com/zrepl/zrepl/daemon.(*pprofServer).controlLoop.func1+0x21 /home/circleci/project/daemon/pprof.go:74
1 @ 0x4397a5 0x43257b 0x468cf5 0x4d71a5 0x4d9d9c 0x4d9d7e 0x53fd85 0x562572 0x560805 0x7d2466 0xae19c5 0x46e681
# 0x468cf4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:220
# 0x4d71a4 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4d9d9b internal/poll.(*pollDesc).waitRead+0x1fb /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4d9d7d internal/poll.(*FD).Accept+0x1dd /usr/local/go/src/internal/poll/fd_unix.go:394
# 0x53fd84 net.(*netFD).accept+0x44 /usr/local/go/src/net/fd_unix.go:172
# 0x562571 net.(*UnixListener).accept+0x31 /usr/local/go/src/net/unixsock_posix.go:162
# 0x560804 net.(*UnixListener).Accept+0x64 /usr/local/go/src/net/unixsock.go:260
# 0x7d2465 net/http.(*Server).Serve+0x265 /usr/local/go/src/net/http/server.go:2937
# 0xae19c4 github.com/zrepl/zrepl/daemon.(*controlJob).Run.func5+0x44 /home/circleci/project/daemon/control.go:168
1 @ 0x4397a5 0x44982f 0x838e07 0x46e681
# 0x838e06 github.com/zrepl/zrepl/daemon/logging/trace.init.2.func1+0x3c6 /home/circleci/project/daemon/logging/trace/trace_chrometrace.go:146
1 @ 0x4397a5 0x44982f 0x90971b 0x909f53 0x92999b 0x46e681
# 0x90971a google.golang.org/grpc/internal/transport.(*controlBuffer).get+0x11a /go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/controlbuf.go:317
# 0x909f52 google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x1d2 /go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/controlbuf.go:435
# 0x92999a google.golang.org/grpc/internal/transport.newHTTP2Client.func3+0x7a /go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/http2_client.go:328
1 @ 0x4397a5 0x44982f 0x919885 0x46e681
# 0x919884 google.golang.org/grpc/internal/transport.(*http2Client).keepalive+0x2e4 /go/pkg/mod/google.golang.org/grpc@v1.17.0/internal/transport/http2_client.go:1299
1 @ 0x4397a5 0x44982f 0x95e319 0x46e681
# 0x95e318 google.golang.org/grpc.(*ccBalancerWrapper).watcher+0x118 /go/pkg/mod/google.golang.org/grpc@v1.17.0/balancer_conn_wrappers.go:122
1 @ 0x4397a5 0x44982f 0x9619e5 0xa7ce8d 0x46e681
# 0x9619e4 google.golang.org/grpc.(*ClientConn).WaitForStateChange+0x104 /go/pkg/mod/google.golang.org/grpc@v1.17.0/clientconn.go:417
# 0xa7ce8c github.com/zrepl/zrepl/rpc.NewClient.func1+0x18c /home/circleci/project/rpc/rpc_client.go:67
1 @ 0x4397a5 0x44982f 0xa49ee9 0xa46f8d 0xa90d3f 0xa90cfe 0x46e681
# 0xa49ee8 github.com/zrepl/zrepl/daemon/snapper.wait+0x1c8 /home/circleci/project/daemon/snapper/snapper.go:376
# 0xa46f8c github.com/zrepl/zrepl/daemon/snapper.(*Snapper).Run+0x24c /home/circleci/project/daemon/snapper/snapper.go:158
# 0xa90d3e github.com/zrepl/zrepl/daemon/snapper.(*PeriodicOrManual).Run+0x5e /home/circleci/project/daemon/snapper/snapper_all.go:23
# 0xa90cfd github.com/zrepl/zrepl/daemon/job.(*modePush).RunPeriodic+0x1d /home/circleci/project/daemon/job/active.go:132
1 @ 0x4397a5 0x44982f 0xa57c3b 0x46e681
# 0xa57c3a github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func2+0x15a /home/circleci/project/replication/driver/replication_stepqueue.go:92
1 @ 0x4397a5 0x44982f 0xa6864e 0x46e681
# 0xa6864d github.com/zrepl/zrepl/rpc/dataconn/heartbeatconn.(*Conn).sendHeartbeats+0x1ad /home/circleci/project/rpc/dataconn/heartbeatconn/heartbeatconn.go:84
1 @ 0x4397a5 0x44982f 0xa6fdfd 0xa7ae11 0xa5e0c3 0xa5a85f 0xa57554 0x82683d 0xa521ce 0xa567bc 0x46e681
# 0xa6fdfc github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv+0x35c /home/circleci/project/rpc/dataconn/dataconn_client.go:183
# 0xa7ae10 github.com/zrepl/zrepl/rpc.(*Client).Receive+0xd0 /home/circleci/project/rpc/rpc_client.go:108
# 0xa5e0c2 github.com/zrepl/zrepl/replication/logic.(*Step).doReplication+0x462 /home/circleci/project/replication/logic/replication_logic.go:620
# 0xa5a85e github.com/zrepl/zrepl/replication/logic.(*Step).Step+0x3e /home/circleci/project/replication/logic/replication_logic.go:187
# 0xa57553 github.com/zrepl/zrepl/replication/driver.(*fs).do.func5+0x1f3 /home/circleci/project/replication/driver/replication_driver.go:641
# 0x82683c github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c /home/circleci/project/util/chainlock/chainlock.go:41
# 0xa521cd github.com/zrepl/zrepl/replication/driver.(*fs).do+0x94d /home/circleci/project/replication/driver/replication_driver.go:634
# 0xa567bb github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems.func1+0x13b /home/circleci/project/replication/driver/replication_driver.go:433
1 @ 0x4397a5 0x44982f 0xa98809 0x46e681
# 0xa98808 github.com/zrepl/zrepl/daemon/job.(*ActiveSide).do.func1+0x108 /home/circleci/project/daemon/job/active.go:438
1 @ 0x4397a5 0x44982f 0xadd465 0xae22c2 0x46e681
# 0xadd464 github.com/zrepl/zrepl/daemon.(*controlJob).Run+0x744 /home/circleci/project/daemon/control.go:172
# 0xae22c1 github.com/zrepl/zrepl/daemon.(*jobs).start.func1+0x141 /home/circleci/project/daemon/daemon.go:248
1 @ 0x4397a5 0x44982f 0xadf032 0xae1429 0x874fa2 0x5ce502 0x5cf11e 0x8754cd 0x8754be 0xb23bc5 0x4393a9 0x46e681
# 0xadf031 github.com/zrepl/zrepl/daemon.Run+0xb11 /home/circleci/project/daemon/daemon.go:111
# 0xae1428 github.com/zrepl/zrepl/daemon.glob..func1+0x48 /home/circleci/project/daemon/main.go:16
# 0x874fa1 github.com/zrepl/zrepl/cli.(*Subcommand).run+0xe1 /home/circleci/project/cli/cli.go:105
# 0x5ce501 github.com/spf13/cobra.(*Command).execute+0x2c1 /go/pkg/mod/github.com/spf13/cobra@v0.0.2/command.go:760
# 0x5cf11d github.com/spf13/cobra.(*Command).ExecuteC+0x2fd /go/pkg/mod/github.com/spf13/cobra@v0.0.2/command.go:846
# 0x8754cc github.com/spf13/cobra.(*Command).Execute+0x2c /go/pkg/mod/github.com/spf13/cobra@v0.0.2/command.go:794
# 0x8754bd github.com/zrepl/zrepl/cli.Run+0x1d /home/circleci/project/cli/cli.go:152
# 0xb23bc4 main.main+0x24 /home/circleci/project/main.go:24
# 0x4393a8 runtime.main+0x208 /usr/local/go/src/runtime/proc.go:204
1 @ 0x4397a5 0x44982f 0xae0788 0x46e681
# 0xae0787 github.com/zrepl/zrepl/daemon.(*pprofServer).controlLoop+0xe7 /home/circleci/project/daemon/pprof.go:45
1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa5684a 0x82683d 0xa513f2 0xa4f7ba 0xa55108 0x82683d 0xa5564f 0x46e681
# 0x46a6e4 sync.runtime_Semacquire+0x44 /usr/local/go/src/runtime/sema.go:56
# 0x4796c4 sync.(*WaitGroup).Wait+0x64 /usr/local/go/src/sync/waitgroup.go:130
# 0xa56849 github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems.func2+0x29 /home/circleci/project/replication/driver/replication_driver.go:437
# 0x82683c github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c /home/circleci/project/util/chainlock/chainlock.go:41
# 0xa513f1 github.com/zrepl/zrepl/replication/driver.(*attempt).doFilesystems+0x311 /home/circleci/project/replication/driver/replication_driver.go:436
# 0xa4f7b9 github.com/zrepl/zrepl/replication/driver.(*attempt).do+0x79 /home/circleci/project/replication/driver/replication_driver.go:301
# 0xa55107 github.com/zrepl/zrepl/replication/driver.Do.func1.1+0x47 /home/circleci/project/replication/driver/replication_driver.go:219
# 0x82683c github.com/zrepl/zrepl/util/chainlock.(*L).DropWhile+0x5c /home/circleci/project/util/chainlock/chainlock.go:41
# 0xa5564e github.com/zrepl/zrepl/replication/driver.Do.func1+0x3ce /home/circleci/project/replication/driver/replication_driver.go:218
1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xa6b4f6 0xa6f205 0xa71f3e 0x46e681
# 0x46a6e4 sync.runtime_Semacquire+0x44 /usr/local/go/src/runtime/sema.go:56
# 0x4796c4 sync.(*WaitGroup).Wait+0x64 /usr/local/go/src/sync/waitgroup.go:130
# 0xa6b4f5 github.com/zrepl/zrepl/rpc/dataconn/stream.(*Conn).ReadStreamedMessage+0x3b5 /home/circleci/project/rpc/dataconn/stream/stream_conn.go:109
# 0xa6f204 github.com/zrepl/zrepl/rpc/dataconn.(*Client).recv+0x204 /home/circleci/project/rpc/dataconn/dataconn_client.go:87
# 0xa71f3d github.com/zrepl/zrepl/rpc/dataconn.(*Client).ReqRecv.func1+0x7d /home/circleci/project/rpc/dataconn/dataconn_client.go:162
1 @ 0x4397a5 0x44a1e5 0x44a1ce 0x46a6e5 0x4796c5 0xae206d 0x46e681
# 0x46a6e4 sync.runtime_Semacquire+0x44 /usr/local/go/src/runtime/sema.go:56
# 0x4796c4 sync.(*WaitGroup).Wait+0x64 /usr/local/go/src/sync/waitgroup.go:130
# 0xae206c github.com/zrepl/zrepl/daemon.(*jobs).wait.func1+0x2c /home/circleci/project/daemon/daemon.go:144
1 @ 0x4397a5 0x46a998 0x46a96e 0x475edd 0xa57de8 0x46e681
# 0x46a96d sync.runtime_notifyListWait+0xcd /usr/local/go/src/runtime/sema.go:513
# 0x475edc sync.(*Cond).Wait+0x9c /usr/local/go/src/sync/cond.go:56
# 0xa57de7 github.com/zrepl/zrepl/replication/driver.(*stepQueue).Start.func3+0x67 /home/circleci/project/replication/driver/replication_stepqueue.go:121
1 @ 0x4688dd 0xabc582 0xabc345 0xab8f32 0xac6925 0xac8205 0x7ceaa4 0x7d09cd 0x7d20a3 0x7cd8ad 0x46e681
# 0x4688dc runtime/pprof.runtime_goroutineProfileWithLabels+0x5c /usr/local/go/src/runtime/mprof.go:716
# 0xabc581 runtime/pprof.writeRuntimeProfile+0xe1 /usr/local/go/src/runtime/pprof/pprof.go:724
# 0xabc344 runtime/pprof.writeGoroutine+0xa4 /usr/local/go/src/runtime/pprof/pprof.go:684
# 0xab8f31 runtime/pprof.(*Profile).WriteTo+0x3f1 /usr/local/go/src/runtime/pprof/pprof.go:331
# 0xac6924 net/http/pprof.handler.ServeHTTP+0x384 /usr/local/go/src/net/http/pprof/pprof.go:256
# 0xac8204 net/http/pprof.Index+0x944 /usr/local/go/src/net/http/pprof/pprof.go:367
# 0x7ceaa3 net/http.HandlerFunc.ServeHTTP+0x43 /usr/local/go/src/net/http/server.go:2042
# 0x7d09cc net/http.(*ServeMux).ServeHTTP+0x1ac /usr/local/go/src/net/http/server.go:2417
# 0x7d20a2 net/http.serverHandler.ServeHTTP+0xa2 /usr/local/go/src/net/http/server.go:2843
# 0x7cd8ac net/http.(*conn).serve+0x8ac /usr/local/go/src/net/http/server.go:1925
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 62 (35 by maintainers)
Errata updates with the fix are now available via
freebsd-updatewith the fix.Since I upgraded to 13 a few weeks ago, I have run zrepl without errors and without a trimmed down
cpuset. So, one more sign that the fix that @emaste linked to above did indeed fix this issue #411.Given @mdtancsa 's positive experience on 12.3, I’ll close this issue. I will change the title to >= 12.2, < 13.0, and update once the errata for 12.3 comes out.
I have been running the commit for the past 5 days and no deadlocks or crashes. Over the weekend, the server in question runs quite a busy load and normally would result in at least 3-4 crashes a day. I would say this fixed the issue for me at this point. FreeBSD 12.3-STABLE #14 r371237 Tue Dec 14 11:48:29 EST 2021 Thank you @avg-I for fixing this long standing bug!