etcd: [3.4] panic: runtime error: invalid memory address or nil pointer dereference

I have seen this error multiple times in the 3.4 pipeline:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xab2922]
goroutine 3500 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/home/runner/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000598f00, 0x15792dc, 0x3, 0x3, 0xc0000e34e8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/home/runner/work/etcd/etcd/mvcc/backend/read_tx.go:195 +0x642
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x67, 0x0, 0x66, ...)
	/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0xc00005e800, ...)
	/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000598f60, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x40fa00, ...)
	/home/runner/work/etcd/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0005d95c0, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x0, ...)
	/home/runner/work/etcd/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc001bb20c0)
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:405 +0x1ef
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc001bb20c0)
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
FAIL	go.etcd.io/etcd/etcdserver/api/v2store	3.484s
FAIL

Refer to https://github.com/etcd-io/etcd/runs/7460637358?check_suite_focus=true

Please see my comment: https://github.com/etcd-io/etcd/issues/14256#issuecomment-1202083883

Most upvoted comments

Thanks @JohnJAS and @rtheis for the feedback. It seems to be a different issue from this one, so I have raised another issue https://github.com/etcd-io/etcd/issues/14402 to track it. Please feel free to submit a PR for it if you are interested.

@ramses You can continue to follow up on your current issue so that we avoid doing duplicate work. cc @ahrtr, if there is any difficulty, I will keep you informed.

This issue can happen on both main and release-3.4, and I believe release-3.5 has this issue as well.

The root cause is that etcd stops immediately before (*readView).Range is called, so the tx is nil; accordingly, the subsequent range operation panics.
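
To make the failure mode concrete, below is a minimal, self-contained Go sketch (illustrative only, not etcd code; all identifiers are made up) of the same pattern: a shutdown path nils out a shared transaction pointer, and a read path that dereferences it afterwards panics exactly like the stack trace above.

package main

import "fmt"

// tx stands in for a bbolt transaction. bucket dereferences the receiver,
// so calling it through a nil *tx panics, just like (*Tx).Bucket above.
type tx struct{ buckets map[string][]byte }

func (t *tx) bucket(name string) []byte { return t.buckets[name] }

// readTx stands in for concurrentReadTx: it caches a *tx that the
// shutdown path sets to nil.
type readTx struct{ tx *tx }

func main() {
        rt := &readTx{tx: &tx{buckets: map[string][]byte{"key": nil}}}
        rt.tx = nil // simulates the backend being closed during shutdown
        // panic: runtime error: invalid memory address or nil pointer dereference
        fmt.Println(rt.tx.bucket("key"))
}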

I managed to reproduce this issue deliberately by simulating the situation described above. The code change (on 3.4) is below:

wachao-a01:etcd wachao$ git diff
diff --git a/etcdmain/etcd.go b/etcdmain/etcd.go
index 73328a73d..ca76e3910 100644
--- a/etcdmain/etcd.go
+++ b/etcdmain/etcd.go
@@ -294,6 +294,7 @@ func startEtcdOrProxyV2() {
        case <-stopped:
        }
 
+       time.Sleep(10 * time.Second)
        osutil.Exit(0)
 }
 
diff --git a/etcdserver/api/v3rpc/watch.go b/etcdserver/api/v3rpc/watch.go
index 6a9204024..e7c27713a 100644
--- a/etcdserver/api/v3rpc/watch.go
+++ b/etcdserver/api/v3rpc/watch.go
@@ -402,6 +402,11 @@ func (sws *serverWatchStream) sendLoop() {
                                events[i] = &evs[i]
                                if needPrevKV && !isCreateEvent(evs[i]) {
                                        opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1}
+                                       var s *etcdserver.EtcdServer = sws.sg.(*etcdserver.EtcdServer)
+                                       go func() {
+                                               s.Stop()
+                                       }()
+                                       time.Sleep(1*time.Second)
                                        r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt)
                                        if err == nil && len(r.KVs) != 0 {
                                                events[i].PrevKv = &(r.KVs[0])
diff --git a/mvcc/backend/read_tx.go b/mvcc/backend/read_tx.go
index 91fe72ec5..f088e5fa9 100644
--- a/mvcc/backend/read_tx.go
+++ b/mvcc/backend/read_tx.go
@@ -16,6 +16,7 @@ package backend
 
 import (
        "bytes"
+       "fmt"
        "math"
        "sync"
 
@@ -192,6 +193,9 @@ func (rt *concurrentReadTx) UnsafeRange(bucketName, key, endKey []byte, limit in
        rt.txMu.RUnlock()
        if !ok {
                rt.txMu.Lock()
+               if rt.tx == nil {
+                       fmt.Println("###### IN (*concurrentReadTx) UnsafeRange, rt.tx is nil!!!")
+               }
                bucket = rt.tx.Bucket(bucketName)
                rt.buckets[bn] = bucket
                rt.txMu.Unlock()

Then the issue can always be reproduced when running the test TestV3WatchWithPrevKV:

wachao-a01:integration wachao$ go test -v -cpu 2 -run TestV3WatchWithPrevKV
=== RUN   TestV3WatchWithPrevKV
###### IN (*concurrentReadTx) UnsafeRange, rt.tx is nil!!!
    v3_watch_test.go:1212: timeout waiting for watch response
{"level":"warn","ts":"2022-08-02T10:38:23.889+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e005a80b-1a7c-4a2c-83ed-a52cd49a61c4/localhost:83867718980294516900","attempt":0,"error":"rpc error: code = Unknown desc = raft: stopped"}
    v3_watch_test.go:1168: rpc error: code = Unknown desc = raft: stopped
    v3_watch_test.go:1198: rpc error: code = Canceled desc = context canceled
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x174694d]

goroutine 216 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/Users/wachao/go/gopath/pkg/mod/go.etcd.io/bbolt@v1.3.6/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000620300, 0x2316484, 0x3, 0x3, 0xc00011e5b8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/backend/read_tx.go:199 +0x64d
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x3, 0x0, 0x2, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000100000, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000620360, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0002101f0, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000351300, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc000176b40)
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:410 +0x276
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc000176b40)
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
exit status 2
FAIL	go.etcd.io/etcd/integration	31.025s

The probability of running into this issue is really low in a production environment. It’s even harder to reproduce in the pipeline after merging https://github.com/etcd-io/etcd/pull/14290 . Since the etcdserver should have already been stopped when this issue occurs, the last transaction should have already been committed, so it should be safe. Accordingly, I don’t think it’s a blocker for 3.4.20 any more.

I will think about a formal fix or refactoring in the future.
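
For illustration only, here is a minimal sketch (under my own assumptions; the names readTx, unsafeRange and ErrBackendClosed are hypothetical, and the real UnsafeRange has a different signature) of the kind of defensive guard such a refactoring could add, so the read fails cleanly instead of panicking when shutdown has already nilled the transaction out:

package main

import (
        "errors"
        "fmt"
        "sync"
)

// ErrBackendClosed is a hypothetical sentinel for reads racing shutdown.
var ErrBackendClosed = errors.New("backend already closed")

// tx stands in for a bbolt transaction.
type tx struct{}

func (t *tx) bucket(name string) string { return "bucket:" + name }

// readTx stands in for concurrentReadTx; shutdown sets tx to nil.
type readTx struct {
        mu sync.RWMutex
        tx *tx
}

// unsafeRange checks the transaction under the lock and returns an error
// instead of dereferencing a nil pointer.
func (rt *readTx) unsafeRange(bucketName string) (string, error) {
        rt.mu.RLock()
        defer rt.mu.RUnlock()
        if rt.tx == nil {
                return "", ErrBackendClosed
        }
        return rt.tx.bucket(bucketName), nil
}

func main() {
        rt := &readTx{tx: &tx{}}
        rt.tx = nil // simulate shutdown racing the watch sendLoop
        if _, err := rt.unsafeRange("key"); err != nil {
                fmt.Println("range failed cleanly:", err) // no SIGSEGV
        }
}

A caller such as the watch sendLoop could then translate the error into a normal stream failure rather than crashing the process.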

I suggest resolving this issue first. @SimFG @ramses

Commenting on https://github.com/etcd-io/etcd/issues/14143:

  1. I am NOT convinced that it’s a real etcd issue. Apart from the lazyfs issue, I even suspect it’s an issue on the client side, including the test case or jetcd. Of course, it needs more investigation.
  2. We can start by qualifying our corruption detection mechanism (such as https://github.com/etcd-io/etcd/pull/14282 ) and our test process (such as Jepsen or our functional tests) first.

@ahrtr Ok, I will look into it.