etcd 3.4.18: etcd data inconsistency

What happened?

A 3-member etcd cluster returned inconsistent data.

Logs from the 'bad' node:

{
   "level": "info",
   "ts": "2022-06-09T09:21:10.368Z",
   "caller": "mvcc/kvstore_compaction.go:56",
   "msg": "finished scheduled compaction",
   "compact-revision": 165517829,
   "took": "13.589883ms"
}
{
    "level": "fatal",
    "ts": "2022-06-09T09:22:13.472Z",
    "caller": "mvcc/kvstore_txn.go:150",
    "msg": "range failed to find revision pair",
    "revision-main": 165519108,
    "revision-sub": 0,
    "stacktrace": "
          go.etcd.io/etcd/mvcc.(storeTxnRead).rangeKeys (http://go.etcd.io/etcd/mvcc.(storeTxnRead).rangeKeys)
             /home/ec2-user/workplace/etcd/src/etcd/mvcc/kvstore_txn.go:150
          go.etcd.io/etcd/mvcc.(storeTxnRead).Range (http://go.etcd.io/etcd/mvcc.(storeTxnRead).Range)
              /home/ec2-user/workplace/etcd/src/etcd/mvcc/kvstore_txn.go:51
          go.etcd.io/etcd/mvcc.(metricsTxnWrite).Range (http://go.etcd.io/etcd/mvcc.(metricsTxnWrite).Range)
              /home/ec2-user/workplace/etcd/src/etcd/mvcc/metrics_txn.go:37
          go.etcd.io/etcd/etcdserver.(applierV3backend).Range (http://go.etcd.io/etcd/etcdserver.(applierV3backend).Range)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/apply.go:286
          go.etcd.io/etcd/etcdserver.(EtcdServer).Range.func3 (http://go.etcd.io/etcd/etcdserver.(EtcdServer).Range.func3)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/v3_server.go:121
          go.etcd.io/etcd/etcdserver.(EtcdServer).doSerialize (http://go.etcd.io/etcd/etcdserver.(EtcdServer).doSerialize)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/v3_server.go:600
          go.etcd.io/etcd/etcdserver.(EtcdServer).Range (http://go.etcd.io/etcd/etcdserver.(EtcdServer).Range)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/v3_server.go:122
          go.etcd.io/etcd/etcdserver/api/v3rpc.(kvServer).Range (http://go.etcd.io/etcd/etcdserver/api/v3rpc.(kvServer).Range)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/api/v3rpc/key.go:52
          go.etcd.io/etcd/etcdserver/etcdserverpb._KV_Range_Handler.func1 (http://go.etcd.io/etcd/etcdserver/etcdserverpb._KV_Range_Handler.func1)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/etcdserverpb/rpc.pb.go:3554
          github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1 (http://github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1)
              /home/ec2-user/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:31 (http://github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:31)
          github.com/grpc-ecosystem/go-grpc-prometheus.(ServerMetrics).UnaryServerInterceptor.func1 (http://github.com/grpc-ecosystem/go-grpc-prometheus.(ServerMetrics).UnaryServerInterceptor.func1)
              /home/ec2-user/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 (http://github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107)
          github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1 (http://github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1)
              /home/ec2-user/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34 (http://github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34)
          go.etcd.io/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1 (http://go.etcd.io/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/api/v3rpc/interceptor.go:70
          github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1 (http://github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1)
              /home/ec2-user/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34 (http://github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34)
          go.etcd.io/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1 (http://go.etcd.io/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/api/v3rpc/interceptor.go:77
          github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1 (http://github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1)
              /home/ec2-user/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:39 (http://github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:39)
          go.etcd.io/etcd/etcdserver/etcdserverpb._KV_Range_Handler (http://go.etcd.io/etcd/etcdserver/etcdserverpb._KV_Range_Handler)
              /home/ec2-user/workplace/etcd/src/etcd/etcdserver/etcdserverpb/rpc.pb.go:3556
          google.golang.org/grpc.(Server).processUnaryRPC (http://google.golang.org/grpc.(Server).processUnaryRPC)
              /home/ec2-user/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1024 (http://google.golang.org/grpc@v1.26.0/server.go:1024)
          google.golang.org/grpc.(Server).handleStream (http://google.golang.org/grpc.(Server).handleStream)
              /home/ec2-user/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1313 (http://google.golang.org/grpc@v1.26.0/server.go:1313)
          google.golang.org/grpc.(Server).serveStreams.func1.1 (http://google.golang.org/grpc.(Server).serveStreams.func1.1)
              /home/ec2-user/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:722 (http://google.golang.org/grpc@v1.26.0/server.go:722)“
}

kube-controller-manager continuously failed to update its lock because of the partial data in etcd.

W0630 19:46:09.303166      10 watcher.go:199] watch chan error: etcdserver: mvcc: required revision has been compacted

E0630 19:45:38.273530      10 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"converting (v1.ConfigMap) to (core.Endpoints): Subsets not present in src"}

What did you expect to happen?

The error `range failed to find revision pair` should never happen, because a read transaction reads either from the write buffer or from boltdb.

We are reporting this case for reference.
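For context, here is an abridged paraphrase (not the verbatim source) of the read path that produced the fatal above, `(*storeTxnRead).rangeKeys` in `mvcc/kvstore_txn.go` on the 3.4 branch, with limits, options, and some error handling trimmed. The fatal fires when the in-memory treeIndex reports a revision pair that the backend (write buffer or boltdb) cannot find:

// Abridged paraphrase of (*storeTxnRead).rangeKeys in mvcc/kvstore_txn.go
// (etcd 3.4 branch); simplified for illustration.
func (tr *storeTxnRead) rangeKeys(key, end []byte, curRev int64, ro RangeOptions) (*RangeResult, error) {
	// Step 1: the in-memory treeIndex resolves the range to revision pairs.
	revpairs := tr.s.kvindex.Revisions(key, end, ro.Rev)
	if len(revpairs) == 0 {
		return &RangeResult{KVs: nil, Count: 0, Rev: curRev}, nil
	}

	kvs := make([]mvccpb.KeyValue, len(revpairs))
	revBytes := newRevBytes()
	for i, revpair := range revpairs {
		revToBytes(revpair, revBytes)
		// Step 2: fetch that revision from the backend read transaction,
		// which consults the write buffer first and then boltdb.
		_, vs := tr.tx.UnsafeRange(keyBucketName, revBytes, nil, 0)
		if len(vs) != 1 {
			// treeIndex and the backend disagree: the index says this
			// revision exists but the db has no record of it. This is
			// the fatal seen in the log above.
			tr.s.lg.Fatal(
				"range failed to find revision pair",
				zap.Int64("revision-main", revpair.main),
				zap.Int64("revision-sub", revpair.sub),
			)
		}
		if err := kvs[i].Unmarshal(vs[0]); err != nil {
			return nil, err
		}
	}
	return &RangeResult{KVs: kvs, Count: len(revpairs), Rev: curRev}, nil
}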

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately, I don't yet have a way to reproduce it.

Anything else we need to know?

  1. The mvcc revision diverged between nodes.
     (Screenshot: "Screen Shot 2022-07-05 at 12 28 23 PM")
  2. The db size diverged.
     (Screenshot: "Screen Shot 2022-07-05 at 12 28 30 PM")
  3. Compaction failed to apply because of the revision divergence.
  4. The consistency checker was not enabled in this old cluster, so it could not help on-call operators retain the WAL logs in time (see the sketch after this list).
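For reference, etcd's experimental corruption-check flags can surface this kind of divergence early instead of leaving it to be discovered through a fatal read. A minimal sketch of enabling them (both flags exist in 3.4; the interval value here is illustrative):

$ etcd --experimental-initial-corrupt-check=true \
       --experimental-corrupt-check-time=1h

--experimental-initial-corrupt-check compares the member's hash against its peers at startup, and --experimental-corrupt-check-time makes the leader run a periodic hash check across members, raising a corruption alarm on mismatch.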

Etcd version (please run commands below)

$ etcd --version
v3.4.18


Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

Most upvoted comments

@fuweid

  1. Yes. But I did not find any obvious abnormal information in dmesg.
  2. Yes.
  3. Yes. The RaftIndex and AppliedIndex of this node are consistent with the leader's, per `endpoint status` (see the sketch below).
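For anyone who wants to script this comparison, here is a minimal sketch using the clientv3 Maintenance API (this is the same data `endpoint status` prints; the endpoints below are placeholders):

package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoints; substitute the real member client URLs.
	endpoints := []string{"http://10.0.0.1:2379", "http://10.0.0.2:2379", "http://10.0.0.3:2379"}
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		st, err := cli.Status(ctx, ep) // Maintenance.Status, queried per endpoint
		cancel()
		if err != nil {
			fmt.Printf("%s: status error: %v\n", ep, err)
			continue
		}
		// On a healthy cluster the Revision should match across members once
		// writes quiesce; the divergence reported above shows up here.
		fmt.Printf("%s: revision=%d raftIndex=%d raftAppliedIndex=%d dbSize=%d\n",
			ep, st.Header.Revision, st.RaftIndex, st.RaftAppliedIndex, st.DbSize)
	}
}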

@Tachone Would you mind sharing which --experimental-backend-bbolt-freelist-type value you are using? Thanks.

--experimental-backend-bbolt-freelist-type=array, the default value in 3.4.18. @fuweid