etcd 3.5.5: client auth failures on a new member's first startup
What happened?
After upgrading the cluster from 3.5.4 to 3.5.5, we are seeing auth issues on new nodes after their first startup. The cluster runs with auth enabled.
A new node joins the cluster and fetches the snapshot, but then fails all client requests with:
{"level":"error","ts":"2022-10-11T06:54:28.530Z","caller":"auth/range_perm_cache.go:114","msg":"user doesn't exist","user-name":"XXX","stacktrace":"go.etcd.io/etcd/server/v3/auth.(*authStore).isRangeOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/range_perm_cache.go:114\ngo.etcd.io/etcd/server/v3/auth.(*authStore).isOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:870\ngo.etcd.io/etcd/server/v3/auth.(*authStore).IsRangePermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:882\ngo.etcd.io/etcd/server/v3/etcdserver.checkTxnAuth\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:155\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Txn\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:166\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply.go:177\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:61\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2241\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2143\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1384\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1199\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1122\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\texternal/io_etcd_go_etcd_pkg_v3/schedule/schedule.go:157"}
Restarting etcd fixes the issue, and no requests fail after that.
What did you expect to happen?
The new node joins the cluster and starts serving requests.
How can we reproduce it (as minimally and precisely as possible)?
WIP
Anything else we need to know?
Might be related to https://github.com/etcd-io/etcd/pull/14227/files?
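To illustrate the suspected failure mode, here is a minimal, hypothetical Go sketch (not etcd code; all names are invented): an in-memory permission cache is built from the backend at startup, but is not rebuilt when a snapshot replaces the backend wholesale, so permission checks see a stale, empty view until a restart rebuilds it.

```go
package main

import "fmt"

// backend stands in for the bolt-backed auth state.
type backend struct{ users map[string]bool }

// authStore mirrors the pattern of a store with a derived in-memory
// cache (analogous to etcd's rangePermCache) built from the backend.
type authStore struct {
	be    *backend
	cache map[string]bool
}

// refreshCache rebuilds the in-memory cache from the current backend.
func (a *authStore) refreshCache() {
	a.cache = make(map[string]bool, len(a.be.users))
	for u := range a.be.users {
		a.cache[u] = true
	}
}

// isPermitted consults only the cache, as a fast permission check would.
func (a *authStore) isPermitted(user string) error {
	if !a.cache[user] {
		return fmt.Errorf("user doesn't exist: %s", user)
	}
	return nil
}

func main() {
	// A fresh member starts with an empty auth state and caches it.
	a := &authStore{be: &backend{users: map[string]bool{}}}
	a.refreshCache()

	// A snapshot arrives: the backend is swapped, the cache is not.
	a.be = &backend{users: map[string]bool{"XXX": true}}
	fmt.Println(a.isPermitted("XXX")) // stale cache: user doesn't exist

	// A restart (or an explicit recover step) rebuilds the cache.
	a.refreshCache()
	fmt.Println(a.isPermitted("XXX")) // nil: request is now permitted
}
```

If this is what is happening, the restart "fix" makes sense: startup always rebuilds the cache from the (now snapshot-populated) backend.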
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.5
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.18.6
Go OS/Arch: linux/amd64
$ etcdctl version
(output not provided)
Etcd configuration (command line flags or environment variables)
(not provided)
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
+------------------+---------+----------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
| 3dd26ba725e108d1 | started | dfw5a-rh6-2a | https://10.80.109.66:5100 | https://10.80.109.66:5000 | false |
| 4438c22e9bd12975 | started | dfw5a-rg3-5a | https://10.80.98.60:5100 | https://10.80.98.60:5000 | false |
| 6b9cffc673ba60e1 | started | dfw5a-rh11-23b | https://10.80.122.23:5100 | https://10.80.122.23:5000 | false |
| ca3b971d3167b0cf | started | dfw5a-rh11-21a | https://10.80.122.28:5100 | https://10.80.122.28:5000 | false |
| cbfad084fd48d0cf | started | dfw5a-rh11-8a | https://10.80.122.54:5100 | https://10.80.122.54:5000 | false |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
$ etcdctl --endpoints=<member list> endpoint status -w table
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.80.109.66:5000 | 3dd26ba725e108d1 | 3.5.5 | 1.5 GB | false | false | 11952 | 851807436 | 851807436 | |
| 10.80.122.23:5000 | 6b9cffc673ba60e1 | 3.5.5 | 1.7 GB | false | false | 11952 | 851807436 | 851807436 | |
| 10.80.122.28:5000 | ca3b971d3167b0cf | 3.5.5 | 12 GB | false | false | 11952 | 851807437 | 851807437 | |
| 10.80.122.54:5000 | cbfad084fd48d0cf | 3.5.5 | 2.8 GB | true | false | 11952 | 851807437 | 851807437 | |
| 10.80.98.60:5000 | 4438c22e9bd12975 | 3.5.5 | 602 MB | false | false | 11952 | 851807437 | 851807437 | |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Relevant log output
No response
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 15 (14 by maintainers)
Seems like that fixed the issue.
During troubleshooting I found that this issue leads to data inconsistency in the cluster. Is it possible that new nodes with an empty auth cache fail to apply entries from the raft log due to auth failures, but still advance their position in the log?
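The divergence question above can be sketched with a toy model (hypothetical Go, not etcd code): if a member rejects an entry locally, e.g. because of a stale auth check, but still advances its applied index, its key-value state silently diverges from peers that applied the entry, while any comparison based on the index alone still passes.

```go
package main

import "fmt"

// entry is a toy raft log entry: a write attributed to a user.
type entry struct{ key, val, user string }

// member is a toy replica with its own view of the auth store.
type member struct {
	applied int
	kv      map[string]string
	users   map[string]bool
}

// apply skips the write if the local auth check fails, but advances
// the applied index regardless of the outcome.
func (m *member) apply(e entry) {
	if m.users[e.user] {
		m.kv[e.key] = e.val
	}
	m.applied++
}

func main() {
	log := []entry{{"k1", "v1", "XXX"}, {"k2", "v2", "XXX"}}

	healthy := &member{kv: map[string]string{}, users: map[string]bool{"XXX": true}}
	stale := &member{kv: map[string]string{}, users: map[string]bool{}} // empty auth cache

	for _, e := range log {
		healthy.apply(e)
		stale.apply(e)
	}

	fmt.Println(healthy.applied == stale.applied) // true: applied indexes match
	fmt.Println(len(healthy.kv), len(stale.kv))   // 2 0: states have diverged
}
```

In this model the raft applied indexes in `endpoint status` would look identical across members even though the stale member dropped the writes, which matches the kind of inconsistency described above.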
Looks good to me.
All related PRs are merged. I think we're ready to release 3.4.22 and 3.5.6. I will release 3.4.22 sometime in the following week. Please add your comments or concerns in https://github.com/etcd-io/etcd/issues/14651. @mitake @veshij @serathius @spzala @ptabor