etcd: 3.5.5: client auth failures on new member first startup

What happened?

After upgrading the cluster from 3.5.4 to 3.5.5, we are seeing auth issues on new nodes after their first startup. The cluster runs with auth enabled.

A new node joins the cluster and fetches the snapshot, but then fails all client requests with:

{"level":"error","ts":"2022-10-11T06:54:28.530Z","caller":"auth/range_perm_cache.go:114","msg":"user doesn't exist","user-name":"XXX","stacktrace":"go.etcd.io/etcd/server/v3/auth.(*authStore).isRangeOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/range_perm_cache.go:114\ngo.etcd.io/etcd/server/v3/auth.(*authStore).isOpPermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:870\ngo.etcd.io/etcd/server/v3/auth.(*authStore).IsRangePermitted\n\texternal/io_etcd_go_etcd_server_v3/auth/store.go:882\ngo.etcd.io/etcd/server/v3/etcdserver.checkTxnAuth\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:155\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Txn\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:166\ngo.etcd.io/etcd/server/v3/etcdserver.(*applierV3backend).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply.go:177\ngo.etcd.io/etcd/server/v3/etcdserver.(*authApplierV3).Apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/apply_auth.go:61\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntryNormal\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2241\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:2143\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1384\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1199\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\texternal/io_etcd_go_etcd_server_v3/etcdserver/server.go:1122\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\texternal/io_etcd_go_etcd_pkg_v3/schedule/schedule.go:157"}

Restarting etcd fixes the issue, and no requests fail after that.
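For what it's worth, the failure is visible from any authenticated client pointed directly at the new member. A minimal clientv3 sketch (endpoint, user, password, and key are placeholders, not our actual setup; TLS config omitted for brevity):

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect directly to the newly added member with an existing user.
	// Endpoint and credentials below are placeholders.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.80.0.1:5000"},
		Username:    "XXX",
		Password:    "secret",
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Against the affected member this returns an auth/permission error
	// until the member is restarted, while its server log shows the
	// "user doesn't exist" message from range_perm_cache.go.
	_, err = cli.Get(ctx, "some-key")
	fmt.Println("get error:", err)
}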

What did you expect to happen?

The new node joins the cluster and starts serving requests.

How can we reproduce it (as minimally and precisely as possible)?

WIP

Anything else we need to know?

Might be related to https://github.com/etcd-io/etcd/pull/14227/files?

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.5
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.18.6
Go OS/Arch: linux/amd64

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+----------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |      NAME      |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
| 3dd26ba725e108d1 | started |   dfw5a-rh6-2a | https://10.80.109.66:5100 | https://10.80.109.66:5000 |      false |
| 4438c22e9bd12975 | started |   dfw5a-rg3-5a |  https://10.80.98.60:5100 |  https://10.80.98.60:5000 |      false |
| 6b9cffc673ba60e1 | started | dfw5a-rh11-23b | https://10.80.122.23:5100 | https://10.80.122.23:5000 |      false |
| ca3b971d3167b0cf | started | dfw5a-rh11-21a | https://10.80.122.28:5100 | https://10.80.122.28:5000 |      false |
| cbfad084fd48d0cf | started |  dfw5a-rh11-8a | https://10.80.122.54:5100 | https://10.80.122.54:5000 |      false |
+------------------+---------+----------------+---------------------------+---------------------------+------------+
$ etcdctl --endpoints=<member list> endpoint status -w table
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.80.109.66:5000 | 3dd26ba725e108d1 |   3.5.5 |  1.5 GB |     false |      false |     11952 |  851807436 |          851807436 |        |
| 10.80.122.23:5000 | 6b9cffc673ba60e1 |   3.5.5 |  1.7 GB |     false |      false |     11952 |  851807436 |          851807436 |        |
| 10.80.122.28:5000 | ca3b971d3167b0cf |   3.5.5 |   12 GB |     false |      false |     11952 |  851807437 |          851807437 |        |
| 10.80.122.54:5000 | cbfad084fd48d0cf |   3.5.5 |  2.8 GB |      true |      false |     11952 |  851807437 |          851807437 |        |
|  10.80.98.60:5000 | 4438c22e9bd12975 |   3.5.5 |  602 MB |     false |      false |     11952 |  851807437 |          851807437 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

diff --git a/server/auth/store.go b/server/auth/store.go
index b436939b2..e745fb98e 100644
--- a/server/auth/store.go
+++ b/server/auth/store.go
@@ -377,6 +377,7 @@ func (as *authStore) Recover(be backend.Backend) {
        as.enabled = enabled
        if enabled {
                as.tokenProvider.enable()
+               as.refreshRangePermCache(tx)
        }
        as.enabledMu.Unlock()
 }

Seems like it fixed the issue.
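For context: Recover runs when the auth store is reloaded from the backend, which is exactly what happens when a new member applies a snapshot. The one-line fix repopulates the in-memory range permission cache at that point; without it, the cache stays empty until a restart, so permission checks during apply report "user doesn't exist" even though the user exists in the backend. A much-simplified sketch of the idea (the types and helpers are illustrative stand-ins, not etcd's actual internals):

// Simplified sketch of what the added refreshRangePermCache call does.
package main

import "fmt"

type permission struct {
	key, rangeEnd string
	write         bool
}

// backend stands in for the persisted auth data (users/roles in bbolt).
type backend struct {
	perms map[string][]permission // userName -> union of role permissions
}

type authStore struct {
	rangePermCache map[string][]permission // in-memory cache used by permission checks
}

// refreshRangePermCache rebuilds the in-memory cache from persisted state.
// In etcd the real method walks the users/roles via a backend transaction;
// the fix calls it from Recover so the cache is populated right after the
// auth store is reloaded (e.g. from a snapshot).
func (as *authStore) refreshRangePermCache(be *backend) {
	as.rangePermCache = make(map[string][]permission)
	for user, perms := range be.perms {
		as.rangePermCache[user] = perms
	}
}

func main() {
	be := &backend{perms: map[string][]permission{
		"XXX": {{key: "foo", rangeEnd: "fop", write: false}},
	}}
	as := &authStore{}
	// Without this call after recovery, lookups for user "XXX" see an
	// empty map and report "user doesn't exist".
	as.refreshRangePermCache(be)
	fmt.Println(as.rangePermCache["XXX"])
}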

During troubleshooting I found that this issue led to data inconsistency in the cluster. Is it possible that new nodes with an empty auth cache fail to apply entries from the raft log due to auth failures, but still advance their position in the log?
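If that is what happened, the divergence should be observable by comparing each member's KV hash. A rough sketch using clientv3's Maintenance API (endpoints and credentials are placeholders; TLS config omitted):

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholders: list every member's client URL here.
	endpoints := []string{
		"https://10.80.0.1:5000",
		"https://10.80.0.2:5000",
		"https://10.80.0.3:5000",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		Username:    "root", // placeholder credentials
		Password:    "secret",
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Ask every member for its KV hash. rev=0 hashes at that member's
	// latest revision; pin a fixed revision for a strict comparison.
	for _, ep := range endpoints {
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			fmt.Printf("%s: error: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: hash=%d revision=%d\n", ep, resp.Hash, resp.Header.Revision)
	}
}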

FYI, @chaochn47 and I are working on adding a linearizability test case for authorization. It should allow us to test other scenarios where an injected failure could cause an authorization issue. Following known data inconsistencies should also allow us to focus our work on features that had issues in the past.

Looks good to me.

All related PRs are merged. I think we're ready to release 3.4.22 and 3.5.6. I will release 3.4.22 sometime in the following week. Please add your comments or concerns in https://github.com/etcd-io/etcd/issues/14651. @mitake @veshij @serathius @spzala @ptabor