etcd: etcd 3.5 db_size problem
### What happened?
I have a 3-node etcd cluster running version 3.5.2. I noticed that the endpoints' db_size is constantly growing. I have to run compaction and defrag manually to keep db_size from reaching the limit. I did not see a similar problem on version 3.2.
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
|---|---|---|---|---|---|---|---|---|---|
| 10.201.64.106:2379 | 6af28eee6b8fd63a | 3.5.2 | 18 MB | true | false | 3 | 7509221 | 7509221 | |
| 10.201.64.107:2379 | 8t2ae31d2c14413e | 3.5.2 | 18 MB | false | false | 3 | 7509221 | 7509221 | |
| 10.222.82.121:2379 | c6131f42ed372576 | 3.5.2 | 18 MB | false | false | 3 | 7509221 | 7509221 | |
### What did you expect to happen?
I expect the db size not to grow that fast, or at least not to have to run the defrag process manually.
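For readers following along, the manual routine described above might look roughly like the sketch below. It follows the general pattern from the etcd maintenance guide: read the current revision, compact up to it, then defragment. The endpoint list is taken from the configuration in this report, and the function names `extract_revision` and `compact_and_defrag` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: compact to the current revision, then defragment the members.
# ENDPOINTS is an assumption based on the addresses shown in this report.
ENDPOINTS="${ENDPOINTS:-http://10.201.64.106:2379,http://10.201.64.107:2379,http://10.222.82.121:2379}"

# Hypothetical helper: read `etcdctl endpoint status -w json` output on stdin
# and print the first "revision" value found in it.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Hypothetical helper: run the compaction and defrag steps in order.
compact_and_defrag() {
  rev=$(etcdctl --endpoints="$ENDPOINTS" endpoint status -w json | extract_revision)
  if [ -z "$rev" ]; then
    echo "could not read current revision" >&2
    return 1
  fi
  # Discard key history older than the current revision.
  etcdctl --endpoints="$ENDPOINTS" compact "$rev" || return 1
  # Return the space freed by compaction to the filesystem.
  etcdctl --endpoints="$ENDPOINTS" defrag
}
```

Calling `compact_and_defrag` from a scheduled job would automate the routine. etcd's `--auto-compaction-retention` flag can take over the compaction step, but defragmentation still has to be triggered from outside etcd.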
### How can we reproduce it (as minimally and precisely as possible)?
### Anything else we need to know?
No response
### Etcd version (please run commands below)
<details>
$ etcd --version
# paste output here
$ etcdctl version
etcdctl version: 3.5.2
API version: 3.5
</details>
### Etcd configuration (command line flags or environment variables)
<details>
# paste your configuration here
--name milano01 \
--data-dir /var/lib/etcd \
--initial-advertise-peer-urls http://10.201.64.106:2380 \
--listen-peer-urls http://10.201.64.106:2380 \
--listen-client-urls http://10.201.64.106:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.201.64.106:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster milano01=http://10.201.64.106:2380,milano02=http://10.201.64.107:2380,milano03=http://10.222.82.121:2380 \
--initial-cluster-state new \
--heartbeat-interval 1000 \
--election-timeout 5000
</details>
### Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
<details>
```console
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
```
</details>

### Relevant log output
No response
About this issue
- State: closed
- Created 2 years ago
- Comments: 22 (12 by maintainers)
😄 For me - yes, we’ll take our issue elsewhere, I can’t speak for others
The timing of deploying the `defrag` daily job makes sense. It explains why you see the huge db increase in the second diagram.

The boltDB file can only be opened by one program at a time, so you need to stop the etcd instance/pod when running the `etcd-dump-db` tool.

We have seen similar behaviour after upgrading from v3.5.0 to v3.5.4.
etcd is used as a backend for Kubernetes, and the API server requests have not changed at all, but the etcd DB size has started growing exponentially.
The change in pattern also coincides with the time we released this change: https://github.com/utilitywarehouse/tf_kube_ignition/commit/8ff0d098a92d4433a936157ca3ec208a11521058, which was 19th Aug 05:00 UTC.
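When comparing growth patterns like the ones described in this thread, it can help to separate live data from free pages that a defrag would reclaim. The sketch below assumes the JSON output of `etcdctl endpoint status -w json` carries `dbSize` and `dbSizeInUse` fields (as it does in recent 3.x releases, to the best of my knowledge). The function name `reclaimable_percent` is hypothetical.

```shell
#!/bin/sh
# Sketch only: estimate what share of the db file is free space.
# Reads one member's `etcdctl endpoint status -w json` output on stdin.
reclaimable_percent() {
  json=$(cat)
  # Total size of the bolt file on disk, in bytes.
  size=$(printf '%s\n' "$json" | grep -o '"dbSize":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$')
  # Bytes actually holding live data.
  used=$(printf '%s\n' "$json" | grep -o '"dbSizeInUse":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$')
  if [ -z "$size" ] || [ -z "$used" ] || [ "$size" -le 0 ]; then
    return 1
  fi
  # Integer percentage of the file that defrag could hand back.
  echo $(( (size - used) * 100 / size ))
}
```

For example, `etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json | reclaimable_percent` prints an integer percentage. A high value suggests fragmentation, while a low value alongside a growing file suggests the data itself is growing.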