etcd: etcd 3.5 db_size problem

What happened?

I have a 3-node etcd cluster running version 3.5.2. I noticed that each endpoint's db_size is constantly growing, so I have to perform compaction and defragmentation manually to keep the db_size from reaching the quota limit. I did not face any similar problem with version 3.2.

| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
|----------|----|---------|---------|-----------|------------|-----------|------------|--------------------|--------|
| 10.201.64.106:2379 | 6af28eee6b8fd63a | 3.5.2 | 18 MB | true | false | 3 | 7509221 | 7509221 | |
| 10.201.64.107:2379 | 8t2ae31d2c14413e | 3.5.2 | 18 MB | false | false | 3 | 7509221 | 7509221 | |
| 10.222.82.121:2379 | c6131f42ed372576 | 3.5.2 | 18 MB | false | false | 3 | 7509221 | 7509221 | |
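For reference, one way to tell whether the growth is live data or just fragmentation is to compare the total db size with the in-use size. A minimal sketch, assuming the JSON field names and metric names of etcd 3.5 (not output taken from this cluster):

```console
# dbSize vs dbSizeInUse; a large gap means free pages that only a defrag reclaims,
# a small gap means the amount of live data is really growing
$ etcdctl --endpoints=10.201.64.106:2379 endpoint status -w json | grep -o '"dbSize[^,]*'

# the same numbers exposed as Prometheus metrics
$ curl -s http://10.201.64.106:2379/metrics | grep etcd_mvcc_db_total_size
```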

What did you expect to happen?

I expect the db size not to grow that fast, or at least not to have to run the defrag process manually.
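For context, the manual procedure referred to above is roughly the following. A sketch only, assuming the v3 API and the endpoints from the table; the revision handling may differ from what is actually run on this cluster:

```console
# compact up to the current revision, then defragment each member
# (compaction is cluster-wide, defrag has to be run per endpoint)
$ rev=$(etcdctl --endpoints=10.201.64.106:2379 endpoint status -w json \
    | grep -o '"revision":[0-9]*' | grep -o '[0-9]*$')
$ etcdctl --endpoints=10.201.64.106:2379 compaction $rev
$ etcdctl --endpoints=10.201.64.106:2379 defrag
$ etcdctl --endpoints=10.201.64.107:2379 defrag
$ etcdctl --endpoints=10.222.82.121:2379 defrag
```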

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

No response

### Etcd version (please run commands below)

<details>

```console
$ etcd --version
# paste output here

$ etcdctl version
etcdctl version: 3.5.2
API version: 3.5
```

</details>


### Etcd configuration (command line flags or environment variables)

<details>

```
 --name milano01 \
 --data-dir /var/lib/etcd \
 --initial-advertise-peer-urls http://10.201.64.106:2380 \
 --listen-peer-urls http://10.201.64.106:2380 \
 --listen-client-urls http://10.201.64.106:2379,http://127.0.0.1:2379 \
 --advertise-client-urls http://10.201.64.106:2379 \
 --initial-cluster-token etcd-cluster-1 \
 --initial-cluster milano01=http://10.201.64.106:2380,milano02=http://10.201.64.107:2380,milano03=http://10.222.82.121:2380 \
 --initial-cluster-state new \
 --heartbeat-interval 1000 \
 --election-timeout 5000
```

</details>
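Worth noting: the flags above do not enable server-side auto-compaction, so old key revisions accumulate until someone compacts manually. A hedged sketch of the extra flags etcd 3.5 supports for this; the 1h retention is an assumed example value, and a defrag is still needed afterwards to hand the freed pages back to the OS:

```
 --auto-compaction-mode periodic \
 --auto-compaction-retention 1h
```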


### Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

<details>

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
```

</details>


### Relevant log output

No response

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 22 (12 by maintainers)

Most upvoted comments

😄 For me - yes, we’ll take our issue elsewhere; I can’t speak for others.

The timing of deploying the daily defrag job makes sense. It explains why you see the huge db size increase in the second diagram.

  1. The disk space wasn’t reclaimed at all before the daily defrag job was deployed on Sep 7, so you saw the continuous db size increase;
  2. The different amplitude between the two environments might be due to different traffic rates.
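For anyone wanting to replicate the setup being discussed, a daily defrag job could look roughly like this. A sketch only: the cron path, schedule, and endpoint list are assumptions, and members are defragged one at a time because defrag blocks reads and writes on the member while it runs:

```console
# /etc/cron.d/etcd-defrag (assumed path): once a day, defrag the members one by one
0 3 * * * root for ep in 10.201.64.106:2379 10.201.64.107:2379 10.222.82.121:2379; do etcdctl --endpoints=$ep defrag; done
```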

I had a go using etcd-dump-db, but so far I’m getting:

```console
$ sudo /tmp/./etcd-dump-db list-bucket /var/lib/etcd/
2022/09/14 22:47:02 failed to open bolt DB timeout
```

The boltDB file can only be opened by one program at a time, so you need to stop the etcd instance/pod when running the etcd-dump-db tool.
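That is, something along these lines (a sketch; the systemd unit name is an assumption, the binary path is taken from the command above):

```console
# stop the member so bolt releases its file lock, inspect, then start it again
$ sudo systemctl stop etcd
$ sudo /tmp/./etcd-dump-db list-bucket /var/lib/etcd/
$ sudo systemctl start etcd
```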

We have seen similar behaviour after upgrading from v3.5.0 to v3.5.4.

ETCD is used as a backend for Kubernetes; the APIServer requests have not changed at all:

[screenshot: 2022-08-31-173657_2560x1273_scrot]

But the ETCD DB size has started growing exponentially:

[screenshot: 2022-08-31-173715_2560x942_scrot]

The change in pattern also coincides with the time we have released this change: https://github.com/utilitywarehouse/tf_kube_ignition/commit/8ff0d098a92d4433a936157ca3ec208a11521058

```
commit 2e8d2dae6396193a1cfd2dbe6161cbef2750b870
Author: George Angel <george-angel@users.noreply.github.com>
Date:   Thu Aug 18 19:01:30 2022 +1000

    Deployed fixed version of ETCD | prod-aws (#9666)
```

Which is 19th Aug 05:00 UTC.