etcd: Memory leak with distributed tracing enabled

What happened?

Adding --experimental-enable-distributed-tracing works, but it causes a memory leak of about 1 GB per hour in our setup. Instead of the expected ~2 GB, RSS grew to about 12 GB over 7 hours.

What did you expect to happen?

Stable memory usage around 1.8-2.0 GB of RSS.

How can we reproduce it (as minimally and precisely as possible)?

Run with --experimental-enable-distributed-tracing for a few hours. It is sufficient to enable it on a single member.
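For reference, a minimal sketch of how we watch the growth (single member with a throwaway data dir; the name, data dir, and sampling loop are placeholders, not our production setup):

# hypothetical single-member repro; --name and --data-dir are placeholders
etcd --name tracing-repro --data-dir /tmp/etcd-tracing-repro \
  --experimental-enable-distributed-tracing --logger=zap --log-level=info &

# sample the etcd RSS once a minute; with the leak it climbs steadily
while sleep 60; do
  printf '%s %s KiB\n' "$(date +%H:%M:%S)" "$(ps -o rss= -C etcd | head -1)"
done

With the flag enabled, the sampled RSS keeps growing well past the usual ~2 GB instead of flattening out.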

Anything else we need to know?

The tracing collector endpoint does not need to be configured or listening. Running otelcol on port 4317 does not change anything (beyond actually making tracing work).
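For completeness, when we did run a collector it was roughly this (the --experimental-distributed-tracing-address flag name and the otelcol image are from memory, so treat them as assumptions; localhost:4317 should be the default target anyway):

# hypothetical: local OpenTelemetry collector on the default OTLP gRPC port
podman run -d --name otelcol -p 4317:4317 otel/opentelemetry-collector:latest

# point etcd at it; the memory growth looks the same with or without this flag
etcd --experimental-enable-distributed-tracing \
  --experimental-distributed-tracing-address=localhost:4317 ...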

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.0
Git SHA: f99cada05
Go Version: go1.16.6
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd on Kubernetes (OKD / OpenShift 4.9), 3 members.

etcd --experimental-enable-distributed-tracing --logger=zap --log-level=info --initial-advertise-peer-urls=https://10.10.0.102:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com..crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.10.0.102:2379 --listen-client-urls=https://0.0.0.0:2379,unixs://10.10.0.102:0 --listen-peer-urls=https://0.0.0.0:2380 --metrics=extensive --listen-metrics-urls=https://0.0.0.0:9978

Running in cri-o.

Etcd debug information (please run the commands below, feel free to obfuscate the IP address or FQDN in the output)

[root@master-1 /]# etcdctl member list -w table
+---------------+---------+----------------------+--------------------------+--------------------------+------------+
|      ID       | STATUS  |         NAME         |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+---------------+---------+----------------------+--------------------------+--------------------------+------------+
| 10f8cf6269xxx | started | master-2.example.com | https://10.10.0.103:2380 | https://10.10.0.103:2379 |      false |
|  a2bbe7149xxx | started | master-1.example.com | https://10.10.0.102:2380 | https://10.10.0.102:2379 |      false |
|   acb2c160xxx | started | master-0.example.com | https://10.10.0.101:2380 | https://10.10.0.101:2379 |      false |
+---------------+---------+----------------------+--------------------------+--------------------------+------------+

Relevant log output

No fatal issues in the logs.

Most upvoted comments

It looks like OpenShift customized etcd? @hexfusion, could you confirm this?

I can confirm it's not an upstream binary; this is the downstream repo the build comes from [1]. The changes to etcd itself would be minimal. 3.5.0 uses a pretty old version of otel (pre-v1), so it's possible that it had a bug as well.

[1] https://github.com/openshift/etcd
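One way to sanity-check how far the downstream build diverges, and which otel version each side pins, would be to compare the module lists. This is only a sketch; the server/ module path and the branch names are assumptions about the 3.5 repo layout:

# check out the matching release branches first, then compare pinned otel modules
git clone https://github.com/etcd-io/etcd upstream-etcd
git clone https://github.com/openshift/etcd openshift-etcd
(cd upstream-etcd/server && go list -m all | grep go.opentelemetry.io)
(cd openshift-etcd/server && go list -m all | grep go.opentelemetry.io)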

Great, closing issue as fixed. Thanks for looking into this.

@serathius I was not able to reproduce the issue.

We have already bumped otel to 1.0.1 in #14312. @baryluk Could you please double-check whether you can still see this issue? thx
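If it helps, one quick way to confirm which otel modules a given etcd binary was built with (the module path filter is an assumption; output format varies a bit by Go version):

# print the module versions embedded in the binary and filter for otel
go version -m "$(command -v etcd)" | grep go.opentelemetry.io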

Sure, I can try to test it on Monday.