etcd: Memory leak with distributed tracing enabled
What happened?
Adding --experimental-enable-distributed-tracing works, but it causes a memory leak of about 1 GB per hour in our setup. Instead of the expected ~2 GB, memory usage reached about 12 GB in 7 hours.
What did you expect to happen?
Stable memory usage around 1.8-2.0 GB of RSS.
How can we reproduce it (as minimally and precisely as possible)?
Run etcd with --experimental-enable-distributed-tracing for a few hours. It is sufficient to enable it on one member.
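To observe the growth, sampling the member's RSS periodically is enough. A minimal sketch, assuming `pidof etcd` resolves to the member's process on the host or inside the container:

# Print a timestamped RSS sample every 5 minutes
$ while true; do date; grep VmRSS /proc/$(pidof etcd)/status; sleep 300; done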
Anything else we need to know?
The tracing collector endpoint doesn't need to be configured or listening; the leak occurs either way. Having otelcol listen on 4317 doesn't change anything (beyond actually making tracing work).
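To narrow down where the retained memory sits, a heap profile should help. A rough sketch, assuming the member is restarted with etcd's --enable-pprof flag so the /debug/pprof endpoints are served on the client URL; the certificate file names are placeholders for the client-auth certs from the configuration below:

# Fetch a heap profile from the pprof endpoint and summarize it
$ curl --cacert ca.crt --cert client.crt --key client.key \
    https://10.10.0.102:2379/debug/pprof/heap -o heap.pb.gz
$ go tool pprof -top heap.pb.gz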
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.0
Git SHA: f99cada05
Go Version: go1.16.6
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5
Etcd configuration (command line flags or environment variables)
etcd on Kubernetes (OKD / OpenShift 4.9), 3 members.
etcd --experimental-enable-distributed-tracing --logger=zap --log-level=info --initial-advertise-peer-urls=https://10.10.0.102:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com..crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.10.0.102:2379 --listen-client-urls=https://0.0.0.0:2379,unixs://10.10.0.102:0 --listen-peer-urls=https://0.0.0.0:2380 --metrics=extensive --listen-metrics-urls=https://0.0.0.0:9978
Running in CRI-O.
Etcd debug information (please run commands below; feel free to obfuscate the IP address or FQDN in the output)
[root@master-1 /]# etcdctl member list -w table
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
| 10f8cf6269xxx | started | master-2.example.com | https://10.10.0.103:2380 | https://10.10.0.103:2379 | false |
| a2bbe7149xxx | started | master-1.example.com | https://10.10.0.102:2380 | https://10.10.0.102:2379 | false |
| acb2c160xxx | started | master-0.example.com | https://10.10.0.101:2380 | https://10.10.0.101:2379 | false |
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
Relevant log output
No fatal issues in the logs.
I can confirm it's not an upstream binary; this is the downstream repo the build comes from [1]. The changes to etcd itself would be minimal. 3.5.0 uses a pretty old version of otel (pre-v1), so it's possible that it had a bug as well.
[1] https://github.com/openshift/etcd
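To confirm which otel version a given build actually uses, the module info embedded in the binary can be inspected (a sketch; the binary path is an assumption):

# List the go.opentelemetry.io modules compiled into the etcd binary
$ go version -m /usr/bin/etcd | grep opentelemetry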
Great, closing issue as fixed. Thanks for looking into this.
@serathius I was not able to reproduce the issue.
Sure, I can try to test it on Monday.