envoy: Envoy OOM problem with TLS
Hi, we are seeing unbounded growth in envoy.server.memory_heap_size, which eventually causes the kernel to kill the envoy process. Details below:
We had a lightstep-collector deployment on a 4-node cluster (r4.large EC2 instances with 15G RAM), with envoyproxy also running on each node (without processing any traffic). The collectors received system traces directly (from about 40 servers) on a secure port over gRPC. The hosts were running at about 9GB system memory utilization.
Then we added a TLS listener to the envoyproxy on the same 4 hosts to intercept the traces and route them to the localhost lightstep collector on a plaintext port. Traffic flowed correctly through envoy to the collector, but after some time envoy crashed because it ran out of memory.
Looking at envoy.server.memory_heap_size, we see it increasing linearly at 100MB/hr for about 6 hours, then climbing at a faster rate to reach ~9GB in under 12 hours, at which point the kernel killed envoy for running out of memory (the system and the other processes accounted for the remaining memory).
Is there a memory leak in envoy, or is there a config I can set to throttle envoy or control its memory buffers? I am not reporting this as a bug because I am not sure whether it is a config issue.
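For anyone trying to reproduce the growth numbers above, here is a rough Python sketch of how the heap stat could be sampled and turned into a rate. It assumes the Envoy admin interface is enabled and reachable at `http://127.0.0.1:9901` (the config in this post does not show an admin block, so that address is a guess), and that the stat appears in the plain-text `/stats` output as `server.memory_heap_size`. All function names are made up for illustration.

```python
import re
import urllib.request

# Assumption: admin interface address; not part of the config posted here.
ADMIN_STATS_URL = "http://127.0.0.1:9901/stats"


def fetch_stats_text(url=ADMIN_STATS_URL, timeout=5):
    """Fetch the plain-text stats dump from the Envoy admin endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_heap_size(stats_text):
    """Return server.memory_heap_size in bytes, or None if the stat is absent."""
    match = re.search(r"^server\.memory_heap_size: (\d+)$", stats_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_rate_mb_per_hour(samples):
    """Average growth between the first and last (unix_seconds, heap_bytes) sample."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    return (b1 - b0) / (1024 * 1024) / hours


def implied_second_phase_rate(first_rate_mb_hr, first_hours, total_mb, total_hours):
    """Rate needed in the second phase to reach total_mb by total_hours."""
    grown_in_first_phase = first_rate_mb_hr * first_hours
    return (total_mb - grown_in_first_phase) / (total_hours - first_hours)


if __name__ == "__main__":
    # Figures reported above: 100MB/hr for ~6 hours, ~9GB (taken as 9000MB)
    # reached in under 12 hours. Since it took less than 12 hours, the result
    # is a lower bound on the second-phase rate.
    print(implied_second_phase_rate(100, 6, 9000, 12))
```

With the reported figures, the heap would have grown by only about 600MB in the first 6 hours, so the remaining ~8.4GB must have accumulated at 1400MB/hr or more afterwards, roughly 14 times the initial rate.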
Over the whole period, the CPU utilization on the 4 hosts was fairly low, hovering around 18%.
The relevant envoy config is:
static_resources:
  listeners:
  # Local Application gRPC listener
  - address:
      socket_address:
        address: 0.0.0.0
        port_value: 15149
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        config:
          codec_type: auto
          stat_prefix: grpc
          route_config:
            name: local_grpc_route
            virtual_hosts:
            - name: service
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: grpc
          http_filters:
          - name: envoy.router
            config: {}
          access_log:
          - name: envoy.file_access_log
            config:
              path: /var/log/envoy/grpc_access.log
      tls_context:
        common_tls_context:
          tls_certificates:
          - certificate_chain:
              filename: /path/to.bundle.pem
            private_key:
              filename: /path/to.key.pem
  clusters:
  - name: grpc
    connect_timeout: 0.25s
    type: static
    lb_policy: round_robin
    http2_protocol_options: {}
    hosts:
    - socket_address:
        address: 127.0.0.1
        port_value: 5150
About this issue
- State: closed
- Created 6 years ago
- Comments: 19 (10 by maintainers)
@mattklein123 I’ve linked a discussion I started on the contour repository about the memory consumption we observed in envoy when adding TLS routes. It feels related to this issue. I’ve posted the memory leaks reported by heap tracing there. If I believe the output of pprof top, I’m seeing GBs of leaks?