envoy: Memory leak in dynamic forward proxy mode w/ DNS Cache
Description:
We have an envoy in dynamic forward proxy mode. It usually behaves well, but sometimes it gets into a state where it tries to allocate tens of MiB/s and eventually gets OOM’ed.
Full heap profiles show that most of the memory is inside DnsCacheImpl, even though in theory the cache should be limited to 10000 entries.
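To put a number on the growth before the OOM, one option is to sample the admin /memory endpoint twice and divide the change in its allocated field by the interval. This is a minimal sketch, assuming /memory returns JSON with a field like "allocated": "7011024" and that ADMIN is set to the admin listener's host:port; the helper names allocated_bytes, alloc_rate and sample_growth are hypothetical. Note that it measures net growth of in-use heap, not gross allocation churn.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for estimating heap growth via Envoy's admin /memory
# endpoint. Assumes the JSON contains a field like "allocated": "7011024".

# Print the "allocated" byte count from /memory JSON read on stdin.
allocated_bytes() {
  sed -n 's/.*"allocated": *"\{0,1\}\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Net growth in bytes per second between two samples taken $3 seconds apart.
alloc_rate() {
  local before=$1 after=$2 interval=$3
  echo $(( (after - before) / interval ))
}

# Sample /memory twice and report net growth in MiB/s (integer, truncated).
# ADMIN must be set to the admin listener's host:port (assumption).
sample_growth() {
  : "${ADMIN:?set ADMIN to the admin host:port}"
  local interval=${1:-10} a b
  a=$(curl -s "http://$ADMIN/memory" | allocated_bytes)
  sleep "$interval"
  b=$(curl -s "http://$ADMIN/memory" | allocated_bytes)
  echo "$(( $(alloc_rate "$a" "$b" "$interval") / 1048576 )) MiB/s"
}
```

For example, ADMIN=localhost:XXX sample_growth 30 averages over 30 seconds, which smooths out short bursts.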
As a side note, I’ve also noticed that envoy tries to resolve internal search paths (*.svc.cluster.local.), even though no_default_search_domain is set to true.
Heap dumps
Diff heap profile:
Raw profiles: https://gist.github.com/rbtz-openai/13d35aea14013f12273c6aa7478184cb
Admin and Stats Output:
"version": "b5ca88acee3453c9459474b8f22215796eff4dde/1.28.0/Clean/RELEASE/BoringSSL",
There are a lot of clusters due to dynamic forward proxy mode:
# curl -s localhost:XXX/clusters | fgrep -c hostname
6752
$ curl -s localhost:XXX/stats/prometheus | fgrep dns
# TYPE envoy_dns_cares_get_addr_failure counter
envoy_dns_cares_get_addr_failure{} 310
# TYPE envoy_dns_cares_not_found counter
envoy_dns_cares_not_found{} 178
# TYPE envoy_dns_cares_resolve_total counter
envoy_dns_cares_resolve_total{} 186188
# TYPE envoy_dns_cares_timeouts counter
envoy_dns_cares_timeouts{} 32
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_cache_load counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_cache_load{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_attempt counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_attempt{} 184881
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_failure counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_failure{} 667
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_success counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_success{} 184213
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_timeout counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_timeout{} 365
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_rq_pending_overflow counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_rq_pending_overflow{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_added counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_added{} 33253
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_address_changed counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_address_changed{} 90554
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_overflow counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_overflow{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_removed counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_removed{} 26501
# TYPE envoy_dns_cares_pending_resolutions gauge
envoy_dns_cares_pending_resolutions{} 1
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_open gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_open{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_remaining gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_remaining{} 1024
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts{} 6752
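The counters above can be reconciled with the max_hosts question: host_added minus host_removed is 33253 - 26501 = 6752, which matches both the num_hosts gauge and the hostname count from /clusters, and host_overflow is 0. So the cache itself stays well under its 10000-entry limit, suggesting the memory growth is not simply a matter of too many entries. A minimal sketch of that arithmetic over the Prometheus output follows; dfp_host_balance is a hypothetical helper, and the metric names are the ones in the dump above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reconcile the dynamic forward proxy DNS cache counters
# in Prometheus-format stats read on stdin. Metric names as in the dump above.
dfp_host_balance() {
  awk '
    /^#/ { next }                              # skip "# TYPE" lines
    /_cache_config_host_added[{]/    { added = $2 }
    /_cache_config_host_removed[{]/  { removed = $2 }
    /_cache_config_num_hosts[{]/     { hosts = $2 }
    /_cache_config_host_overflow[{]/ { overflow = $2 }
    END {
      # Live entries implied by the counters vs. the num_hosts gauge.
      printf "added=%d removed=%d live=%d num_hosts=%d overflow=%d\n",
        added, removed, added - removed, hosts, overflow
    }'
}
```

Usage: curl -s localhost:XXX/stats/prometheus | dfp_host_balance. On the dump above this prints added=33253 removed=26501 live=6752 num_hosts=6752 overflow=0.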
Config:
The interesting part of the config is the dynamic forward proxy cluster with the DNS cache (the same dns_cache_config is used in http_filters):
cluster_type:
  name: envoy.clusters.dynamic_forward_proxy
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
    dns_cache_config:
      name: dynamic_forward_proxy_cache_config
      max_hosts: 10000
      dns_lookup_family: V4_ONLY
      typed_dns_resolver_config:
        name: envoy.network.dns_resolver.cares
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
          resolvers:
          - socket_address:
              address: 8.8.8.8
              port_value: 53
          - socket_address:
              address: 1.1.1.1
              port_value: 53
          - socket_address:
              address: 8.8.4.4
              port_value: 53
          - socket_address:
              address: 1.0.0.1
              port_value: 53
          dns_resolver_options:
            use_tcp_for_dns_lookups: true
            # There is no need to use the default search domain when resolving external requests
            no_default_search_domain: true
About this issue
- State: closed
- Created 7 months ago
- Comments: 16 (9 by maintainers)
Sorry for the late reply. Yes, after removing brotli, the DFP leak, if it exists, is no longer measurable at our scale.
Okay, I reproduced a leak on my side. Here are the key logs, with comments:
This leak happens in MainPrioritySetImpl::updateCrossPriorityHostMap, which matches the info in the profile.
For now, you could try the new sub_cluster_config: https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/dynamic_forward_proxy/v3/dynamic_forward_proxy.proto#envoy-v3-api-field-extensions-filters-http-dynamic-forward-proxy-v3-filterconfig-sub-cluster-config
It creates a strict_dns sub-cluster for each host, with TTL enabled, instead of using logical_dns. Here is a simple example that works on my side.
Also, I’ll try to create a PR to fix the leak in the logical_dns implementation.