envoy: Memory leak in dynamic forward proxy mode w/ DNS Cache

Description: We run Envoy in dynamic forward proxy mode. It usually behaves well, but sometimes it gets into a state where it allocates tens of MiB/s and eventually gets OOM-killed (memory-usage screenshot from 2023-11-21 attached).

Full heap profiles show that most of the memory sits inside DnsCacheImpl, even though the cache should in theory be limited to 10000 entries (max_hosts).

As a side note, I’ve also noticed that Envoy still tries to resolve names via the internal search domains (*.svc.cluster.local.), even though no_default_search_domain is set to true.

Heap dumps:

Diff heap profile: profile001 (attached)

Raw profiles: https://gist.github.com/rbtz-openai/13d35aea14013f12273c6aa7478184cb

Admin and Stats Output:

 "version": "b5ca88acee3453c9459474b8f22215796eff4dde/1.28.0/Clean/RELEASE/BoringSSL",

There are a lot of clusters due to dynamic forward proxy mode:

# curl -s localhost:XXX/clusters | fgrep -c hostname
6752
$ curl -s localhost:XXX/stats/prometheus | fgrep dns
# TYPE envoy_dns_cares_get_addr_failure counter
envoy_dns_cares_get_addr_failure{} 310
# TYPE envoy_dns_cares_not_found counter
envoy_dns_cares_not_found{} 178
# TYPE envoy_dns_cares_resolve_total counter
envoy_dns_cares_resolve_total{} 186188
# TYPE envoy_dns_cares_timeouts counter
envoy_dns_cares_timeouts{} 32
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_cache_load counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_cache_load{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_attempt counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_attempt{} 184881
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_failure counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_failure{} 667
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_success counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_success{} 184213
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_timeout counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_query_timeout{} 365
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_rq_pending_overflow counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_dns_rq_pending_overflow{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_added counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_added{} 33253
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_address_changed counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_address_changed{} 90554
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_overflow counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_overflow{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_host_removed counter
envoy_dns_cache_dynamic_forward_proxy_cache_config_host_removed{} 26501
# TYPE envoy_dns_cares_pending_resolutions gauge
envoy_dns_cares_pending_resolutions{} 1
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_open gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_open{} 0
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_remaining gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_circuit_breakers_rq_pending_remaining{} 1024
# TYPE envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts gauge
envoy_dns_cache_dynamic_forward_proxy_cache_config_num_hosts{} 6752

Config:

The interesting part of the config is the dynamic forward proxy cluster with its DNS cache (the same dns_cache_config is used in http_filters; a sketch of that side follows the cluster config below):

    cluster_type:
      name: envoy.clusters.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
        dns_cache_config:
          name: dynamic_forward_proxy_cache_config
          max_hosts: 10000
          dns_lookup_family: V4_ONLY
          typed_dns_resolver_config:
            name: envoy.network.dns_resolver.cares
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
              resolvers:
                - socket_address:
                    address: 8.8.8.8
                    port_value: 53
                - socket_address:
                    address: 1.1.1.1
                    port_value: 53
                - socket_address:
                    address: 8.8.4.4
                    port_value: 53
                - socket_address:
                    address: 1.0.0.1
                    port_value: 53
              dns_resolver_options:
                use_tcp_for_dns_lookups: true
                # There is no need to use the default search domain when resolving external requests
                no_default_search_domain: true
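
For completeness, a minimal sketch of what the matching http_filters entry presumably looks like, assuming it reuses the same dns_cache_config verbatim (the filter and the cluster must reference the same cache name to share a single DNS cache; the resolver block is elided to avoid repeating the c-ares config above):

    http_filters:
    - name: envoy.filters.http.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
        dns_cache_config:
          # Same name and settings as the cluster's dns_cache_config above,
          # so the filter and the cluster share one DnsCacheImpl.
          name: dynamic_forward_proxy_cache_config
          max_hosts: 10000
          dns_lookup_family: V4_ONLY
          # typed_dns_resolver_config omitted; it would repeat the c-ares
          # resolver block from the cluster config above.
    - name: envoy.filters.http.router
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router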

cc: @alyssawilk @euroelessar

About this issue

  • State: closed
  • Created 7 months ago
  • Reactions: 1
  • Comments: 16 (9 by maintainers)

Most upvoted comments

Sorry for the late reply. Yeah, after removing brotli, the DFP leak, if it exists, is not measurable at our scale.

Okay, I reproduced the leak on my side. Here are the key logs, with comments:

# ip changed, host_info_ is also changed
[2023-12-08 22:36:00.952][83629][debug][forward_proxy] [source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc:404] host 'test.service:80' address has changed from 210.209.160.193:80 to 166.137.92.127:80

# expired, removing the new host_info_, the old address is leaking
[2023-12-08 22:36:05.952][83629][debug][forward_proxy] [source/extensions/common/dynamic_forward_proxy/dns_cache_impl.cc:262] host='test.service:80' TTL expired, removing

The leak happens in MainPrioritySetImpl::updateCrossPriorityHostMap, which matches the info in the profile.
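
The five-second gap between the address change and the removal suggests the reproduction used a very short refresh interval and host TTL so that hosts churn quickly. Below is a minimal sketch of such a cache config; dns_refresh_rate and host_ttl are real DnsCacheConfig fields, but the exact values and the local test resolver on 127.0.0.1:5300 (returning a different A record on each query) are assumptions for illustration, not taken from the actual reproduction:

dns_cache_config:
  name: dynamic_forward_proxy_cache_config
  dns_lookup_family: V4_ONLY
  # Re-resolve frequently so the returned address changes often
  # (illustrative value, not from the original report).
  dns_refresh_rate: 1s
  # Expire idle hosts quickly; matches the ~5s gap seen in the logs above.
  host_ttl: 5s
  typed_dns_resolver_config:
    name: envoy.network.dns_resolver.cares
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
      resolvers:
      # Assumed local test DNS server that rotates the returned A record.
      - socket_address:
          address: 127.0.0.1
          port_value: 5300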

For now, you could try the new sub_cluster_config (https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/dynamic_forward_proxy/v3/dynamic_forward_proxy.proto#envoy-v3-api-field-extensions-filters-http-dynamic-forward-proxy-v3-filterconfig-sub-cluster-config). It creates a strict_dns sub cluster for each host, with a TTL, instead of using logical_dns.

Here is a simple example that works on my side.

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 7010
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: dynamic_forward_proxy_cluster
          http_filters:
          - name: envoy.filters.http.dynamic_forward_proxy
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
              sub_cluster_config:
                cluster_init_timeout: 10s
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: dynamic_forward_proxy_cluster
    connect_timeout: 1s
    lb_policy: CLUSTER_PROVIDED
    dns_resolvers:
    - socket_address:
        address: 127.0.0.1
        port_value: 5300
    cluster_type:
      name: envoy.clusters.dynamic_forward_proxy
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
        sub_clusters_config:
          lb_policy: ROUND_ROBIN
          max_sub_clusters: 100
          sub_cluster_ttl: 10s

Also, I’ll try to create a PR to fix the leak in the logical_dns implementation.