envoy: dns: DnsResolverImpl keeps using a "broken" c-ares channel

We have a STRICT_DNS cluster defined in the bootstrap config. In one of our test Pods, the membership count of this cluster dropped to zero. That is understandable, since DNS resolution might have returned zero hosts. However, it stayed at zero for quite a long time, and after killing the container Envoy was able to resolve the DNS successfully.

I captured debug logs while Envoy was unable to resolve. I see the following line:

source/common/network/dns_impl.cc:118] DNS request timed out 4 times

and then this line repeated (seven times at 0 milliseconds, followed by one at 22 milliseconds):

source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 0 milliseconds
source/common/network/dns_impl.cc:147] Setting DNS resolution timer for 22 milliseconds

So at this point it is not clear to me whether this is an Envoy issue or a container DNS issue, since restarting the container resolved it. Has anyone seen similar DNS issues? And a second question: is the DNS resolution timer behaviour correct, in the sense that it repeatedly sets a timer for 0 milliseconds?

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 30 (19 by maintainers)

Most upvoted comments

Did some late night digging yesterday and arrived at an explanation:

When c-ares initializes a channel (trimming irrelevant details):

  1. In ares_init_options, it populates the list of DNS servers it will use to resolve queries, drawing from several sources.
  2. One of the functions is init_by_resolv_conf which has platform specific code.
  3. For iOS it falls into #elif defined(CARES_USE_LIBRESOLV) which uses res_getservers to get the addresses of DNS servers.
  4. In the absence of connectivity, res_getservers returns AF_UNSPEC for the server address’s family.
  5. That means the channel’s only server is then populated by init_by_defaults, which uses INADDR_LOOPBACK:NAMESERVER_PORT as the server’s address. There is obviously no guarantee that a DNS server is running on loopback, and on the phone there definitely is not. In addition, once a channel has been initialized it never re-resolves its server set, so even when connectivity is regained, the channel still has only the one default server.

Solution:

  1. Patch c-ares to “reinitialize” a channel based on certain conditions. After I understood the problem I dug through c-ares to see if this functionality was already available. It is not. However, there was a PR https://github.com/c-ares/c-ares/pull/272 that attempted to do this, albeit for only one platform, and on only one public function. That work could be finished in order to solve this issue. Opened an issue to track: https://github.com/c-ares/c-ares/issues/301
  2. In Envoy’s DnsResolverImpl detect when it is likely that DNS resolution is failing due to a “busted” channel and recreate it. https://github.com/envoyproxy/envoy/pull/9899

Envoy Mobile has the same issue in iOS.

Steps to repro: From an Envoy Mobile clone

  1. Build the iOS library: bazel build --config=ios //:ios_dist
  2. Turn laptop’s wifi off.
  3. Run the iOS example app: bazel run //examples/swift/hello_world:app --config=ios
  4. Envoy will start. DNS resolution will happen but the response will be empty.
  5. Turn wifi back on.
  6. Even after 5+ minutes DNS resolution still returns an empty response.

Config used: This reproduces with clusters using both STRICT and LOGICAL DNS, as well as with the dynamic forward proxy. The DNS refresh rate was configured to 5s.
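For reference, a minimal STRICT_DNS cluster matching that description might look like the following (the cluster name and endpoint address are placeholders, not the actual config used):

```yaml
# Illustrative Envoy cluster with a 5s DNS refresh rate.
clusters:
- name: upstream_service
  type: STRICT_DNS
  connect_timeout: 1s
  dns_refresh_rate: 5s   # re-resolve every 5 seconds
  lb_policy: ROUND_ROBIN
  load_assignment:
    cluster_name: upstream_service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: upstream.example.com
              port_value: 443
```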

I am going to look into this issue, as the setup above reproduces it 100% of the time.