rancher: Dnsmasq running on the host causes the Rancher DNS container to fail recursive DNS queries

Network services weren’t coming up, and “timeout getting IP” is a generic error message. It turns out dnsmasq was running on the agent host and bound to port 53.

metadata-1 logs

8/2/2017 7:21:33 PMtime="2017-08-03T02:21:33Z" level=info msg="Subscribing to events"
8/2/2017 7:21:33 PMtime="2017-08-03T02:21:33Z" level=fatal msg="Failed to subscribeGet https://<myRancherURI>/v2-beta: dial tcp: lookup <myRancherURI> on 127.0.0.1:53: read udp 127.0.0.1:38353->127.0.0.1:53: read: connection refused"
8/2/2017 7:21:38 PMtime="2017-08-03T02:21:38Z" level=info msg="Subscribing to events"
8/2/2017 7:21:38 PMtime="2017-08-03T02:21:38Z" level=fatal msg="Failed to subscribeGet https://<myRancherURI>/v2-beta: dial tcp: lookup <myRancherURI> on 127.0.0.1:53: read udp 127.0.0.1:57495->127.0.0.1:53: read: connection refused"

metadata-dns logs were empty

It seems DNS binds ports 6060, 80 and 53; we should also document those in the agent host docs.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 24 (18 by maintainers)

Most upvoted comments

In order to skip the default check, you’ll need to run the docker run rancher/agent command with -e CATTLE_CHECK_NAMESERVER=false

@aiwantaozi

We cannot set any arbitrary DNS for the customer.

Let’s take a look at what we have now.

What we observed:

  1. No DNSMASQ: no problem at all.
  2. DNSMASQ started after the Rancher DNS/Metadata service: no problem.
  3. DNSMASQ started before the Rancher DNS/Metadata service, with the Rancher server configured using an IP address as its endpoint: no issue.
  4. DNSMASQ started before the Rancher DNS/Metadata service, with the Rancher server configured using a DNS name as its endpoint: the metadata service reports an error.

Analysis:

  1. The issue is caused by the DNSMASQ service on the host, but not by a port conflict: every port our DNS service listens on is inside the container, not on the host.
  2. The actual cause is that DNSMASQ on the host changes /etc/resolv.conf. While the DNSMASQ server is running, /etc/resolv.conf is modified to point to 127.0.0.1.
  3. When the rancher-dns service starts, it picks up the host’s /etc/resolv.conf and uses it inside the container. This is standard Docker container behavior (unless something, such as Rancher, updates the file later). The container therefore queries 127.0.0.1 inside its own network namespace, which doesn’t work because the host DNS server lives in another network namespace.
  4. Interestingly, 127.0.0.1 can also be rancher-dns itself if it listens on all IPs. I think the listening-address behavior changed in 1.6.10.

Solution:

We can prohibit using 127.0.0.1 in the host /etc/resolv.conf by checking it when rancher-dns starts.
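To illustrate what such a check could look like, here is a minimal Python sketch. It is not the actual Rancher implementation; the function name loopback_nameservers is hypothetical, and the sketch only assumes the usual resolv.conf layout of "nameserver <address>" lines. It flags any nameserver entry that is a loopback address (127.0.0.0/8 or ::1), since such an address would be unreachable from inside the container's network namespace.

```python
import ipaddress


def loopback_nameservers(resolv_conf_text):
    """Return nameserver entries in resolv.conf text that are loopback addresses.

    Hypothetical helper for illustration only, not part of Rancher.
    """
    found = []
    for line in resolv_conf_text.splitlines():
        # resolv.conf comments start with '#' or ';'; drop them first.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            try:
                addr = ipaddress.ip_address(fields[1])
            except ValueError:
                continue  # ignore entries that are not valid IP addresses
            if addr.is_loopback:
                found.append(fields[1])
    return found


# Example: the situation described above, where dnsmasq points the host at itself.
sample = "nameserver 127.0.0.1\n"
print(loopback_nameservers(sample))  # ['127.0.0.1']
```

A startup check could read the host's /etc/resolv.conf, pass its contents to a function like this, and refuse to start (or warn) if the returned list is non-empty.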

If the user wants to run DNSMASQ on the host, it seems we can recommend adding DNSMASQ_EXCEPT=lo to /etc/default/dnsmasq, which keeps DNSMASQ from making itself the only DNS server on the host. Reference: https://superuser.com/questions/894513/resolv-conf-keeps-getting-overwritten-when-dnsmasq-is-restarted-breaking-dnsmas