moby: Wrong behaviour of the DNS resolver within Swarm Mode overlay network

I’m trying to set up a MongoDB sharded cluster with 2 shards, each shard being a replica set (size=2). I have one mongos router and one replica set (size=2) of config DBs.

I was getting plenty of errors about chunk migration, and after digging I figured out that the target host was sometimes reachable and sometimes not. But the containers had not crashed, which was strange.

After digging deeper I figured out that the IP addresses returned by the DNS resolution were not correct.

Please note that each service is running in dnsrr mode.

Steps to reproduce the issue (the exact behaviour is hard to trigger reliably; a command sketch follows the list):

  1. docker network create -d overlay public
  2. Create a service within the network on different machines
  3. Make the containers crash and let them restart; repeat multiple times
  4. Scale the service to 0
  5. Place a container within the same network and nslookup the service you created earlier
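
For reference, a minimal command sketch of the steps above (the service name, image, and replica count are placeholders for illustration, not taken from the original setup):

docker network create -d overlay --attachable public    # --attachable needs Docker 1.13+; on 1.12 use a one-off service instead
docker service create --name mongo-shard-rs0-2 --network public --endpoint-mode dnsrr --replicas 2 mongo
# make the tasks crash and restart a few times, then:
docker service scale mongo-shard-rs0-2=0
# resolve the name from a standalone container attached to the same network:
docker run --rm --network public busybox nslookup mongo-shard-rs0-2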

Describe the results you received:

nslookup mongo-shard-rs0-2
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	mongo-shard-rs0-2
Address: 10.0.0.2
Name:	mongo-shard-rs0-2
Address: 10.0.0.5
Name:	mongo-shard-rs0-2
Address: 10.0.0.4

Describe the results you expected:

** server can't find mongo-shard-rs0-2: NXDOMAIN

Additional information you deem important (e.g. issue happens only occasionally): This is a random issue. If I restart the machine then it works again; it seems the cache is poisoned or something.
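
As a sanity check (a sketch, assuming the service name from the lookup above), you can confirm that no tasks are left while the resolver still answers:

docker service ps mongo-shard-rs0-2    # no running tasks after scaling to 0
docker network inspect public          # run on each node: no containers of this service should be listed
# yet nslookup mongo-shard-rs0-2 from a container on the network still returns the stale addresses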

Output of docker version:

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 2
 Running: 1
 Paused: 0
 Stopped: 1
Images: 6
Server Version: 1.12.6
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: overlay null bridge host
Swarm: active
 NodeID: 0466d9iww2nsv3crnj2yd6vk4
 Is Manager: false
 Node Address: 172.16.1.13
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-59-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 6.65 GiB
Name: worker2
ID: ZCIH:LHBN:PVQN:O75R:CVMI:R2PK:35O6:ENF6:UTBN:AY2K:IJZQ:DEQK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
 provider=generic
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.): 4 machines running Ubuntu within OpenStack: 1 manager (drained, the leader), 2 workers for the mongod shards and replicas, and 1 worker which runs mongos.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 9
  • Comments: 74 (33 by maintainers)

Most upvoted comments

I am experiencing this issue on a swarm hosting multiple service stacks. Occasionally, after removing stacks, containers crashing, or services being scaled down and then back up, DNS resolution inside a container for another service will return additional incorrect results. This completely hoses our setup when it happens to the service hosting our reverse proxy, as requests are proxied to incorrect addresses.

Our swarm is running 1.13.1. Each service has certain containers that connect to a “public” overlay network, which is also the network our proxy service is connected to. It’s within this overlay network that I see this error occurring.

What I typically see is that a service is running at an IP address, say, 10.0.0.3, and then it gets moved (after being scaled or redeployed) to another IP address, like 10.0.0.12. However, DNS lookup on this service (nslookup stack_servicename) still returns the old IP address in addition to the new one.
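
One way to spot such a stale record (a sketch; the service name comes from the example above, and the inspect field names may differ slightly between engine versions) is to compare the resolver’s answers with the tasks that actually exist:

# inside any container attached to the overlay:
nslookup stack_servicename
nslookup tasks.stack_servicename
# on a manager, list what should actually be running and its addresses:
docker service ps stack_servicename --filter desired-state=running
docker inspect --format '{{range .NetworksAttachments}}{{.Addresses}} {{end}}' <task-id>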

I see a similar issue. I’m running 4 nodes (1 manager and 3 workers). When there’s load on a node and for some reason services start crashing, it causes a chain reaction of other services restarting, which moves them from one swarm node to another and back. I noticed that new containers (created ~3 minutes ago) can’t resolve other containers after they’ve been moved. I tried to resolve the name of a container which was not moved (or restarted) and has a placement constraint (so there’s no way this container was restarted and the DNS record is missing because it was deleted during shutdown). I can’t reproduce it in a synthetic environment just to provide an example, though I see this each time I try to run my services.

# docker version
Client:
 Version:      17.04.0-ce
 API version:  1.28
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 17:45:49 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.04.0-ce
 API version:  1.28 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 17:45:49 2017
 OS/Arch:      linux/amd64
 Experimental: true

UPD: It’s not really hard to reproduce. Try to deploy using the following docker-compose file: https://gist.github.com/velimir0xff/28da8e16e01475b2a95f9ac74c069aa0

Sorry that I have to add a comment to this thread, but the problem is not solved yet.

I use Docker 18.09.0, in a swarm with 4 manager nodes on Ubuntu 16.04 on DigitalOcean droplets. In most cases the swarm behaves fine. Sometimes, after several docker service rm <service_name> and docker stack deploy … commands, the internal DNS answers with two service IPs for one service. The services are connected to one overlay network.

One of the IPs belongs to an old, no-longer-existing service instance; the other is the correct IP of the currently healthy service. The services are always available as VIP services. The reverse proxy, an nginx container, tries to access both: one fails, the other succeeds.

To resolve this situation, I found no way other than removing the overlay network to reset the internal DNS. I tried docker service update --force, docker service rm …, docker stack rm, docker service scale service=0, and so on.

Perhaps there is a race condition when updating the internal entries for service endpoints. But there should be a better solution than shutting down all stacks, removing and recreating the network, and starting the stacks again.

A docker network reinit-dns command would be a way to resolve these issues. I had assumed that docker stack rm … would clean up all internal DNS entries.
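
For completeness, the heavy-handed reset described above looks roughly like this (stack and network names are placeholders):

docker stack rm mystack                        # stop everything that uses the network
docker network rm my_overlay                   # removing the overlay drops its internal DNS state
docker network create -d overlay my_overlay
docker stack deploy -c docker-compose.yml mystack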

Thanks

@remingtonc the advantage of using the VIP instead of the container IP is that with the VIP you don’t care how many instances of the service are running behind it; it can be 1 or 10, but your application will reach the service using the same IP. You can also choose to use DNS RR mode instead of the VIP, and basically every time you resolve the service name you will get the list of containers ordered in round-robin fashion. This means you have to be aware that if the container you are talking to goes down, you will need to do another DNS resolution and make sure the previous results did not get cached.

From a debugging point of view, the tools to use are:

  1. tcpdump
  2. ipvsadm --> the load balancing is done with IPVS

If you want the list of task IPs you need to query tasks.<service_name>, so it will be nslookup tasks.web.
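
For example, assuming a service named web:

nslookup web          # VIP mode: returns the single virtual IP of the service
nslookup tasks.web    # returns the individual task IPs, rotated round-robin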

@fcrisciani yes, it is 100% reproducible. I did it twice just now.

First I removed all services, containers, and networks, and ran docker system prune -a -f

Then I created a network, a service, and a container:

docker network create --driver overlay --subnet 10.10.1.0/24 --opt encrypted --attachable services
docker service create --name test1 --constraint node.hostname==CURRENT_HOST --network services nginx
docker run --restart always --name test2 --privileged --network services -d nginx

Then, in container test1, I ran nslookup: docker exec -it test1.xxxxxxx /bin/bash

apt update && apt install -y dnsutils
nslookup test2

Got this:

root@e4a65d8b45c8:/# nslookup test2
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   test2
Address: 10.10.1.8

After test2 container restart: docker container restart test2

Tried nslookup again and got this:

root@e4a65d8b45c8:/# nslookup test2
Server:         127.0.0.11
Address:        127.0.0.11#53

Non-authoritative answer:
Name:   test2
Address: 10.10.1.9
Name:   test2
Address: 10.10.1.8

My cluster has 5 nodes, all managers.

@zzvara thanks for sharing a possible workaround. I would also add the following: to help recover from this situation we created https://github.com/docker/libnetwork/blob/master/cmd/diagnostic/README.md, a tool that allows removing entries directly from the database.

Also, I suggest moving to the next stable release; the diff since 17.09 is pretty big.
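
A sketch of how the diagnostic tool is queried once the network diagnostic server is enabled on the daemon (see the linked README for how to enable it; the port and network ID below are placeholders):

# dump the endpoint table that backs the embedded DNS for a given overlay network
curl 'localhost:<diag-port>/gettable?tname=endpoint_table&nid=<network-id>'
# append &unsafe to print the raw values as strings, or use the clientDiagnostic utility to decode them properly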

I seem to have a similar/same issue.

When restoring a database (on 3 mongo containers in a replica set over 3 managers… for what it matters) the host/manager becomes unavailable. (Docker for AWS with 3 t2.medium managers, no workers.) While the restore is in progress I can barely SSH into the manager.

Problem persists with 17.06-rc4.

What I’ve noticed is that the problem seems to happen only when I deploy a second, identical stack (obviously under a different name) and run a mongorestore on the second stack. Initially I thought it would be some kind of conflict between the two stacks, but my understanding is that they are completely isolated. Is that correct?

Possibly related to #32841

@remingtonc the issue tracked here is a completely different one. In this issue, attachable containers were not being deleted from the DNS resolver. I went through your message and it is not clear to me what the expectation is from a resolver point of view. 10.0.0.5 does not correspond to any running container, so when the resolution fails on the local name server the query is propagated to the external one. On the other side, 10.0.0.6 is resolved to one of the running containers. If at the end of the diagnostic tool URL you add the keyword unsafe, like curl 'localhost:50015/gettable?tname=endpoint_table\&nid=55f0y5igh52g0qiuy9bi1i2uc&unsafe', you will get a string version of the data structure so that you can look at the field values. It’s called unsafe because it just prints the binary as a string, so some characters may not be printable. We also have another utility, clientDiagnostic, that does the proper deserialization of the data structure if needed.

thanks @denis-isaev, this is super useful. I confirmed the behavior, and it looks like this PR (https://github.com/docker/libnetwork/pull/2176) did not get properly ported to the 18.03.1-ce code base… I will follow up on that internally. I also checked recent releases like 18.04/18.05; they have the same issue (patch missing). The nightly packages have the proper fix if you cannot wait for 18.06, which should come soon. Sorry for the trouble…

We are using the CoreOS stable channel with Docker version 17.09.0-ce, build afdb6d4. Docker has a long list of weaknesses when something unexpected happens during engine startup or when multiple containers are managed at the same time. Most of these issues can take down a single node, forcing operators to remove the node from the swarm. The problem mentioned here might take down services on the same network completely, causing cluster-wide problems. The DNS registry gets multiple entries for the same name, and the resolver behaves randomly.

We solved it by manually looking up containers that had been registered multiple times and disconnecting them from the network. However, one DNS entry still remained for each container. We then reconnected the container to the network, but specified the IP address of the “stuck” DNS entry for that name, using the --ip flag on docker network connect. This way we did not have to take down the whole network and its services in order to repair it.
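
A sketch of that workaround with placeholder names (the pinned address is whatever the stuck DNS entry still points to):

docker network disconnect -f my_overlay my_container              # force-remove the duplicated endpoint
docker network connect --ip <stuck-ip> my_overlay my_container    # reconnect, reusing the stuck address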

@danielmrosa ok, if the nslookup returned 2 VIPs then yes, that is possible. What can happen there is that on leader election the service gets assigned a new VIP, different from the original one. The reason is that during state reconstruction on the leader, because of the missing EndpointSpec, instead of marking the original VIP as in use, the logic tries to release it, so you should see something like failed to deallocate VIP in the manager logs. Once you do a service update, I believe the service gets a potentially new VIP; for that reason you find 2 VIPs in the nslookup and the platform has lost track of the original one. If this is the case, you should be good to go with the fix. The reconstruction logic will take place correctly even if EndpointSpec is missing.

Hi @fcrisciani, sorry not to have repro steps, just some information about our development scenario. We have 3 manager nodes (availability drain) and 5 worker nodes. We have Jenkins doing many service updates all the time in this environment, and to be honest it’s not a stable environment, because heavy loads and network instability can occur. Yesterday I figured out that one container was not working properly, and after some investigation I saw two entries in the DNS: one record pointing to the right VIP and another pointing to a nonexistent VIP (I saw this by doing a service inspect on all tasks), and this container was created with an empty EndpointSpec parameter. After a stack rm, one IP got stuck in the DNS database, and it seems there is no way to fix it other than rebooting the manager nodes.
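
For anyone checking for this, the VIPs a service currently holds can be listed with a one-liner (a sketch; format fields may vary slightly between versions):

docker service inspect --format '{{json .Endpoint.VirtualIPs}}' <service-name>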

Yes, you’re right, the fix is related to IP overlap and not to extra entries in the DNS. We can now monitor whether this endpoint_mode parameter avoids that too.

Thanks a lot for your usual support. As soon as I have more information or repro steps, I will keep you informed.

@thaJeztah and @fcrisciani I see that progress has been made on https://github.com/docker/libnetwork/issues/1934 which, judging from the description, “smells” a bit like what could be our issue. Basically, already “taken” IPs are being handed out, resulting in multiple services (load balancers) with the same IP.

I see that this patch is “stuck” as it needs review. If this is the right patch, how long should we expect before it goes into the CE edition?

The issue we’re seeing is causing a lot of noise in our development and production environments. Services get their IPs mixed up on a daily/hourly basis, and the only remedy so far is either to downscale services to 0 and then back up (which works sometimes) or to reboot entire Docker hosts.
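
That downscale/upscale remedy, as a sketch (service name and replica count are placeholders):

docker service scale my_service=0    # drop all tasks so their DNS/LB entries are released
docker service scale my_service=3    # scale back up; fresh entries are usually re-registered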

I’m editing this post again as we’re seeing this issue more and more. We’re seeing duplicate IPs for load balancers representing services that run on the same port. That is, it’s always load balancers on the same port within the same network that get mixed up. This seems to be the rule.