libnetwork: Cannot connect to a container in an overlay network from a different swarm node: `could not resolve peer \"\": timed out resolving peer by querying the cluster`

Description of problem:

Very rarely (observed twice after creating/starting thousands of containers) we start a new container in an overlay network in a Docker Swarm cluster and existing containers in the overlay network that are on different nodes cannot connect to the new container. However, containers in the overlay network on the same node as the new container are able to connect.

The new container receives an IP address in the overlay network subnet, but that address does not seem to be reachable from containers on a different node.

The second time this happened we fixed the problem by stopping and starting the new container.

We haven’t found a way to reliably reproduce this problem. Is there any other debugging information I can provide that would help diagnose this issue?

The error message is the same as the one reported on https://github.com/docker/libnetwork/issues/617.

docker version:

Client:
 Version:      1.10.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   590d5108
 Built:        Thu Feb  4 19:04:33 2016
 OS/Arch:      linux/amd64

Server:
 Version:      swarm/1.1.0
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   a0fd82b
 Built:        Thu Feb  4 08:55:18 UTC 2016
 OS/Arch:      linux/amd64

docker info:

Containers: 102
 Running: 53
 Paused: 0
 Stopped: 49
Images: 372
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 3
 glera.int.corefiling.com: 10.0.0.57:2375
  └ Status: Healthy
  └ Containers: 32
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 32.94 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:16Z
 kafue.int.corefiling.com: 10.0.0.17:2375
  └ Status: Healthy
  └ Containers: 36
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 16.4 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:20Z
 paar.int.corefiling.com: 10.0.1.1:2375
  └ Status: Healthy
  └ Containers: 34
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 0 B / 16.44 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=4.2.6-201.fc22.x86_64, operatingsystem=Fedora 22 (Twenty Two), storagedriver=devicemapper
  └ Error: (none)
  └ UpdatedAt: 2016-02-22T11:20:31Z
Plugins:
 Volume:
 Network:
Kernel Version: 4.2.6-201.fc22.x86_64
Operating System: linux
Architecture: amd64
CPUs: 12
Total Memory: 65.77 GiB
Name: 9dd94ffb6aea

uname -a:

Linux glera.int.corefiling.com 4.2.6-201.fc22.x86_64 #1 SMP Tue Nov 24 18:42:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Environment details (AWS, VirtualBox, physical, etc.):

Physical - docker swarm cluster.

How reproducible:

Rare - it has happened twice after creating/starting thousands of containers.

Steps to Reproduce:

  1. Create/start a container in an overlay network.
  2. In the same overlay network, create/start a container on a different host in the swarm. A process in the container listens on port 80 and that port is exposed to the overlay network.
  3. Try to connect to the container of step 2 from within the container of step 1 with an HTTP client (example commands below).
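
As an illustration, the setup looks roughly like this; the network name (mynet), container names and images are examples, and we assume the Docker CLI is pointed at the node indicated in each comment (or at the Swarm manager with placement constraints):

# Create the overlay network (backed by the Consul KV store).
docker network create -d overlay mynet
# On node A: start a container in the overlay network.
docker run -d --name client --net mynet busybox sleep 86400
# On node B: start a container that serves HTTP on port 80 in the same network.
docker run -d --name server --net mynet nginx
# From the container on node A, try to reach the new container on node B.
docker exec client wget -qO- http://server/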

Actual Results:

The connection times out. For example, with the Go HTTP client:

 http: proxy error: dial tcp 10.158.0.60:80: i/o timeout

10.158.0.60 is the address of the container in step 2 in the overlay network subnet.
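
For reference, the overlay address assigned to a container can be confirmed from the network itself; the network name below is just an example:

docker network inspect mynet
# the "Containers" section of the output lists each attached container with its IPv4Address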

The docker logs on the swarm node that launched the container in step 2 contain (from journalctl -u docker):

level=error msg="could not resolve peer \"<nil>\": timed out resolving peer by querying the cluster".

We see a line like this for each failed request between the containers.
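
A quick way to watch for these failures on the node that launched the new container is to follow the daemon log and filter for the message, for example:

journalctl -u docker -f | grep "could not resolve peer"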

When we make the same request from a container in the overlay network on the same swarm node as the container running the HTTP server, the connection is established and a response is received.

Expected Results:

The HTTP client receives a response from the container it is trying to connect to.

Additional info:

The second time this occurred we fixed the problem by stopping and starting the container running the HTTP server.

We are using Consul as the KV store for the overlay network and Swarm.
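
For reference, each daemon is started with cluster-store flags roughly along these lines (the Consul address and interface are placeholders, not our exact configuration):

docker daemon \
  --cluster-store=consul://<consul-host>:8500 \
  --cluster-advertise=eth0:2376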

When removing the container that cannot be connected to, docker logs (journalctl -u docker) contain the line:

error msg="Peer delete failed in the driver: could not delete fdb entry into the sandbox: could not delete neighbor entry: no such file or directory\n"

The docker log lines are emitted by https://github.com/docker/libnetwork/blob/master/drivers/overlay/ov_serf.go#L180. I can’t find an existing issue tracking this.

Most upvoted comments

As mentioned above, we have been working around this issue by periodically running ping from the container that cannot be reached, back in the opposite direction. In our case an nginx container often cannot connect to a backend container to proxy HTTP requests. All of these containers are on the same frontend network. We set up a Jenkins job that runs every 5 minutes (it could be a cron job too) that finds all of the containers on the frontend network and execs into them to ping the nginx container:

# Point the Docker CLI at the Swarm manager.
export DOCKER_HOST=swarm-manager:3375
# Names of every container on the frontend network, excluding the network itself and nginx.
CONTAINERS=$(docker network inspect frontend | grep "\"Name\":" | grep -v "\"frontend\"" | grep -v "\"nginx\"" | sort | cut -d "\"" -f 4)
# Ping the nginx container once from each of them to keep the overlay connectivity alive.
for CONTAINER in $CONTAINERS; do
  docker exec -i "${CONTAINER}" ping -c 1 nginx
done

This seems to keep the VXLAN working (and/or resolve the issue when it does happen) without having to recreate containers or restart anything.

@tfasz @antoinetran this behavior was fixed by https://github.com/docker/libnetwork/pull/1792. The fix will be available in 17.06 and will also be backported to 17.03.

Dear all,

We migrated 3 Swarm clusters to docker-ce-17.06.0 two weeks ago, and it seemed to work fine until now. I have just reproduced this error once. I had to ping back to restore connectivity. But the error does seem to be rarer now.

Is there any info or log anyone would like me to provide?

We have the same issue in a Docker swarm with thousands of containers. On each of the Docker nodes (1.10.3) we have an nginx container which can communicate with different application containers in the overlay network (using Consul). Sometimes one of the nginx containers cannot connect to an app container on a different node, and we see the same message in the log:

ERRO[5758943] could not resolve peer "<nil>": timed out resolving peer by querying the cluster

Additionally, we are seeing that the app containers which fail always have the same IP addresses. Restarting the app container does not help if the container receives the same IP address again. What works for us is:

  • Stop Docker (on the node where the nginx container is running), clean iptables, start Docker
  • Clean the network namespaces on the Docker node where the nginx container is running (a rough sketch is shown after this list). This works, but not immediately
  • After some hours the IP address starts to answer, for no apparent reason
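
What we mean by cleaning the network namespaces is roughly the following sketch; it assumes the stale overlay namespaces live under /var/run/docker/netns (this path may differ on other setups) and that stopping the daemon on that node is acceptable:

systemctl stop docker
# Stale overlay namespaces are bind-mounted under this directory.
for ns in /var/run/docker/netns/*; do
  umount "$ns" 2>/dev/null
  rm -f "$ns"
done
systemctl start docker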

We tried to trace where the packets go, without success. There is no activity on the remote node, so we think the packets are not leaving the node where the nginx container is running.