moby: Docker Swarm node starts routing requests to wrong containers

Output of docker version:

Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:22:43 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:22:43 2016
 OS/Arch:      linux/amd64

Output of docker info:

Containers: 33
 Running: 3
 Paused: 0
 Stopped: 30
Images: 15
Server Version: 1.12.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 277
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay null host
Swarm: active
 NodeID: 16e15nh1rf84uf6k5h6czxmp9
 Is Manager: true
 ClusterID: 9du6skrzpon6x3dfv8kdh5ygh
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.0.161
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-68-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.337 GiB
Name: ip-10-0-0-161
ID: AZKE:RARU:VGAZ:R6VQ:FWGS:DPNC:REID:QM2P:XRNO:CEL3:7K2C:FV2P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: **********
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8

Additional environment details (AWS, VirtualBox, physical, etc.): Running a two-node swarm on AWS: 1 manager, 1 worker.

Steps to reproduce the issue:

  1. There are 2 distinct web services running in the swarm. One service (let’s call it foo) runs its application on port 3031, the other (bar) on 4000. Each publishes that same port on the host machine. Both are created without issue with replicas=5, so a mix of tasks from both services runs on both nodes (the create commands are sketched just after this list).
  2. After some time, requests intended for foo are actually routed to bar, even though the two services run on different container ports and are published on different swarm ports.
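
For reference, the services are created with commands roughly like the following (the image names here are placeholders, not the real ones):

  docker service create --name foo --replicas 5 --publish 3031:3031 foo-image
  docker service create --name bar --replicas 5 --publish 4000:4000 bar-image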

Describe the results you received: After some time, requests intended for foo are actually routed to bar, even though the two services run on different container ports and are published on different swarm ports.

Describe the results you expected: Requests to the swarm on port 4000 should route to a bar container, and requests on port 3031 should route to a foo container.

Additional information you deem important (e.g. issue happens only occasionally): This only happens after some time, meaning that anywhere from 5-10 updates to that service have occurred over 2-30 hours (updates are done to push new images into the wild).

Both services are attached to the ingress network, they do not have any other networks attached.

This tends to happen only on isolated nodes in the swarm, not across the entire swarm. It also always seems to involve a specific container that requests get routed to on that node. For example, the problem always leads to the following diagnosis:

  • Attempt to hit foo (3031) using cURL on node B (curl http://127.0.0.1:3031/foo-request)
  • Each request to this node seems to be load balanced to a container for bar (logs clearly show a bar container on node B receiving the request for http://127.0.0.1:3031/foo-request; a concrete sketch follows this list).
  • The node that routes foo requests to bar definitely has foo containers up and running.
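
Concretely, the check on node B looks something like this (the container ID is whatever docker ps reports for the bar task on that node):

  # on node B: hit the foo port repeatedly, then tail the bar task’s logs
  for i in $(seq 1 10); do curl -s http://127.0.0.1:3031/foo-request; done
  docker ps --filter name=bar
  docker logs --tail 20 <bar-container-id>   # the /foo-request hits show up in bar’s logs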

Running docker service rm and then creating the service again, and letting it sit, gets things going for a short period of time, but nodes always get “confused” again over time.

The swarm logs occasionally show this message, but it doesn’t correlate with the issue: time="2016-08-22T08:33:35.667874882Z" level=warning msg="2016/08/22 08:33:35 [WARN] memberlist: Was able to reach ip-10-0-0-161 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
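
Given that warning, one rough sanity check is to probe the ports 1.12 swarm mode uses (2377/tcp for management, 7946/tcp+udp for node gossip, 4789/udp for the overlay) from the other node against the manager address; note the UDP results from nc are only a weak signal, since UDP is connectionless:

  nc -z -w1 10.0.0.161 2377 && echo "2377/tcp reachable"
  nc -z -w1 10.0.0.161 7946 && echo "7946/tcp reachable"
  nc -zu -w1 10.0.0.161 7946 && echo "7946/udp (weak signal)"
  nc -zu -w1 10.0.0.161 4789 && echo "4789/udp (weak signal)"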

Is there a way to look at the routing table a node uses for the ingress network? I can’t find a documented way to inspect it to see whether routes stop being updated after a period of time. The closest guess I have is sketched below.
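
The only lead so far is that the ingress load balancing appears to be done with iptables marks plus IPVS inside a hidden per-node network namespace, so perhaps something along these lines would expose it (the ingress_sbox path is a guess from poking around /var/run/docker/netns, not a documented interface, and ipvsadm has to be installed on the host):

  sudo ls /var/run/docker/netns
  sudo nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -nL
  sudo nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -Ln   # requires ipvsadm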

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

@mu5h3r While simulating a network partitioning scenario to see how the cluster recovers, I found issues in the gossip cluster recovery and pushed a fix in docker/libnetwork#1446, which should fix the recovery part.

@dangra I don’t think what you are experiencing in #26590 is the same issue as this.