moby: Docker Swarm node starts routing requests to wrong containers
Output of `docker version`:

```
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:22:43 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 05:22:43 2016
 OS/Arch:      linux/amd64
```
Output of `docker info`:

```
Containers: 33
 Running: 3
 Paused: 0
 Stopped: 30
Images: 15
Server Version: 1.12.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 277
 Dirperm1 Supported: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay null host
Swarm: active
 NodeID: 16e15nh1rf84uf6k5h6czxmp9
 Is Manager: true
 ClusterID: 9du6skrzpon6x3dfv8kdh5ygh
 Managers: 1
 Nodes: 2
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 10.0.0.161
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 3.13.0-68-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.337 GiB
Name: ip-10-0-0-161
ID: AZKE:RARU:VGAZ:R6VQ:FWGS:DPNC:REID:QM2P:XRNO:CEL3:7K2C:FV2P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: **********
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8
WARNING: No swap limit support
```
Additional environment details (AWS, VirtualBox, physical, etc.): Running a two-node swarm on AWS: 1 manager, 1 worker.
Steps to reproduce the issue:

- There are 2 distinct web services running in the swarm. One service (let's call it `foo`) runs its application on port 3031, the other (`bar`) on 4000. Each binds to that same port on the host machine. Both get created without issue with `replicas=5`, so there is a mix of tasks running on both nodes (a sketch of the setup follows this list).
- After some time, requests intended for `foo` are actually getting routed to `bar`, even though the two run on different container ports and are published on different swarm ports.
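For reference, a minimal sketch of how this setup could be created. The image names and any additional flags are assumptions, not from the report; only the service names, ports, and replica count come from the description above:

```bash
# Hypothetical reconstruction of the two services described above.
# example/foo-image and example/bar-image are placeholder image names.
docker service create --name foo --replicas 5 --publish 3031:3031 example/foo-image
docker service create --name bar --replicas 5 --publish 4000:4000 example/bar-image
```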
Describe the results you received:

After some time, requests intended for `foo` are actually getting routed to `bar`, even though the two run on different container ports and are published on different swarm ports.

Describe the results you expected:

Requests to the swarm on port 4000 should route to a `bar` container, and requests on port 3031 should route to a `foo` container.
Additional information you deem important (e.g. issue happens only occasionally): This only happens after some time (anywhere from 5-10 updates to the service over 2-30 hours, as new images are pushed out).
Both services are attached to the `ingress` network; they do not have any other networks attached.
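A quick way to check this, using the placeholder service names from above:

```bash
# See which tasks are attached to the ingress network on this node:
docker network inspect ingress

# Confirm which ports each service publishes (foo/bar are the
# placeholder names used above):
docker service inspect foo --format '{{json .Endpoint.Ports}}'
docker service inspect bar --format '{{json .Endpoint.Ports}}'
```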
This tends to happen only to isolated nodes in the swarm, not across the entire swarm. It also always seems to happen to a specific container being routed to on that node. For example, the problem always leads to the following diagnosis:
- Attempt to hit `foo` (3031) using cURL on node B (`curl http://127.0.0.1:3031/foo-request`).
- Each request to this node ends up load balanced to a `bar` container (logs clearly show a `bar` container on node B receiving the request for `http://127.0.0.1:3031/foo-request`).
- The node that is routing `foo` requests to `bar` definitely has `foo` tasks up and running (a loop that makes this visible is sketched after this list).
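Since the ingress load balancer round-robins across tasks, a loop like this makes the misrouting easy to spot. The `/foo-request` path is from the report; the grep pattern is a placeholder — substitute whatever lets you tell the two services' responses apart:

```bash
# Hit foo's published port repeatedly; with a healthy routing mesh every
# response should come from a foo task. 'foo\|bar' is a placeholder match.
for i in $(seq 1 20); do
  curl -s http://127.0.0.1:3031/foo-request | grep -io 'foo\|bar' | head -n1
done
```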
Running `docker service rm` and then creating the service again, and letting it sit, gets things going for a short period of time, but things always get "confused" again over time on individual nodes.
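The workaround amounts to something like this (image name and flags are the same placeholders reconstructed above):

```bash
# Temporary workaround from the report: remove and recreate the service.
# This resets routing for a while but does not address the root cause.
docker service rm foo
docker service create --name foo --replicas 5 --publish 3031:3031 example/foo-image
```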
The swarm logs show this message occasionally, but it doesn't correlate with the issue:

```
time="2016-08-22T08:33:35.667874882Z" level=warning msg="2016/08/22 08:33:35 [WARN] memberlist: Was able to reach ip-10-0-0-161 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP\n"
```
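That memberlist warning points at UDP reachability between nodes. Swarm needs 2377/tcp (manager traffic), 7946/tcp+udp (gossip), and 4789/udp (overlay VXLAN) open between nodes, so it may be worth probing the UDP side directly. A rough sketch, assuming `nmap` and `nc` are available and substituting the actual node IPs:

```bash
# Probe the swarm gossip and VXLAN UDP ports from the other node
# (10.0.0.161 is the manager's address from the report):
nmap -sU -p 7946,4789 10.0.0.161

# With the daemon stopped on the target node, a raw end-to-end UDP test:
#   on the target node:  nc -u -l 7946
#   on this node:        echo ping | nc -u -w1 10.0.0.161 7946
```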
Is there a way to look at the routing table being used on a node for the `ingress` network? I can't seem to find a way to inspect that table and check whether routes stop being updated after a period of time.
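There's no first-class CLI for this, but the routing mesh is programmed with IPVS inside a hidden network namespace on each node, so it can be inspected directly. This relies on an internal detail (`ingress_sbox`) that isn't a stable interface, so treat it as a debugging sketch:

```bash
# List the IPVS virtual services/backends the routing mesh has programmed
# (requires root and the ipvsadm package):
sudo nsenter --net=/var/run/docker/netns/ingress_sbox ipvsadm -ln

# Show how published ports are fwmark'ed before being handed to IPVS:
sudo nsenter --net=/var/run/docker/netns/ingress_sbox iptables -t mangle -nL PREROUTING
```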
About this issue
- State: closed
- Created 8 years ago
- Comments: 17 (6 by maintainers)
@mu5h3r While simulating a network partitioning scenario and watching how the cluster recovers, I found issues in the gossip cluster recovery and pushed a fix in docker/libnetwork#1446, which should fix the recovery part.
@dangra I don’t think what you are experiencing in #26590 is the same issue as this.