moby: IP allocation problems after a node has crashed

Description

I’m trying to track down an issue in the IP allocation code in swarm mode in 17.03. The symptom is that you get errors like the following in the logs:

dockerd[592]: time="2017-03-03T07:30:08.877627238+01:00" level=error msg="task allocation failure" error="failed during networktask allocation for task pptolnhiauhhben0i1t0sjb58: failed to allocate network IP for task pptolnhiauhhben0i1t0sjb58 network 004pvcgpxy5tm8xrq5y5m5caf: could not find an available IP"

This happens eventually to all containers, even though no containers are actually running. I’ve been instrumenting the code to see if I can work out what is going on. There appears to be a discrepancy between the bitmap used to track allocations and the addresses that are actually allocated, so the allocator thinks nothing is available when addresses are in fact free.
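To make the suspected failure mode concrete, here is a toy sketch (my own illustration, not the real libnetwork bitmap code): if an address is handed out a second time for the same task and one of the two is never released, its bit stays set forever, and the pool eventually reports exhaustion even though almost nothing is actually in use.

```go
// Toy bitmap pool, not the real libnetwork allocator: it only illustrates
// how a leaked allocation keeps its bit set forever, so the pool reports
// exhaustion even though the task that owned the address is long gone.
package main

import (
	"errors"
	"fmt"
)

var errExhausted = errors.New("could not find an available IP")

// pool tracks a tiny address range with one bit per address.
type pool struct {
	used []bool
}

// request hands out the first free offset, like a sequential bitmap scan.
func (p *pool) request() (int, error) {
	for i, u := range p.used {
		if !u {
			p.used[i] = true
			return i, nil
		}
	}
	return 0, errExhausted
}

// release clears the bit for an offset.
func (p *pool) release(i int) { p.used[i] = false }

func main() {
	p := &pool{used: make([]bool, 4)}

	first, _ := p.request()  // address handed to the task
	second, _ := p.request() // task is re-allocated and gets a second address
	p.release(first)         // only one of the two is ever released

	fmt.Println("leaked offset:", second)

	// The leaked bit is never cleared, so the pool eventually reports
	// exhaustion even with no task holding that address.
	for {
		if _, err := p.request(); err != nil {
			fmt.Println(err) // "could not find an available IP"
			return
		}
	}
}
```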

I tried asking on the mailing list at https://groups.google.com/forum/#!topic/docker-dev/5AEob-sJbpc but they suggested filing a bug.

Steps to reproduce the issue:

  1. Start a lot of containers on a docker swarm
  2. “systemctl stop docker” on one of the nodes (not the leader)
  3. Look in the debug logs

Describe the results you received:

I’ve instrumented the Docker source to provide extra information about the IP allocation, which can be seen below.
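The instrumentation is roughly of the following shape (a hypothetical wrapper for illustration, not the exact patch I applied): every call into the IPAM driver and its result gets a logrus debug line, which is where the RequestAddress lines below come from.

```go
// Hypothetical instrumentation sketch, not the actual patch: wrap the IPAM
// entry points so every address request/release and its outcome is logged.
package ipamdebug

import (
	"net"

	"github.com/sirupsen/logrus"
)

// Allocator is a stand-in for the IPAM driver interface being wrapped; the
// real libnetwork interface has more methods and richer return values.
type Allocator interface {
	RequestAddress(pool string, prefAddress net.IP, opts map[string]string) (*net.IPNet, error)
	ReleaseAddress(pool string, address net.IP) error
}

type logged struct{ inner Allocator }

func (l logged) RequestAddress(pool string, pref net.IP, opts map[string]string) (*net.IPNet, error) {
	logrus.Debugf("RequestAddress(%s, %v, %v)", pool, pref, opts)
	ipnet, err := l.inner.RequestAddress(pool, pref, opts)
	if err == nil {
		logrus.Debugf("RequestAddress(%s, %v, %v) -> %s", pool, pref, opts, ipnet.IP)
	}
	return ipnet, err
}

func (l logged) ReleaseAddress(pool string, addr net.IP) error {
	logrus.Debugf("ReleaseAddress(%s, %s)", pool, addr)
	return l.inner.ReleaseAddress(pool, addr)
}
```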

First I’m going to show the allocation of one particular task (icms53g11wyyezlp2slkk114w):

Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.270848161+01:00" level=debug msg=AllocateTask network.ID=004pvcgpxy5tm8xrq5y5m5caf network.addrs=[] task.ID=icms53g11wyyezlp2slkk114w
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.270878276+01:00" level=debug msg=allocateNetworkIPs addr=<nil> pool="GlobalDefault/10.0.2.0/24"
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.270894680+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[])"
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.270918397+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[]) -> 10.0.2.35"
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.271817501+01:00" level=debug msg="state changed" module="node/agent/taskmanager" state.desired=RUNNING state.transition="READY->STARTING" task.id=0iy76wr1qiotg8rxijk1khkrg
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.272274868+01:00" level=debug msg="(*Agent).UpdateTaskStatus" module="node/agent" task.id=0iy76wr1qiotg8rxijk1khkrg

Here the address 10.0.2.35 is assigned to task icms53g11wy, which runs on the node we are going to fail. Here is the task state afterwards, showing the relationship between the tasks:

ID            NAME                   IMAGE                                                                                                                  NODE          DESIRED STATE  CURRENT STATE            ERROR                             PORTS
jirumn4fviot  test--1-database.1      registry.local/test/postgres@sha256:3faad8747c01cc2615fc1309406f3b199805a2c07c84e4e10a605e16c3f81628  b2-test  Running        Running 18 minutes ago                                     
icms53g11wyy   \_ test--1-database.1  registry.local/test/postgres@sha256:3faad8747c01cc2615fc1309406f3b199805a2c07c84e4e10a605e16c3f81628  b1-test  Shutdown       Shutdown 14 minutes ago                                    

Now we stop node b1. This leads to the following messages on the leader:

Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.734984509+01:00" level=debug msg="13a40866699b26e7 failed to send message to 24c4132b76046066 because it is unreachable [next = 50708, match = 50707, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]"
...
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.750546525+01:00" level=debug msg=AllocateTask network.ID=004pvcgpxy5tm8xrq5y5m5caf network.addrs=[] task.ID=jirumn4fviotsiriuumsjzwua
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.750573040+01:00" level=debug msg=allocateNetworkIPs addr=<nil> pool="GlobalDefault/10.0.2.0/24"
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.750588389+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[])"
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.750607776+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[]) -> 10.0.2.64"
...
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.798300618+01:00" level=debug msg=AllocateTask network.ID=004pvcgpxy5tm8xrq5y5m5caf network.addrs=[] task.ID=jirumn4fviotsiriuumsjzwua
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.798357874+01:00" level=debug msg=allocateNetworkIPs addr=<nil> pool="GlobalDefault/10.0.2.0/24"
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.798377503+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[])"
Mar  9 13:13:26 b2-test dockerd[31967]: time="2017-03-09T13:13:26.798419297+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[]) -> 10.0.2.74"

Note that two addresses have now been allocated to task jirumn4fviotsiriuumsjzwua: 10.0.2.64 and 10.0.2.74.

ReleaseAddress(10.0.2.74) is never called, so this address is leaked permanently.
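To show why I think the leak sits in the task allocator rather than in the IPAM driver, here is a hypothetical guard (my own sketch, not the actual swarmkit code or a proposed patch): if the allocator remembered that the task already holds an address on this network, the second allocation pass would be idempotent instead of requesting a fresh address.

```go
package main

import (
	"fmt"
	"net"
)

// allocateTaskIP sketches the missing idempotency check (hypothetical; the
// real swarmkit allocator keeps its state elsewhere): if a task already
// holds an address on the network, a repeated allocation pass returns that
// address instead of requesting and leaking a new one.
func allocateTaskIP(allocated map[string]net.IP, taskID string, request func() (net.IP, error)) (net.IP, error) {
	if ip, ok := allocated[taskID]; ok {
		return ip, nil
	}
	ip, err := request()
	if err != nil {
		return nil, err
	}
	allocated[taskID] = ip
	return ip, nil
}

func main() {
	allocated := map[string]net.IP{}
	next := byte(64)
	request := func() (net.IP, error) {
		ip := net.IPv4(10, 0, 2, next)
		next++
		return ip, nil
	}

	// Two allocation passes for the same task (as in the log above) return
	// the same address instead of handing out 10.0.2.64 and then 10.0.2.74.
	a, _ := allocateTaskIP(allocated, "jirumn4fviotsiriuumsjzwua", request)
	b, _ := allocateTaskIP(allocated, "jirumn4fviotsiriuumsjzwua", request)
	fmt.Println(a, b) // 10.0.2.64 10.0.2.64
}
```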

Describe the results you expected:

That the address 10.0.2.74 would be returned to the pool.

Additional information you deem important (e.g. issue happens only occasionally):

This allocation problem seems to happen for every overlay network attached to every task on the node that fails, but it does not seem to affect the ingress network. It generally takes a while to actually run out of IPs, but with only 254 usable addresses per network it doesn’t take that long. Global mode services do not seem to be affected either.
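For a sense of the rate (assumed numbers for illustration, not measurements): if each failover leaks one address per task on a given overlay network, a /24 does not survive many cycles.

```go
package main

import "fmt"

// Back-of-the-envelope sketch with assumed numbers: addresses leaked per
// node failure versus the 254 usable addresses in a /24 overlay subnet.
func main() {
	const usable = 254
	inUse := 30            // addresses legitimately held by running tasks (assumed)
	leakedPerFailure := 20 // tasks on the failed node, one leak each (assumed)

	free := usable - inUse
	fmt.Printf("node failures until exhaustion: about %d\n", free/leakedPerFailure)
}
```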

Note that I haven’t been able to trigger this on a trivial swarm; in particular, it does not reproduce when I try to trigger it in a test case.

May or may not be relevant:

Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.487664361+01:00" level=debug msg="assigning to node 0viyjq1wphptbllu3guubjqz1" task.id=icms53g11wyyezlp2slkk114w
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.487681921+01:00" level=info msg="addTask icms53g11wyyezlp2slkk114w (node 0viyjq1wphptbllu3guubjqz1) -= (643000000, 1000000) => (957410752, 1864481000)"
...
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.488716198+01:00" level=debug msg="assigning to node gvyh45n4iccw9ip0u7p0k0w61" task.id=icms53g11wyyezlp2slkk114w
Mar  9 13:12:42 b2-test dockerd[31967]: time="2017-03-09T13:12:42.488739701+01:00" level=info msg="addTask icms53g11wyyezlp2slkk114w (node gvyh45n4iccw9ip0u7p0k0w61) -= (643000000, 1000000) => (122410752, 1729432000)"

The task got assigned to two nodes. This incidentally also causes problems with the resource tracking.
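A minimal sketch of the resource-tracking side effect (hypothetical types; the real swarmkit scheduler bookkeeping is richer): with the task assigned to two nodes, its reservation is subtracted on both, which matches the two addTask lines above.

```go
package main

import "fmt"

// node is a stand-in for the scheduler's per-node bookkeeping (hypothetical
// shape): available resources shrink every time a task is assigned here.
type node struct {
	id                string
	availableNanoCPUs int64
	availableMemory   int64
}

func (n *node) addTask(nanoCPUs, memory int64) {
	n.availableNanoCPUs -= nanoCPUs
	n.availableMemory -= memory
}

func main() {
	a := &node{id: "0viyjq1wphptbllu3guubjqz1", availableNanoCPUs: 1600410752, availableMemory: 1865481000}
	b := &node{id: "gvyh45n4iccw9ip0u7p0k0w61", availableNanoCPUs: 765410752, availableMemory: 1730432000}

	// The same task is "assigned" twice, so both nodes are charged its
	// reservation of (643000000, 1000000), reproducing the two addTask
	// log lines above.
	a.addTask(643000000, 1000000)
	b.addTask(643000000, 1000000)

	fmt.Println(a.id, a.availableNanoCPUs, a.availableMemory) // ... 957410752 1864481000
	fmt.Println(b.id, b.availableNanoCPUs, b.availableMemory) // ... 122410752 1729432000
}
```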

Output of docker version:

Client:
 Version:      17.03.0-ce
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   b2cba28-unsupported
 Built:        Thu Mar  9 12:05:56 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.0-ce
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   b2cba28-unsupported
 Built:        Thu Mar  9 12:05:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 20
 Running: 6
 Paused: 0
 Stopped: 14
Images: 23
Server Version: 17.03.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 495
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: 0viyjq1wphptbllu3guubjqz1
 Is Manager: true
 ClusterID: ca0bxa3fc6iv7onw08nlsea2b
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.16.102.103
 Manager Addresses:
  172.16.102.101:2377
  172.16.102.102:2377
  172.16.102.103:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.873 GiB
Name: b3-test
ID: 5EJE:N4PB:GHW4:ACZO:GUPI:AHEK:QDTK:A35I:N7LO:EDX4:NXL2:YW2U
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 104
 Goroutines: 221
 System Time: 2017-03-09T13:51:48.49269439+01:00
 EventsListeners: 6
Registry: https://index.docker.io/v1/
WARNING: No kernel memory limit support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Debian Jessie Linux on VMware

# docker node ls
ID                           HOSTNAME      STATUS  AVAILABILITY  MANAGER STATUS
0viyjq1wphptbllu3guubjqz1 *  b3-test  Ready   Active        Reachable
gvyh45n4iccw9ip0u7p0k0w61    b1-test  Down    Active        Unreachable
oiv4mjh3us46phi0gjnmz5jup    b2-test  Ready   Active        Leader

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 22 (8 by maintainers)

Most upvoted comments

We were able to replicate this issue, or at least a similar problem. Our worker nodes’ Docker service would crash without producing any logs until we removed the nodes from the swarm (on the master node) and then restarted Docker. Only after reconfiguring the network settings in our stack were we able to get our workers to rejoin the swarm.

How we re-created the issue:

  1. Have a swarm master and hundreds of worker nodes set up in a single swarm.
  2. Our stackfile has network:host for 4 of the 5 services, but one service uses default networking; this created a default overlay network.
  3. Try to scale each service to one task per node, which exhausts the 256 IP assignments (see the sketch after this list).
  4. Watch several nodes start dropping from the swarm master.
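For scale, a rough capacity check (assumed numbers: a default overlay subnet is a /24, one address per task plus a virtual IP for the service) shows why one default-networked service spread across hundreds of nodes exhausts the pool even before any leak:

```go
package main

import "fmt"

// Rough /24 capacity check with assumed numbers; the exact count of
// reserved addresses depends on the network configuration.
func main() {
	const subnetSize = 256
	reserved := 3 // network, broadcast and gateway addresses (assumed)
	vips := 1     // one virtual IP for the default-networked service (assumed)

	usable := subnetSize - reserved - vips
	tasks := 300 // "hundreds of worker nodes", one task each (assumed)

	fmt.Printf("usable addresses: %d, tasks wanted: %d, shortfall: %d\n",
		usable, tasks, tasks-usable)
}
```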