moby: swarm mode duplicate ip addresses

Description: We recently had several issues with our cloud provider which led to several Docker swarms failing with strange errors. One thing we observed is that, from time to time, IP addresses in the swarm overlay network are assigned more than once. We made several attempts to reproduce this and finally found a way to trigger the duplicate IP address issue.

Steps to reproduce the issue:

  1. Create a docker swarm with several nodes (five in our example)
  2. Start some services in the swarm with at least two replicas
  3. Start some containers on swarm nodes that are attached to the swarm overlay network
  4. Kill the docker process (kill -9 <main_pid from systemctl status docker>) and/or kill some service tasks running on the same node (kill -9 <main_pid from docker inspect>)
  5. Run "docker network inspect -v <overlay_network>" on the node
  6. Start and stop some services in the swarm
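
For reference, the reproduction roughly corresponds to the commands below; the network name my-overlay, the service name web and the nginx image are placeholders, and step 4 shows only the simplest (daemon-kill) variant:

# Steps 2-3: a replicated service plus a standalone container on the same overlay network
docker network create --driver overlay --attachable my-overlay
docker service create --name web --replicas 2 --network my-overlay nginx:alpine
docker run -d --name probe --network my-overlay alpine:3.6 sleep 1d

# Step 4: on one node, kill the docker daemon hard, then bring it back
kill -9 "$(systemctl show -p MainPID docker | cut -d= -f2)"
systemctl start docker

# Step 5: look for duplicate addresses in the overlay network
docker network inspect -v my-overlay

# Step 6: churn some services
docker service scale web=4
docker service scale web=2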

Describe the results you received: I can see that IP addresses are assigned to more than one service task.

Describe the results you expected: I would expect that every service task is assigned a unique IP address.

Additional information you deem important (e.g. issue happens only occasionally):

What we tried to resolve this issue:

  • let the affected node leave the swarm (docker swarm leave, docker node demote, docker node rm)
  • stop the docker daemon on the affected node
  • rm -rf /var/lib/docker/*
  • start the docker daemon
  • rejoin the swarm

There are still duplicate IP addresses (docker network inspect -v <overlay_network>).
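
A quick, admittedly crude way to spot them is to pull every address out of the verbose inspect output and print the ones that occur more than once (some repetition can be legitimate, e.g. the same task listed in several sections, but duplicated task or VIP addresses stand out this way):

docker network inspect -v <overlay_network> \
  | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/[0-9]+' \
  | sort | uniq -d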

Next try:

  • let the node leave the swarm again
  • stop the host
  • delete the host
  • create a new host (new host name, new ip address)
  • let new host join swarm as manager

Again, there are duplicate IP addresses, which are assigned to the host that was completely removed from the swarm (see attached file overlay.txt).

Output of docker version:


Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:25 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:06:25 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info: see attached file docker_info.txt

Additional environment details (AWS, VirtualBox, physical, etc.):

Environment: OS: CentOS 7 running in OpenStack in a public cloud.

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 2
  • Comments: 36 (14 by maintainers)

Most upvoted comments

@selfisch can you try 17.10? The IP allocation is now serialized, so as long as you don’t get close to exhausting the IP space, you should not see this behavior anymore.

Just seen this issue again using 17.12.0: Debian 9, Docker 17.12.0, 5-node swarm. The leader was node 1; I stopped the docker service on node 1 and node 3 was elected leader, with this log from node 3. 172.20.0.0/16 is a swarm overlay network. I can also confirm that by explicitly setting EndpointSpec to ‘vip’ this problem doesn’t occur.

Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.089601263Z" level=info msg="raft.node: 3df26cdd6b9ca2f6 lost leader 8aa0070a47725b9 at term 3" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.090749705Z" level=error msg="error sending message to peer" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable"
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.094278372Z" level=info msg="3df26cdd6b9ca2f6 received MsgVoteResp from 5638169dcb549a49 at term 3" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.094310576Z" level=info msg="3df26cdd6b9ca2f6 [quorum:3] has received 2 MsgVoteResp votes and 0 vote rejections" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.094770629Z" level=info msg="3df26cdd6b9ca2f6 received MsgVoteResp from 7806e19ce2afc1ce at term 3" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.094794939Z" level=info msg="3df26cdd6b9ca2f6 [quorum:3] has received 3 MsgVoteResp votes and 0 vote rejections" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.094821283Z" level=info msg="3df26cdd6b9ca2f6 became leader at term 3" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.094841777Z" level=info msg="raft.node: 3df26cdd6b9ca2f6 elected leader 3df26cdd6b9ca2f6 at term 3" module=raft node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.095630852Z" level=error msg="error sending message to peer" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable"
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.097995658Z" level=error msg="error creating cluster object" error="name conflicts with an existing object" module=node node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.098058880Z" level=error msg="failed to create predefined network bridge" error="name conflicts with an existing object" module=node node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.098082007Z" level=error msg="failed to create predefined network host" error="name conflicts with an existing object" module=node node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121627778Z" level=error msg="IPAM failure while releasing VIP address 172.20.0.17/16" error="invalid pool id: "
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121665630Z" level=error msg="error deallocating vip" error="invalid pool id: " vip.addr=172.20.0.17/16 vip.network=0ac9fqojng0vlm7wzm1k5rnyv
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121699088Z" level=error msg="IPAM failure while releasing VIP address 172.20.0.13/16" error="invalid pool id: "
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121713297Z" level=error msg="error deallocating vip" error="invalid pool id: " vip.addr=172.20.0.13/16 vip.network=0ac9fqojng0vlm7wzm1k5rnyv
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121734630Z" level=error msg="IPAM failure while releasing VIP address 172.20.0.19/16" error="invalid pool id: "
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121756237Z" level=error msg="error deallocating vip" error="invalid pool id: " vip.addr=172.20.0.19/16 vip.network=0ac9fqojng0vlm7wzm1k5rnyv
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121913793Z" level=error msg="IPAM failure while releasing VIP address 172.20.0.33/16" error="invalid pool id: "
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.121933718Z" level=error msg="error deallocating vip" error="invalid pool id: " vip.addr=172.20.0.33/16 vip.network=0ac9fqojng0vlm7wzm1k5rnyv
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.128984876Z" level=error msg="failed allocating task 85w4kmajsv28yby3mar1ve87x during init" error="failed to allocate network IP for task 85w4kmajsv28yby3mar1ve87x network 0ac9fqojng0vlm7wzm1k5rnyv: could not allocate IP from IPAM: Address already in use" module=node node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.129096085Z" level=error msg="failed allocating task axocfqq3j1ttosq5loby1in92 during init" error="failed to allocate network IP for task axocfqq3j1ttosq5loby1in92 network 0ac9fqojng0vlm7wzm1k5rnyv: could not allocate IP from IPAM: Address already in use" module=node node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.137237698Z" level=error msg="failed allocating task 85w4kmajsv28yby3mar1ve87x during init" error="failed to allocate network IP for task 85w4kmajsv28yby3mar1ve87x network 0ac9fqojng0vlm7wzm1k5rnyv: could not allocate IP from IPAM: Address already in use" module=node node.id=yq65b6zkmascltgu132tx10hd
Jan 11 15:52:08 lcy2-hosting-capdocker-node03 dockerd[7987]: time="2018-01-11T15:52:08.137340244Z" level=error msg="failed allocating task axocfqq3j1ttosq5loby1in92 during init" error="failed to allocate network IP for task axocfqq3j1ttosq5loby1in92 network 0ac9fqojng0vlm7wzm1k5rnyv: could not allocate IP from IPAM: Address already in use" module=node node.id=yq65b6zkmascltgu132tx10hd

@fcrisciani yeah, I was gonna say that maybe the issue is something more superficial. Anyway, that is good to know and I believe the issue is resolved for us.

Ok, short feedback.

With existing duplicated IPs in a swarm at version 17.09, performing an update to 17.10 does not resolve the issue automatically. But after updating all nodes to 17.10 and reinitializing the whole swarm, everything seems to be fine.

I restarted several nodes several times this morning and no duplicate IPs are coming up. Hurray 😃
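
For clarity, "reinitializing the whole swarm" here means tearing the cluster down and rebuilding it from scratch, which discards the old raft state (including the IPAM allocations) but also requires redeploying all services. Roughly:

# On every worker node
docker swarm leave

# On every manager node (this discards the old raft state)
docker swarm leave --force

# On the node that will be the first manager of the new cluster
docker swarm init --advertise-addr <manager_ip>

# Join the remaining nodes using the tokens printed by
docker swarm join-token manager
docker swarm join-token worker

# Then re-create the overlay networks and redeploy the services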

Ok, found a fairly straightforward way to reproduce this on 17.06.2: reboot one of the nodes.

What I do is dump the state of the services before and after the reboot, like so:

docker service inspect $(docker service ls -q | sort) > state

After the reboot we often get duplicate IP addresses for services. When you diff the result of the above command you get the output below. Without any docker commands being executed, simply restarting one of the nodes causes the service IPs to be changed and duplicated. This makes no sense, but it’s what happens. And the IPs that get allocated overlap those of other existing services.

Note that with debugging enabled I can see no messages for ReleaseAddress or RequestAddress for these IPs, which basically makes it look like the IP addresses are being generated from broken state. The entire address allocation code seems to be skipped?

--- state.before        2017-09-19 15:49:20.826119094 +0100
+++ state.after 2017-09-19 15:52:18.247053274 +0100
@@ -2,10 +2,10 @@
     {
         "ID": "3ghfeltiuaqehe8o61wwxcdn6",
         "Version": {
-            "Index": 88169
+            "Index": 88195
         },
         "CreatedAt": "2017-09-19T14:49:02.888033308Z",
-        "UpdatedAt": "2017-09-19T14:49:02.889995025Z",
+        "UpdatedAt": "2017-09-19T14:50:56.434654134Z",
         "Spec": {
             "Name": "test-service",
             "Labels": {
@@ -103,15 +103,15 @@
             "VirtualIPs": [
                 {
                     "NetworkID": "09bgu2kmbr8lpp43wxyaaadbs",
-                    "Addr": "10.0.2.26/24"
+                    "Addr": "10.0.2.2/24"
                 },
                 {
                     "NetworkID": "93m63jsve7ikw7ythevaqbskd",
-                    "Addr": "10.0.7.12/24"
+                    "Addr": "10.0.7.2/24"
                 },
                 {
                     "NetworkID": "2vjijgmbxq4cwhihniqeyk5i4",
-                    "Addr": "10.0.1.16/24"
+                    "Addr": "10.0.1.2/24"
                 }
             ]
         }
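
To make such duplicates easier to spot without diffing by hand, something along these lines should work, assuming jq is available (untested sketch; state is the file written by the inspect command above):

# Print NetworkID, VIP and service name, then flag any (network, address) pair seen twice
jq -r '.[] | .Spec.Name as $name | .Endpoint.VirtualIPs[]? | "\(.NetworkID) \(.Addr) \($name)"' state \
  | sort \
  | awk '{ key = $1 " " $2; if (seen[key]++) print "duplicate:", $0 }'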

I found the swarm-raftlog tool to dump the raft logs, and discovered that index 88195 is where the node leadership changes:

Entry Index=88195, Term=45, Type=EntryNormal:
id: 104018129567492
action: <
  action: STORE_ACTION_UPDATE
  node: <
    id: "fwslc19rqb6g2svc998kf02wf"
    meta: <
      version: <
        index: 88194
      >
      created_at: <
        seconds: 1501684854
        nanos: 558535663
      >
      updated_at: <
        seconds: 1505832656
        nanos: 428782461
      >
    >
...
    status: <
      message: "Node moved to \"unknown\" state due to leadership change in cluster"
      addr: "10.54.54.101"
    >

And the next entry updates all the running services, several to bogus IPs (service ekfjd93y2qgg15v1tare811ab had this IP already and it wasn’t changed in this update):

Entry Index=88196, Term=45, Type=EntryNormal:
id: 104018129567493
action: <
  action: STORE_ACTION_UPDATE
  service: <
    id: "3ghfeltiuaqehe8o61wwxcdn6"
    meta: <
      version: <
        index: 88195
      >
      created_at: <
        seconds: 1505832542
        nanos: 888033308
      >
      updated_at: <
        seconds: 1505832656
        nanos: 434654134
      >
    >
...
    endpoint: <
      virtual_ips: <
        network_id: "09bgu2kmbr8lpp43wxyaaadbs"
        addr: "10.0.2.2/24"
      >
      virtual_ips: <
        network_id: "93m63jsve7ikw7ythevaqbskd"
        addr: "10.0.7.2/24"
      >
      virtual_ips: <
        network_id: "2vjijgmbxq4cwhihniqeyk5i4"
        addr: "10.0.1.2/24"
      >
    >
  >
>
... + 15 or so other services, including
action: <
  action: STORE_ACTION_UPDATE
  service: <
    id: "ekfjd93y2qgg15v1tare811ab"
    meta: <
      version: <
        index: 88195
      >
      created_at: <
        seconds: 1501684989
        nanos: 923676596
      >
      updated_at: <
        seconds: 1505832656
        nanos: 434854046
      >
    >
...
    endpoint: <
      virtual_ips: <
        network_id: "09bgu2kmbr8lpp43wxyaaadbs"
        addr: "10.0.2.2/24"
      >
    >
  >
>

So it seems that right after the node leadership changes, the services are updated with bogus service IPs in one single large update. The IP 10.0.2.2/24 is now duplicated.

BTW, it looks like the network allocations are not logged, so the allocation status bitmap must be either transmitted some other way, or the new master must regenerate it from the data it has?

For completeness, here are the original raft records that created the service:

Entry Index=88169, Term=44, Type=EntryNormal:
id: 104018130441308
action: <
  action: STORE_ACTION_CREATE
  service: <
    id: "3ghfeltiuaqehe8o61wwxcdn6"
    meta: <
      version: <
        index: 88168
      >
      created_at: <
        seconds: 1505832542
        nanos: 888033308
      >
      updated_at: <
        seconds: 1505832542
        nanos: 888033308
      >
    >
... no endpoints
>
Entry Index=88170, Term=44, Type=EntryNormal:
id: 104018130441309
action: <
  action: STORE_ACTION_UPDATE
  service: <
    id: "3ghfeltiuaqehe8o61wwxcdn6"
    meta: <
      version: <
        index: 88169
      >
      created_at: <
        seconds: 1505832542
        nanos: 888033308
      >
      updated_at: <
        seconds: 1505832542
        nanos: 889995025
      >
    >
...
    endpoint: <
      virtual_ips: <
        network_id: "09bgu2kmbr8lpp43wxyaaadbs"
        addr: "10.0.2.26/24"
      >
      virtual_ips: <
        network_id: "93m63jsve7ikw7ythevaqbskd"
        addr: "10.0.7.12/24"
      >
      virtual_ips: <
        network_id: "2vjijgmbxq4cwhihniqeyk5i4"
        addr: "10.0.1.16/24"
      >
    >
  >
>
# docker info
Containers: 22
 Running: 6
 Paused: 0
 Stopped: 16
Images: 43
Server Version: 17.06.2-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 330
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: fwslc19rqb6g2svc998kf02wf
 Is Manager: true
 ClusterID: rlcxduvxmyqajvjy7dtpq1kmj
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Root Rotation In Progress: false
 Node Address: 10.54.54.101
 Manager Addresses:
  10.54.54.101:2377
  10.54.54.102:2377
  10.54.54.103:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 5.846GiB
Name: dc1_worker1
ID: KM77:6M2N:PF4V:YYO3:SMFF:IEAV:DYWN:AJJ6:ZEAL:33QR:5P4Q:2BAP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 123
 Goroutines: 244
 System Time: 2017-09-19T17:16:11.550805212+01:00
 EventsListeners: 6
Http Proxy: http://10.54.54.1:3128
Https Proxy: http://10.54.54.1:3128
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No memory limit support
WARNING: No swap limit support
WARNING: No kernel memory limit support
WARNING: No oom kill disable support
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support