moby: Upgrading Swarm Manager to 17.12.0 from 17.09.01 breaks ingress network

Description

Steps to reproduce the issue:

  1. Create a 3 manager swarm cluster with 17.09.01-ce
  2. Drain the first manager you want to upgrade
  3. Upgrade that swarm manager to 17.12.0-ce
  4. Set the manager back to active and deploy a service that publishes a port through the ingress network
  5. View the status of the ingress network on the new manager (see the command sketch below)
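
For reference, a minimal command-level sketch of these steps (the node name matches the report; the service name, image, and exact package pin are illustrative assumptions):

# Step 2, on any manager: drain the node to be upgraded
docker node update --availability drain dev-swarm-manager-1

# Step 3, on dev-swarm-manager-1: upgrade the engine (Ubuntu example; package version string is an assumption)
apt-get update && apt-get install -y docker-ce=17.12.0~ce-0~ubuntu

# Step 4, on any manager: return the node to service, then deploy a service publishing a port through ingress
docker node update --availability active dev-swarm-manager-1
docker service create --name web --publish 8080:80 nginx

# Step 5, on the upgraded manager: check the ingress network
docker network inspect ingress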

Describe the results you received: After the upgrade, the swarm manager's ingress network no longer works correctly, and the node cannot find its peers.

"Failed to find a load balancer IP to use for network: jttyybmsk9k45p8o2w95huz52"

The Created date is zeroed out, and no peers are shown. Output of docker network inspect ingress:

[
    {
        "Name": "ingress",
        "Id": "jttyybmsk9k45p8o2w95huz52",
        "Created": "0001-01-01T00:00:00Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.255.0.0/16",
                    "Gateway": "10.255.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": true,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": null
    }
]
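
A quick way to spot this broken state on each node is a format template over the same inspect output; the zeroed Created timestamp and the missing Peers list stand out immediately (a sketch; exact timestamp formatting may vary):

docker network inspect ingress --format 'Created={{.Created}} Peers={{len .Peers}}'
# broken node:  Created=0001-01-01 00:00:00 +0000 UTC ... Peers=0
# healthy node: Created=2018-01-05 15:59:00 +0000 UTC ... Peers=6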

> journalctl -u docker.service

Jan 05 15:54:58 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:54:58.292600030Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:54:59 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:54:59.292307212Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:00 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:00.292482839Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:01 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:01.292923988Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:02 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:02.293461514Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:03 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:03.294105186Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:04 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:04.294598332Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:05 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:05.295028437Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:06 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:06.295668929Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"
Jan 05 15:55:07 dev-swarm-manager-1 dockerd[2565]: time="2018-01-05T15:55:07.295948610Z" level=error msg="error receiving response" error="rpc error: code = Unimplemented desc = unknown method StreamRaftMessage"

Describe the results you expected:

[
    {
        "Name": "ingress",
        "Id": "jttyybmsk9k45p8o2w95huz52",
        "Created": "2018-01-05T15:59:00.330647797Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.255.0.0/16",
                    "Gateway": "10.255.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": true,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "43c0062ea6cac4281985a5ae83c6924b7e4e5ddb493396a5bbc467e2fcdfec46",
                "MacAddress": "02:42:0a:ff:00:03",
                "IPv4Address": "10.255.0.3/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "dev-swarm-manager-2-59f2ef982566",
                "IP": "10.21.5.6"
            },
            {
                "Name": "dev-swarm-manager-3-0fd3c4b8bb56",
                "IP": "10.21.5.3"
            },
            {
                "Name": "dev-swarm-worker-2-4cedf0af0db1",
                "IP": "10.21.5.9"
            },
            {
                "Name": "dev-swarm-worker-1-e9ec013b553c",
                "IP": "10.21.5.7"
            },
            {
                "Name": "dev-swarm-worker-3-9a9051af1700",
                "IP": "10.21.5.8"
            },
            {
                "Name": "dev-swarm-manager-1-44e67a971c61",
                "IP": "10.21.5.4"
            }
        ]
    }
]

> journalctl -u docker.service should show the standard info messages for peer joins:

Jan 05 16:01:26 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:01:26.408714646Z" level=info msg="Node join event for dev-swarm-worker-1-e9ec013b553c/10.21.5.7"
Jan 05 16:01:29 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:01:29.631089430Z" level=info msg="Node join event for dev-swarm-manager-2-59f2ef982566/10.21.5.6"
Jan 05 16:01:56 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:01:56.411635308Z" level=info msg="Node join event for dev-swarm-worker-3-9a9051af1700/10.21.5.8"
Jan 05 16:02:26 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:02:26.414415630Z" level=info msg="Node join event for dev-swarm-manager-2-59f2ef982566/10.21.5.6"
Jan 05 16:02:56 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:02:56.417536038Z" level=info msg="Node join event for dev-swarm-worker-3-9a9051af1700/10.21.5.8"
Jan 05 16:02:59 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:02:59.630183224Z" level=info msg="Node join event for dev-swarm-manager-2-59f2ef982566/10.21.5.6"
Jan 05 16:03:26 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:03:26.420245662Z" level=info msg="Node join event for dev-swarm-worker-2-4cedf0af0db1/10.21.5.9"
Jan 05 16:03:29 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:03:29.177811962Z" level=info msg="Node join event for dev-swarm-worker-1-e9ec013b553c/10.21.5.7"
Jan 05 16:03:56 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:03:56.422717132Z" level=info msg="Node join event for dev-swarm-manager-2-59f2ef982566/10.21.5.6"
Jan 05 16:03:59 dev-swarm-manager-1 dockerd[2597]: time="2018-01-05T16:03:59.178312215Z" level=info msg="Node join event for dev-swarm-worker-1-e9ec013b553c/10.21.5.7"

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

root@dev-swarm-manager-1:/# docker version
Client:
 Version:	17.12.0-ce
 API version:	1.35
 Go version:	go1.9.2
 Git commit:	c97c6d6
 Built:	Wed Dec 27 20:11:19 2017
 OS/Arch:	linux/amd64

Server:
 Engine:
  Version:	17.12.0-ce
  API version:	1.35 (minimum version 1.12)
  Go version:	go1.9.2
  Git commit:	c97c6d6
  Built:	Wed Dec 27 20:09:53 2017
  OS/Arch:	linux/amd64
  Experimental:	false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 17.12.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: active
 NodeID: 3i3pigwa7vtugntyi7iglrztg
 Is Manager: true
 ClusterID: 1pj89qddk5ttm4t7nxqb4kgld
 Managers: 3
 Nodes: 6
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 0
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 10.21.5.4
 Manager Addresses:
  10.21.5.3:2377
  10.21.5.4:2377
  10.21.5.6:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: b2567b37d7b75eb4cf325b77297b140ea686ce8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-1002-gcp
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.301GiB
Name: dev-swarm-manager-1
ID: ZGSG:LCO4:MHJS:EAUT:OILG:GSSH:QGBZ:Q3M2:BMFL:2UIW:Q7LY:DQBP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.): Google Cloud Platform GCE instances

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 9
  • Comments: 40 (12 by maintainers)

Most upvoted comments

Tested, and it runs OK for me!

Tested as well. The whole cluster update to 17.12.1-ce went well. Thanks, guys!

Looks like 17.12.1-ce has addressed this, and it is available as of Feb 27. Has anyone tried upgrading yet?

I can confirm that this issue has not been fixed yet; even after upgrading all the servers in the swarm, overlay networking continues to NOT work.

I’ve just joined a 17.12.0 node to a swarm of 17.09.1 nodes. It was only when I drained all the older managers, ready for upgrade, that the entire cluster unexpectedly went dark at the network level. They all claim to be healthy; the new node is the only one without a docker_gwbridge interface, and it shows the OP’s error in its logs (a quick check for the interface is sketched below).

Perhaps this should be mentioned in the known issues?
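
A quick check for the missing docker_gwbridge interface described above (a sketch; the interface name is as reported):

ip link show docker_gwbridge
# healthy node:  prints the bridge interface details
# affected node: Device "docker_gwbridge" does not exist.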

I encountered the same issue after upgrading to 17.12.0. This is quite broken for a release! I had to revert to 17.09 to get my global services to re-launch on a node that was drained before the upgrade. In the end I had to force the node out of the swarm and rejoin it to recover, in case that helps anyone else in the same boat. I’m also running with --storage-driver=devicemapper, and had to destroy and recreate my thinpool to revert.
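
For anyone attempting the force-out-and-rejoin recovery described in the last comment, a rough sketch of the commands involved (the token is a placeholder; node and manager addresses are taken from the report as examples):

# on the affected node: leave the swarm even though it still claims to be healthy
docker swarm leave --force

# on a healthy manager: remove the stale node entry
docker node rm --force dev-swarm-manager-1

# on a healthy manager: print the join command with the current token
docker swarm join-token manager

# back on the affected node: rejoin with the printed token and a reachable manager address
docker swarm join --token <token> 10.21.5.6:2377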