rancher: cross node communication failures

Rancher versions: rancher/server: 1.6.5 rancher/agent: :v1.2.5

Infrastructure Stack versions: healthcheck: 0.3.1 ipsec: holder network-services: 0.7.4 metadata 0.9.2 scheduler: 0.8.2

Docker version: (docker version,docker info preferred)

rancher@rancher:~$ docker info
Containers: 32
 Running: 1
 Paused: 0
 Stopped: 31
Images: 20
Server Version: 17.03.1-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local rancher-nfs
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.34-rancher
Operating System: RancherOS v1.0.3
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 4.834 GiB
Name: rancher
ID: DRVM:BVVB:X6KB:UY3N:RKPD:EI6L:HNLC:WVBP:ZC3R:ADYR:QOHZ:IKHH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

rancher@rancher:~$ cat /etc/os-release 
NAME="RancherOS"
VERSION=v1.0.3
ID=rancheros
ID_LIKE=
VERSION_ID=v1.0.3
PRETTY_NAME="RancherOS v1.0.3"
HOME_URL="http://rancher.com/rancher-os/"
SUPPORT_URL="https://forums.rancher.com/c/rancher-os"
BUG_REPORT_URL="https://github.com/rancher/os/issues"
BUILD_ID=
rancher@rancher:~$ uname -r
4.9.34-rancher

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)

KVM

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)

single node rancher

Environment Template: (Cattle/Kubernetes/Swarm/Mesos)

cattle

Steps to Reproduce:

Set up three Rancher nodes, all on the same KVM host. The KVM host has a bridged interface with one assignable public IP, and a private virtual network through which all the nodes connect to each other.

One node has dual NICs with both private and public IPs, and the label publice=true.

Two nodes have a single NIC with only a private IP, and the label private=true.

Results:

Eventually I find that containers lose communication with each other across nodes, like in this issue. Now here’s the weird part: I “fixed” this by evacuating one of the private=true hosts. As long as that host stays inactive, everything stays green and good. But if I activate both, these weird problems start cropping up. Evacuate, and in about ten minutes everything is back to green again. The problem seems to be preceded by one of the private hosts’ healthchecks going yellow. The healthcheck appears to be fine other than losing its ability to communicate with other containers; its logs look like this:

7/25/2017 2:52:53 PM time="2017-07-25T19:52:53Z" level=info msg="Scheduling apply config"
7/25/2017 2:52:53 PM time="2017-07-25T19:52:53Z" level=info msg="healthCheck -- no changes in haproxy config\n"

I am at a loss to explain what is going wrong. Any ideas? Does anyone else see this behavior? It seems to happen only in long-running environments (ones that have been up for months and have survived many upgrades of Rancher and the RancherOS beneath it).

About this issue

State: closed
Created 7 years ago
Comments: 16 (2 by maintainers)

Most upvoted comments

@joshuacox when you are registering the host can you try using CATTLE_AGENT_IP so that cattle (rancher/server) doesn’t use the KVM host’s IP address for both of these nodes?

leodotcloud on Jul 28, 2017

Can you share a screenshot of the Infrastructure -> Hosts view?

superseb on Jul 25, 2017
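
For readers following leodotcloud's CATTLE_AGENT_IP suggestion above, here is a minimal sketch of what re-registering a host might look like. It is not taken from the thread. The agent tag v1.2.5 comes from the report; the IP address and registration URL are hypothetical placeholders, and the real URL (including its token) should be copied from the Add Host screen of your own Rancher server. The script only prints the command so it can be reviewed before being run on the node.

```shell
#!/bin/sh
# Dry-run helper: prints the rancher/agent registration command with
# CATTLE_AGENT_IP set, instead of executing it.

# ASSUMPTION: hypothetical private IP of the node being registered.
AGENT_IP="192.168.122.11"
# ASSUMPTION: placeholder URL; copy the real one from the Rancher UI.
REGISTRATION_URL="http://rancher-server.example:8080/v1/scripts/REPLACE_WITH_TOKEN"

# $1 = IP that rancher/server should record for this host
# $2 = registration URL from the Rancher UI
build_agent_cmd() {
  printf '%s ' sudo docker run -e "CATTLE_AGENT_IP=$1" --rm --privileged \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher \
    rancher/agent:v1.2.5 "$2"
  printf '\n'
}

build_agent_cmd "$AGENT_IP" "$REGISTRATION_URL"
```

Run it once per node with that node's own private IP, so that each host is registered under a distinct address rather than the shared KVM host address.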