rancher: cross-node communication failures
Rancher versions: rancher/server: v1.6.5, rancher/agent: v1.2.5
Infrastructure Stack versions: healthcheck: 0.3.1, ipsec: holder, network-services: 0.7.4, metadata: 0.9.2, scheduler: 0.8.2
Docker version: (`docker version`, `docker info` preferred)
```
rancher@rancher:~$ docker info
Containers: 32
 Running: 1
 Paused: 0
 Stopped: 31
Images: 20
Server Version: 17.03.1-ce
Storage Driver: overlay
 Backing Filesystem: extfs
 Supports d_type: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local rancher-nfs
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.34-rancher
Operating System: RancherOS v1.0.3
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 4.834 GiB
Name: rancher
ID: DRVM:BVVB:X6KB:UY3N:RKPD:EI6L:HNLC:WVBP:ZC3R:ADYR:QOHZ:IKHH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```
Operating system and kernel: (`cat /etc/os-release`, `uname -r` preferred)
```
rancher@rancher:~$ cat /etc/os-release
NAME="RancherOS"
VERSION=v1.0.3
ID=rancheros
ID_LIKE=
VERSION_ID=v1.0.3
PRETTY_NAME="RancherOS v1.0.3"
HOME_URL="http://rancher.com/rancher-os/"
SUPPORT_URL="https://forums.rancher.com/c/rancher-os"
BUG_REPORT_URL="https://github.com/rancher/os/issues"
BUILD_ID=
rancher@rancher:~$ uname -r
4.9.34-rancher
```
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
KVM
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
single node rancher
Environment Template: (Cattle/Kubernetes/Swarm/Mesos)
cattle
Steps to Reproduce:
Set up three Rancher nodes, all on the same KVM host, with a bridged interface providing one assignable public IP and a private virtual network through which all the nodes connect to each other:
- one node with dual NICs, both private and public IPs, and the label `public=true`
- two nodes with a single NIC, only a private IP, and the label `private=true`
Results:
Eventually containers lose communication across one of the hosts, as described in this issue. Now here's the weird part: I "fixed" this by evacuating one of the `private=true` hosts. As long as that host stays inactive, everything stays green and good. But if I activate both, these weird problems start cropping up. Evacuate again, and in about ten minutes everything is back to green. The failure seems to be preceded by one of the private hosts' healthchecks going yellow. The healthcheck container appears to be fine apart from losing its ability to communicate with other containers; its logs look like this:
```
7/25/2017 2:52:53 PM time="2017-07-25T19:52:53Z" level=info msg="Scheduling apply config"
7/25/2017 2:52:53 PM time="2017-07-25T19:52:53Z" level=info msg="healthCheck -- no changes in haproxy config\n"
```
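When the healthcheck goes yellow like this, one way to confirm it is really the overlay (ipsec) network that has broken, rather than the healthcheck itself, is to ping a container's 10.42.x.x managed-network IP from a container on the other host. This is a generic troubleshooting sketch, not from the original report; the container names and IP are placeholders:

```shell
# Hypothetical cross-host connectivity check on Rancher's managed (ipsec)
# network. <container-on-host-B> and <container-on-host-A> are placeholders
# for containers attached to the managed network on each host.

# On host B, read the container's managed-network IP (also visible in the UI):
docker inspect --format '{{index .Config.Labels "io.rancher.container.ip"}}' <container-on-host-B>

# On host A, ping that 10.42.x.x address from inside a managed-network container:
docker exec -it <container-on-host-A> ping -c 3 10.42.x.x
```

If the ping fails in one direction only, that points at the ipsec tunnel between those two hosts rather than at the healthcheck service.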
I am at a loss to explain what is going wrong. Any ideas? Does anyone else see this behavior? It seems to happen only in long-running environments (months old, having survived many upgrades of Rancher and the RancherOS beneath it).
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 16 (2 by maintainers)
@joshuacox when you are registering the host, can you try using `CATTLE_AGENT_IP` so that cattle (rancher/server) doesn't use the KVM host's IP address for both of these nodes? Can you share a screenshot of the Infrastructure -> Hosts view?