rancher: Failed to evaluate networking: failed to find plugin "rancher-bridge" when metadata container's bridge IP is equal to another host's docker0 IP (bip)
Rancher Versions:
- Server: 1.5.2
- healthcheck: 0.2.3
- ipsec: rancher/net:holder
- network-services: rancher/network-manager:v0.5.3
- scheduler: rancher/scheduler:v0.7.5
- kubernetes (if applicable):
Docker Version: 1.12.6
OS and where are the hosts located? (cloud, bare metal, etc): Ubuntu 16.04/Bare Metal/Multiple locations
Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single node rancher, external DB
Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle
Steps to Reproduce:
I have about 100 hosts running 4 containers. On 5-6 of them ipsec runs OK; on every other host the ipsec stack fails (and so does the healthcheck stack) with `Timeout getting ip`. In the network-manager logs I can see:
nsenter: cannot open /proc/26034/ns/ipc: No such file or directory
time="2017-04-01T02:31:01Z" level=error msg="Failed to evaluate network state for 059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
time="2017-04-01T02:31:01Z" level=error msg="Error processing event &docker.APIEvents{Action:\"start\", Type:\"container\", Actor:docker.APIActor{ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", Attributes:map[string]string{\"io.rancher.container.agent_id\":\"58465\", \"io.rancher.project_service.name\":\"healthcheck/healthcheck\", \"io.rancher.service.hash\":\"d0a8fd4061d3b2a8c5782f5563db6df0b25655cb\", \"io.rancher.stack_service.name\":\"healthcheck/healthcheck\", \"io.rancher.cni.network\":\"ipsec\", \"io.rancher.scheduler.global\":\"true\", \"io.rancher.service.requested.host.id\":\"209\", \"io.rancher.stack.name\":\"healthcheck\", \"name\":\"r-healthcheck-healthcheck-28-2576d08e\", \"image\":\"rancher/healthcheck:v0.2.3\", \"io.rancher.container.create_agent\":\"true\", \"io.rancher.container.ip\":\"10.42.112.234/16\", \"io.rancher.container.mac_address\":\"02:4f:71:44:58:1a\", \"io.rancher.service.launch.config\":\"io.rancher.service.primary.launch.config\", \"io.rancher.cni.wait\":\"true\", \"io.rancher.container.system\":\"true\", \"io.rancher.container.uuid\":\"2576d08e-3100-4935-aa34-3ecb73da302a\", \"io.rancher.project.name\":\"healthcheck\", \"io.rancher.service.deployment.unit\":\"8a360a36-5ff6-497f-a7db-92193ecf2415\", \"io.rancher.container.name\":\"healthcheck-healthcheck-28\"}}, Status:\"start\", ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", From:\"rancher/healthcheck:v0.2.3\", Time:1491013861, TimeNano:1491013861288487613}. Error: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input"
We’ve been struggling all day with ipsec issues on one of the nodes after upgrading a rancher environment running 1.4 to 1.5.6 (in other words, we have several nodes where ipsec was working).

After a lot of headscratching and cussing, we got ipsec working following these simple steps (a sketch of the generated script is shown below):

1. Copy `/var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge` from one of the working nodes (we were missing this file on the node with ipsec issues).
2. The copied file uses `nsenter` and needs the PID of the target process it should enter; this needs to be adjusted for each host and each time.
3. Patch the `-t` option to `nsenter` in `/var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge` with the PID shown by `ps ax | grep rancher-cni` (it should be the sidekick tailing the log file).
4. After patching, restart `network-manager` and things start rolling!! 😉

So, in conclusion, it seems like the process responsible for updating this file is failing, for a reason yet unknown to us (haven’t had time to look into that yet).
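For illustration (not part of the original comment): the `rancher-bridge` file in that volume is not the CNI binary itself but a small generated wrapper around `nsenter`, per the `plugin-manager` link later in this thread. A minimal sketch of what such a wrapper could look like, with a placeholder PID and wrapped command:

```sh
#!/bin/sh
# Illustrative sketch only: network-manager generates a wrapper along these
# lines to act as the "rancher-bridge" CNI plugin; the exact contents differ.
# 12345 is a placeholder: it must be the PID of the rancher-cni sidekick
# process on this host (see `ps ax | grep rancher-cni`).
exec nsenter -m -u -i -n -p -t 12345 -- rancher-bridge "$@"
```

If that PID no longer exists (for example after the sidekick restarts), the wrapper fails, which would be consistent with the `nsenter: cannot open /proc/26034/ns/ipc` error in the logs above.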
Just got off call with @fedya … The reason for his problem was different than above.
Summary: Host 1: Running both
rancher/server
andrancher/agent
… And this caused the IP detected for this host in Cattle as172.17.0.1
. Host 2: This is the host having problem "Error … "When running both server and agent, it’s needed to pass
CATTLE_AGENT_IP
as the public IP address of the host while registering the host.Fix:
rancher-agent
container on Host1rancher-agent
again on Host1, but this time passCATTLE_AGENT_IP=PUBLIC_IP_OF_HOST_1
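For illustration (not from the original comment), a sketch of step 2: copy the exact command from the Rancher "Add Host" screen and add the `CATTLE_AGENT_IP` override. The agent version, server URL and registration token below are placeholders:

```sh
# Sketch only: the server URL, token and agent version are placeholders.
# Use the exact command shown in the "Add Host" screen and add the -e flag.
sudo docker rm -f rancher-agent   # container name assumed; check `docker ps -a`
sudo docker run -e CATTLE_AGENT_IP="PUBLIC_IP_OF_HOST_1" --rm --privileged \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.2 http://RANCHER_SERVER:8080/v1/scripts/REGISTRATION_TOKEN
```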
@ciokan Thank you for taking the time to share your setup for debugging.

The error message `Failed to evaluate networking: failed to find plugin "rancher-bridge"` could be due to a variety of problems. One of the reasons is as below.

Summary (TL;DR version): If the IP address of the `docker0` bridge on any host is the same as the IP address assigned to any container in the cluster, especially the metadata container, this problem surfaces.

Long version: All of Rancher’s infra microservices/containers running on any host of the cluster rely on metadata to fetch the information needed to do their jobs.
`rancher-metadata` answers queries in an intelligent fashion, giving out information based on the source IP address of the request, so `169.254.169.250/latest/self/container/` will yield different answers for different containers. If the source IP address lookup fails in `rancher-metadata`, only the parts of the information that are common to all containers on that host are available while querying. For example, only `169.254.169.250/latest/self/host` is available, but not `169.254.169.250/latest/self/container`, as we cannot determine where the request is coming from.
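To illustrate the difference (these queries are not from the original comment, but use the metadata endpoints mentioned elsewhere in this thread):

```sh
# From inside a container on the managed (ipsec) network, the source IP is the
# container's 10.42.x.x address, so the container-specific view resolves:
curl http://169.254.169.250/latest/self/container

# From the host, or from a --net=host container such as network-manager, the
# source IP is the docker0 bridge address, so only host-level info such as
# this is expected to resolve:
curl http://169.254.169.250/latest/self/host/uuid
```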
When a new host is added to Rancher, the first microservice to start is `rancher-metadata`, along with its sidekick `rancher-dns`. This microservice is on docker’s bridge network and gets an IP address from the subnet configured for docker. For example, if the bridge subnet is `172.17.0.0/16`, an IP address in the range `172.17.0.1` to `172.17.255.254` is allocated for this container. In general the first IP address of the subnet, i.e. `172.17.0.1`, is configured on the bridge, and `rancher-metadata` most of the time gets `172.17.0.2`. While debugging @ciokan’s setup, we noticed that some of the `rancher-metadata` containers in the cluster had an IP address of `172.17.0.1`, which was quite unusual, and upon further investigation we found the `docker0` bridge had an IP of `172.17.42.1`.
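To check for this situation on a host (not part of the original comment), one could compare the `docker0` bridge address with the metadata container's address; normally the bridge owns `172.17.0.1`:

```sh
# Sketch: show docker0's IP and the rancher-metadata container's IP on this
# host. The "metadata" name filter is an assumption; adjust it to whatever
# `docker ps` shows for the metadata container in your setup.
ip -4 addr show docker0
docker ps --filter "name=metadata" --format '{{.Names}}' | while read -r name; do
  docker inspect --format "$name -> {{ .NetworkSettings.IPAddress }}" "$name"
done
```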
Now let’s understand how `network-manager` works. The `network-manager` microservice is responsible for setting up the CNI executable and CNI config so that the IPSec stack can start properly. `network-manager` queries `169.254.169.250/latest/containers`, `169.254.169.250/latest/services` and `169.254.169.250/latest/stacks` to figure out what containers, services and stacks are running in the cluster. To figure out the local containers, it relies on the information returned by `169.254.169.250/latest/self/host`. Since `network-manager` runs on the host network (`--net=host`), all of its requests reaching the metadata container have the IP address of the `docker0` bridge as the source IP. This means the source IP seen by metadata is `172.17.0.1` in most setups, and there is no hit for this IP while looking up metadata’s internal info. But in @ciokan’s setup a few containers had this IP address, so the lookup did not miss and metadata returned information that was incorrect in this context. Since there were multiple entries with the same lookup key, the information from metadata was flapping, as observed/mentioned in a previous debugging session with a different user.
Workaround: On hosts where the IP address of the `docker0` bridge is not `172.17.0.1`, change the docker config and restart docker. Here are the steps for Ubuntu:
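The original steps were not captured above; a minimal sketch of what they could look like on Ubuntu 16.04 with Docker 1.12 (the `bip` value is an example and must not collide with any subnet already in use):

```sh
# Sketch only: pin the docker0 bridge to the expected default address via the
# daemon config, then restart Docker. Back up any existing daemon.json first.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "bip": "172.17.0.1/16"
}
EOF
sudo systemctl restart docker
```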
Why does the `docker0` bridge have this IP address? Maybe the host was running an older version of docker, when the default IP used for the bridge was `172.17.42.1`, and the docker version was later upgraded? Not sure. https://github.com/moby/moby/issues/17305

Had the time to look around @ciokan's systems, thanks a lot for that. Great to get an insight into the actual running problem. As all the logs hint at a network problem, the root cause seems to be metadata returning inconsistent results: querying `169.254.169.250/latest/self/host/uuid` would return different results, and this output is actually used to set up networking. We have a lead and are investigating why this is happening.

@superseb @tpgxyz I have just solved the problem in our environment.
In our case the problem was that we have 3 NICs per host, and when I was starting the agent it auto-detected the wrong agent host IP (the IP address of the local docker network), which was not accessible from the server and the other hosts. The solution that works for me is simply to provide, in the 4th step of "Add host", an IP address that is accessible from the server and the other hosts.

In other words, you need to manually provide the CATTLE_AGENT_IP env variable when you start an agent. That also means you need to do it even for a host (agent) that runs on the same machine as the server.
sudo docker run -e CATTLE_AGENT_IP="192.168.0.30" --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.2 http://192.168.0.87:7000/v1/scripts/***
After this fix the ipsec, healthcheck and cni errors disappear.
Tested on: 1.5.1 and 1.5.10
@pasikarkkainen yes, I tried with the IP of the client running the agent. Same error: `Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]`. The Docker version doesn’t matter, as I see the same error on different hosts running the same engine version. The Rancher server picks up the correct agent IP address, so I don’t believe `CATTLE_AGENT_IP` has anything to do with it (tried and tested). I managed to repair 80% of the hosts using my image above, but the rest of them refuse to work properly even with it.

I wouldn’t be bothered by this bug, as it’s not affecting my services, but with it present my rancher server stays at 100% on all cores (8 of them). If I deactivate the clients with the bug, my rancher server CPU goes down to a normal 10% - 20%. With those clients deactivated in rancher server, services work OK, but if a client gets restarted, it doesn’t start any more.
This is the second most-discussed ticket on this project and we still don’t have a clue, roadmap, timeline or official answer regarding this bug. Everyone involved here provided everything requested; I even collaborated with the dev team and provided SSH access to multiple boxes, including the rancher server. The culprit seems to have been tracked down, but after that… silence.

We need an official answer, as I don’t have a single clue how long we’re supposed to keep waiting. If it’s 1 year, I can re-route my team and build something custom. If it’s 1-2 months, we can probably wait, but the lack of feedback generates a lot of frustration within our office and, I’m sure, within other teams affected by this.

I can see over 1600 open tickets, which must be overwhelming. I’m happy to pay for support knowing it will get this bug fixed. I’m sure others are holding cash in hand for this.

TL;DR: Give us something. We’re barking at trees here.
@snipking I’m investigating a bit here; there are no binaries in the /opt/cni/bin dir in rancher/network-manager:v0.6.6.

And all of my servers today show messages like the ones in the first post.
I had the same problem, and it came from the fact that a machine was used both as a server and as an agent, so in the Hosts tab it was detected with a wrong IP, because it resolved it locally instead of resolving the IP in an external way (e.g. 172.17.0.1), which made the machine inaccessible from anywhere.

I stopped the agent where the IP was 172.17.0.1 and ran `docker run` for the agent again with the correct IP. You may have to modify the line a bit if your configuration is different.
Everything works fine now and the right IP shows on the Hosts tab.
LATE EDIT:
Actually, the “Add Host” page now tells you that you’ll have a problem if you install the server and the host on the same machine.
You can specify the IP of your server in the input.
I had the same issue yesterday. I have two hosts, and one of them is on the same machine as my Rancher server. The node I created on the same PC as the server worked well, but the other one was giving me the rancher-bridge error. I resolved this by removing both hosts and reinstalling them, but first I installed the host from the other machine and then the other one (the one on the same machine as the server).
Woah… after all this time… my stack is green. I have to thank @leodotcloud and @superseb for taking the time to get to the bottom of it. From the looks of it (leo might step in with more updates), the docker daemon was running on a different IP address than expected, an old one actually. They managed to fix my setup; I can’t give a fix right now because I only have bits of it, but they fixed about 12 - 15 machines, and I’m sure the team now has an idea and can push a fix soon. It’s definitely a hard thing to reproduce in their testing setup; that’s why the long wait over here.
Thank you guys!
@fedya: in your command… “CATTLE_AGENT_IP=rancher_server_ip_here” … that doesn’t sound right. CATTLE_AGENT_IP should be the IP of the docker host, not the IP of the rancher server… or did I miss something?
@superseb How can I confirm https://github.com/rancher/rancher/issues/8383#issuecomment-305400907? Can you please share a step-by-step procedure with me?
Let me explain how things work so maybe people can troubleshoot this better. The “network-manager” container looks at metadata for services with kind “networkDriverService” and then looks for the local container for that service. Once it finds the container, it writes a small nsenter shell script that works as the CNI plugin. The code is here, and it’s pretty straightforward: https://github.com/rancher/plugin-manager/blob/master/binexec/watcher.go. So a couple of things could go wrong.
Please run `curl -H "Accept: application/json" http://169.254.169.250/latest` and then send me the output at darren@rancher.com. I’m guessing there could be an issue if there is some crufty data around that is causing problems. Sorry this issue went unnoticed for so long; the couple of weeks before and after DockerCon are pretty hectic for us. Thanks!
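For anyone debugging this locally (not part of the original comment), it may also help to inspect the generated wrapper and confirm that the process it targets is still alive, using the paths and command mentioned earlier in this thread:

```sh
# Sketch: show the generated CNI wrapper and check that the PID it targets
# (the -t argument to nsenter) corresponds to a running rancher-cni sidekick.
cat /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge
ps ax | grep rancher-cni
```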
Tagging @leodotcloud into this ticket. Expert sleuthing, @kaos.