rancher: Failed to evaluate networking: failed to find plugin \"rancher-bridge\" when the metadata container's bridge IP is equal to another host's docker0 IP (bip)

Rancher Versions:

  • Server: 1.5.2
  • healthcheck: 0.2.3
  • ipsec: rancher/net:holder
  • network-services: rancher/network-manager:v0.5.3
  • scheduler: rancher/scheduler:v0.7.5
  • kubernetes (if applicable):

Docker Version: 1.12.6

OS and where are the hosts located? (cloud, bare metal, etc): Ubuntu 16.04/Bare Metal/Multiple locations

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB) Single node rancher, external DB

Environment Type: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce: I have about 100 hosts, each running 4 containers. On 5-6 of them ipsec runs OK; on every other host the ipsec stack (and the healthcheck stack) fails with "Timeout getting IP". In the network-manager logs I can see:

nsenter: cannot open /proc/26034/ns/ipc: No such file or directory
time="2017-04-01T02:31:01Z" level=error msg="Failed to evaluate network state for 059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 
time="2017-04-01T02:31:01Z" level=error msg="Error processing event &docker.APIEvents{Action:\"start\", Type:\"container\", Actor:docker.APIActor{ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", Attributes:map[string]string{\"io.rancher.container.agent_id\":\"58465\", \"io.rancher.project_service.name\":\"healthcheck/healthcheck\", \"io.rancher.service.hash\":\"d0a8fd4061d3b2a8c5782f5563db6df0b25655cb\", \"io.rancher.stack_service.name\":\"healthcheck/healthcheck\", \"io.rancher.cni.network\":\"ipsec\", \"io.rancher.scheduler.global\":\"true\", \"io.rancher.service.requested.host.id\":\"209\", \"io.rancher.stack.name\":\"healthcheck\", \"name\":\"r-healthcheck-healthcheck-28-2576d08e\", \"image\":\"rancher/healthcheck:v0.2.3\", \"io.rancher.container.create_agent\":\"true\", \"io.rancher.container.ip\":\"10.42.112.234/16\", \"io.rancher.container.mac_address\":\"02:4f:71:44:58:1a\", \"io.rancher.service.launch.config\":\"io.rancher.service.primary.launch.config\", \"io.rancher.cni.wait\":\"true\", \"io.rancher.container.system\":\"true\", \"io.rancher.container.uuid\":\"2576d08e-3100-4935-aa34-3ecb73da302a\", \"io.rancher.project.name\":\"healthcheck\", \"io.rancher.service.deployment.unit\":\"8a360a36-5ff6-497f-a7db-92193ecf2415\", \"io.rancher.container.name\":\"healthcheck-healthcheck-28\"}}, Status:\"start\", ID:\"059fae587888cf4a80bc7020e0ab684927783790703d73591edf07fdf0f6e769\", From:\"rancher/healthcheck:v0.2.3\", Time:1491013861, TimeNano:1491013861288487613}. Error: netplugin failed but error parsing its diagnostic message \"\": unexpected end of JSON input" 

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 11
  • Comments: 102 (10 by maintainers)

Most upvoted comments

We’ve been struggling all day with ipsec issues on one of the nodes after upgrading a rancher environment from 1.4 to 1.5.6 (in other words, we have several other nodes where ipsec is still working).

After a lot of headscratching and cussing, we got ipsec working following these simple steps:

  • copy /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge from one of the working nodes (this file was missing on the node with ipsec issues).
  • the file is tiny, a one-line script calling nsenter, and needs the PID of the target process it should enter. This has to be adjusted for each host, and again whenever that process restarts.
  • so, replace the PID passed to nsenter's -t option with the PID shown by ps ax | grep rancher-cni (it should be the sidekick tailing the log file):
[root@xxx bin]# ps ax | grep rancher-cni
2754 ?        Ss     0:00 tail ---disable-inotify -F /var/log/rancher-cni.log
  • contents of /var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge after patching:
[root@xxx bin]# cat rancher-bridge 
#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t 2754 -- $0 "$@"
  • now, restart network-manager and things start rolling!! 😉

So, in conclusion, it seems like the process responsible for updating this file is failing for some reason that is still unknown to us (we haven’t had time to look into it yet).
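For convenience, here is a minimal sketch of that workaround as a script, assuming the rancher-cni-driver volume path and the log-tailing sidekick shown above; treat it as an illustration, not an official fix:

#!/bin/sh
# Regenerate the rancher-bridge wrapper with the current sidekick PID (sketch).
BIN=/var/lib/docker/volumes/rancher-cni-driver/_data/bin/rancher-bridge

# Find the PID of the sidekick tailing /var/log/rancher-cni.log.
PID=$(ps ax | grep '[r]ancher-cni.log' | awk '{print $1}' | head -n1)
[ -n "$PID" ] || { echo "rancher-cni sidekick not found" >&2; exit 1; }

# Rewrite the wrapper so the CNI call runs inside that process's namespaces,
# mirroring the hand-patched file shown above.
cat > "$BIN" << EOF
#!/bin/sh
exec /usr/bin/nsenter -m -u -i -n -p -t $PID -- \$0 "\$@"
EOF
chmod +x "$BIN"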

Just got off a call with @fedya … The reason for his problem was different from the one above.

Summary:

  • Host 1: running both rancher/server and rancher/agent … this caused Cattle to detect this host's IP as 172.17.0.1.
  • Host 2: the host showing the "Error … " messages.

When running both server and agent on the same host, you need to pass CATTLE_AGENT_IP set to the public IP address of the host while registering it.

Fix:

  • Delete rancher-agent container on Host1
  • Start rancher-agent again on Host1, but this time pass CATTLE_AGENT_IP=PUBLIC_IP_OF_HOST_1
  • Things started fine on both Host 1 and Host 2.
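Roughly, the fix looks like the sketch below; PUBLIC_IP_OF_HOST_1, the agent container name, the image tag and the registration URL are placeholders (copy the real command from the "Add Host" page of your Rancher server and only add CATTLE_AGENT_IP):

# Sketch: re-register the agent on Host1 with an explicit public IP.
docker rm -f rancher-agent
docker run --rm --privileged \
  -e CATTLE_AGENT_IP="PUBLIC_IP_OF_HOST_1" \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.2 http://RANCHER_SERVER:8080/v1/scripts/REGISTRATION_TOKEN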

@ciokan Thank you for taking time to share your setup for debugging.

The error message Failed to evaluate networking: failed to find plugin \"rancher-bridge\" could be due to a variety of problems. One of them is described below:

Summary (TL;DR version): if the IP address of the docker0 bridge on any host is the same as the IP address assigned to any container in the cluster, especially the metadata container, this problem surfaces.

Long version: All of Rancher’s infra microservices/containers running on any host of the cluster rely on metadata to fetch the information they need to do their jobs. rancher-metadata answers queries in an intelligent fashion, giving out information based on the source IP address of the request, so 169.254.169.250/latest/self/container/ will yield different answers for different containers. If the source IP address lookup fails in rancher-metadata, only the parts of the information that are common to all containers on that host are available. For example, only 169.254.169.250/latest/self/host is available, but not 169.254.169.250/latest/self/container, because metadata cannot determine where the request is coming from.
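You can see this per-caller behaviour by querying the metadata service directly from inside a container; a small sketch (the Accept header matches the curl command quoted later in this thread):

# Answered for any caller on the host:
curl -s -H 'Accept: application/json' http://169.254.169.250/latest/self/host
# Only answered when metadata can resolve the caller from the source IP:
curl -s -H 'Accept: application/json' http://169.254.169.250/latest/self/container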

When a new host is added to rancher, the first microservice to start would be rancher-metadata along with its sidekick rancher-dns. This microservice is on docker’s bridge network and gets an IP address from the subnet configured for docker. Example: if the bridge subnet is 172.17.0.0/16, an IP address in the range 172.17.0.1 to 172.17.255.254 is allocated for this container. In general the first IP address of the subnet, i.e., 172.17.0.1, is configured on the bridge itself, and rancher-metadata most of the time gets 172.17.0.2. While debugging @ciokan’s setup, we noticed that some of the rancher-metadata containers in the cluster had an IP address of 172.17.0.1, which was quite unusual, and upon further investigation we found that the docker0 bridge had the IP 172.17.42.1.

Now let’s understand how network-manager works. The microservice network-manager is responsible for setting up the CNI executable and CNI config so that the IPSec stack can start properly. network-manager queries 169.254.169.250/latest/containers, 169.254.169.250/latest/services and 169.254.169.250/latest/stacks to figure out what containers, services and stacks are running in the cluster. To figure out the local containers, it relies on the information returned by 169.254.169.250/latest/self/host.

Since network-manager runs on the host network (--net=host), all of its requests reaching the metadata container have the IP address of the docker0 bridge as the source IP. This means the source IP seen by metadata is 172.17.0.1 in most cases, and there is normally no hit for this IP when looking up metadata’s internal info. But in @ciokan’s setup there were a few containers with this IP address, so the lookup didn’t miss, and metadata returned information that was incorrect in this context. Since there were multiple entries with the same lookup key, the information from metadata was flapping, as observed/mentioned in a previous debugging session with a different user.
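A quick way to check for this clash on a suspect host is to compare the docker0 address with the bridge-network IPs of the running containers; a diagnostic sketch, with no particular container names assumed:

ip -4 addr show docker0
# Any container printing the same address as docker0 indicates the clash described above
# (containers on the host network simply print an empty address):
docker ps -q | xargs docker inspect -f '{{ .Name }} {{ .NetworkSettings.IPAddress }}'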

Workaround: On hosts where the IP address of the docker0 bridge is not 172.17.0.1, change the docker config and restart docker.

Here are the steps for Ubuntu:

cat > /etc/docker/daemon.json << EOF
{
    "bip": "172.17.0.1/16",
    "fixed-cidr": "172.17.0.0/16"
}
EOF

service docker stop
rm -rf /var/run/docker.sock
ps -ef | grep docker 
# make sure none of the old docker processes are running.
# if yes do kill -9 ____
service docker start
docker ps
ip addr show docker0

Why does the docker0 bridge have this IP address? Maybe the host was running an older version of docker, when the default IP used for the bridge was 172.17.42.1, and the docker version was upgraded later? Not sure. https://github.com/moby/moby/issues/17305

Had the time to look around @ciokan’s systems, thanks a lot for that. Great to get an insight into the actual running problem. As all the logs hint at a network problem, the root cause seems to be metadata returning inconsistent results. So querying 169.254.169.250/latest/self/host/uuid would return different results, and this output is actually used to set up networking. We have a lead and are investigating why this is happening.

@superseb @tpgxyz I have just solved the problem on our environment.

In our case the problem was that we have 3 NICs per host, and when I started the agent it auto-detected the wrong agent host IP (an IP address on a local docker network) which was not accessible from the server and the other hosts. The solution that works for me is simply providing an IP address in step 4 of “Add Host” that is accessible from the server and the other hosts.

In other words, you need to manually provide the CATTLE_AGENT_IP env variable when you start an agent. It also means that you need to do it even for an agent that runs on the same host as the server.

sudo docker run -e CATTLE_AGENT_IP="192.168.0.30" --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.2 http://192.168.0.87:7000/v1/scripts/***

After this fix, the ipsec, healthcheck and cni errors disappear.

Tested on: 1.5.1 and 1.5.10

@pasikarkkainen yes, I tried with the IP of the client running the agent. Same error: Couldn't bring up network: failed to find plugin "rancher-bridge" in path [/opt/cni/bin /var/lib/cni/bin /usr/local/sbin /usr/sbin /sbin /usr/local/bin /usr/bin /bin]. The Docker version doesn’t matter, as I see the same error on different hosts running the same engine version. Rancher server picks up the correct agent IP address, so I don’t believe CATTLE_AGENT_IP has anything to do with it (tried and tested). I managed to repair 80% of the hosts using my above image, but the rest of them refuse to work properly even with it.
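A quick check on an affected host is whether the rancher-bridge wrapper from the workaround earlier in this thread exists at all; a diagnostic sketch using the volume path quoted above:

ls -la /var/lib/docker/volumes/rancher-cni-driver/_data/bin/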

I wouldn’t be bothered by this bug as it’s not affecting my services but, with it present, my rancher server stays at 100% on all cores (8 of them). If I deactivate the clients with the bug present, my rancher server CPU goes down to a normal 10% - 20%. With the clients deactivated in rancher server, services work OK, but if a client gets restarted they don’t start any more.

This is the second “most discussed” ticket on this project and we still don’t have a clue, roadmap, timeline or official answer regarding this bug. Everyone involved here provided everything requested, I even collaborated with the dev team and provided SSH access to multiple boxes including rancher server. The culprit seems to be tracked but after that…silence.

We need an official answer as I don’t have a single clue regarding how much we’re supposed to keep waiting. If it’s 1 year I can re-route my team and build something custom. If it’s 1-2 months we can probably wait but the lack of feedback generates a lot of frustration within our office and, I’m sure, other teams affected by this thing.

I can see over 1600 open tickets, which must be overwhelming. I’m happy to pay for support knowing this will get the bug fixed. I’m sure others are holding cash in hand for this.

TL; DR: Give us something. We’re barking at trees here.

@snipking I’m investigating a bit here; there are no binaries in the /opt/cni/bin dir in rancher/network-manager:v0.6.6.

And all of my servers today show messages like the ones in the first message.
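One way to check what a given image ships is to override the entrypoint just for inspection; a sketch, assuming the image contains a usable ls:

# List the CNI plugin directory inside the image without starting the service:
docker run --rm --entrypoint ls rancher/network-manager:v0.6.6 -la /opt/cni/bin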

I had the same problem, and it came from the fact that a machine was used both as a server and as an agent, so in the Hosts tab it was detected with a wrong IP, because the IP was resolved locally instead of in an external way (e.g. 172.17.0.1). This made the machine inaccessible from anywhere.

I stopped the agent where the IP was 172.17.0.1 and ran the agent again with:

-e CATTLE_AGENT_IP="$(ifconfig eth0 | grep 'inet ' | cut -d ' ' -f10 | awk '{ print $1}')"

You may have to modify the line a bit if your configuration is different.
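A slightly more robust variant that does not depend on ifconfig's column layout; a sketch assuming the iproute2 tools are installed (replace eth0 with your NIC):

-e CATTLE_AGENT_IP="$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1 | head -n1)"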

Everything works fine now and the right IP shows on the Hosts tab.

LATE EDIT:

Actually, the “Add Host” page now tells you that you’ll have a problem if you install the server and the host on the same machine.

You can specify the IP of your server in the input.

I had the same issue yesterday. I have two hosts, and one of them is on the same machine as my Rancher server. The host I created on the same machine as the server worked well, but the other one was giving me the rancher-bridge error. I resolved this by removing both hosts and reinstalling them, but this time I installed the host on the other machine first and then the one on the same machine as the server.

woah… after all this time… my stack is green. I have to thank @leodotcloud and @superseb for taking the time to get to the bottom of it. From the looks of it (leo might step in with more updates), the docker daemon was running on a different IP address than expected, an old one actually. They managed to fix my setup; I can’t give a fix right now because I only have bits of it, but they fixed about 12 - 15 machines and I’m sure the team now has an idea and can push a fix soon. It’s definitely a hard thing to reproduce in their testing setup, which is why the long wait over here.

Thank you guys!

@fedya: in your command… “CATTLE_AGENT_IP=rancher_server_ip_here” … that doesn’t sound right. CATTLE_AGENT_IP should be the IP of the docker host, not the IP of the rancher server … or did I miss something?

@superseb How can I confirm https://github.com/rancher/rancher/issues/8383#issuecomment-305400907 ? Can you please share a step-by-step procedure with me?

Let me explain how things work so maybe people can troubleshoot this better. The “network-manager” container looks at metadata for services with kind “networkDriverService” and then looks for the local container belonging to that service. Once it finds the container, it writes a small nsenter shell script that works as the CNI plugin. The code is here and it’s pretty straightforward: https://github.com/rancher/plugin-manager/blob/master/binexec/watcher.go. So a couple of things could go wrong:

  1. Metadata isn’t working: look at the docker logs output of metadata and see if anything is fishy.
  2. Metadata has the wrong info in it: go into the network-manager container and run curl -H "Accept: application/json" http://169.254.169.250/latest. Then send me the output at darren@rancher.com.

I’m guessing there could be some crufty data around that is causing issues. Sorry this issue went unnoticed for so long; the couple of weeks before and after DockerCon are pretty hectic for us. Thanks!
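Based on that flow, a quick check from inside the network-manager container is whether metadata actually lists a service of that kind; a sketch, assuming the kind name appears verbatim in the JSON:

curl -s -H 'Accept: application/json' http://169.254.169.250/latest/services | grep -i networkDriverService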

tagging @leodotcloud into this ticket. Expert sleuthing @kaos