rancher: 'Timeout getting IP address' error on healthcheck and ipsec containers in v1.2.0

Rancher Version: 1.2.0

Docker Version: 1.12.1

OS and where are the hosts located? (cloud, bare metal, etc): cloud

Setup Details: single node rancher, external DB

Environment Type: Cattle

Upgraded to Rancher 1.2.0 today after seeing it was in general release. One server has upgraded perfectly well, but a second server is continually struggling to start up a ‘healthcheck’ container as well as an ‘ipsec’ container. It also fails to start up a customized ‘logrotate’ container that we use. All three failing containers report the same error:

Timeout getting IP address

Nothing particularly useful coming from the rancher-server logs:

2016-12-01 16:32:44,719 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195->instanceHostMap:29323] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-29] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=854]: Timeout getting IP address
2016-12-01 16:32:44,719 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195] [instance.start->(InstanceStart)] [] [cutorService-29] [i.c.p.process.instance.InstanceStart] Failed [1/2] to Starting for instance [66195]
2016-12-01 16:32:44,784 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195->instanceHostMap:29323] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-29] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=854]: Timeout getting IP address
2016-12-01 16:32:44,784 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195] [instance.start->(InstanceStart)] [] [cutorService-29] [i.c.p.process.instance.InstanceStart] Failed [2/2] to Starting for instance [66195]
2016-12-01 16:32:46,323 ERROR [:] [] [] [] [cutorService-15] [o.a.c.m.context.NoExceptionRunnable ] Expected state running but got stopped: Timeout getting IP address

Could you please advise on anything we can do to fix the issue or further diagnose?

Many thanks in advance.

NOTE: FOR THOSE RUNNING BOOT2DOCKER If you are running the network-services stack with plugin-manager:v0.2.12, please upgrade to the newest version of network services; there will be an “Upgrade Available” button to upgrade to network-manager:v0.2.13.

NOTE: v1.2.0 only works when the docker bridge is docker0 (https://github.com/rancher/rancher/issues/6896) and docker is installed in /var/lib/docker (https://github.com/rancher/rancher/issues/6897). We will be making this more flexible in the next release.
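A quick way to sanity-check both requirements on a host (a rough sketch, assuming a standard Linux host with the docker CLI and iproute2 available):

# the default bridge should exist and be named docker0
ip link show docker0
# Docker's root directory should be /var/lib/docker, and not a symlink pointing elsewhere
docker info 2>/dev/null | grep 'Docker Root Dir'
readlink -f /var/lib/docker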

NOTE: If neither of the above solutions fixes your issue

Can you please run the following commands with the CLI and share the files?

mkdir -p support
rancher hosts -a > support/hosts
rancher logs --tail=-1 ipsec/ipsec > support/ipsec 2>&1
rancher logs --tail=-1 network-services/metadata > support/metadata 2>&1
rancher logs --tail=-1 network-services/network-manager > support/network-manager 2>&1

Please be advised that the files will contain all of your hosts and IPs.

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 26
  • Comments: 117 (11 by maintainers)

Most upvoted comments

Same issue for me. It’s happening on multiple servers and it brought down two of my sites. A general release should not bring downtime like this…

Update

A lot of people are experiencing issues with Rancher in setups where a caching DNS resolver is used on the host. @giovannicandido pointed out that this is the case with Ubuntu. The root cause is basically what I believe is a bug in Docker, in which Docker does not set the DNS nameserver properly when only the DnsSearch option is used. For further reference I’ve created this bug against Docker: https://github.com/docker/docker/issues/29815
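To see whether a container on an affected host was created with only DnsSearch set (and no explicit DNS servers), docker inspect is enough (a sketch; <container> stands for whichever Rancher-managed container you want to check):

# if the bug described above is in play, Dns prints as empty while DnsSearch is populated
docker inspect -f 'Dns={{.HostConfig.Dns}} DnsSearch={{.HostConfig.DnsSearch}}' <container>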

In Rancher 1.3 (due out this week) we are putting in a workaround to deal with this behavior in Docker. For Rancher 1.2.x, the best approach is to set --dns on the docker daemon command line or set the following in /etc/docker/daemon.json:

{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
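If you prefer the --dns daemon flag, a systemd drop-in is one way to pass it (a sketch only; it assumes a systemd-managed docker-engine install with dockerd at /usr/bin/dockerd, so adjust the ExecStart line to match your unit, and use either this or the daemon.json entry above, not both). Restart the daemon either way for the change to take effect.

mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/dns.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --dns 8.8.8.8 --dns 8.8.4.4
EOF
systemctl daemon-reload
systemctl restart docker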

Sorry for the troubles here. We do test quite extensively on Ubuntu, but not with a configuration that has a local dnsmasq resolver, and as such we didn’t catch this one.

@samattridge A workaround I found to solve the issue related to /var/lib/docker is to use a mount point instead of a symlink. For example, stop the docker service and then:

rm -f /var/lib/docker && mkdir -p /var/lib/docker
echo "</path/src>        /var/lib/docker         auto    rw,bind        0       0" >> /etc/fstab
mount -a

then start the docker service again.

Hope that helps!
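To verify the bind mount actually took effect after mount -a (a quick check, assuming util-linux's findmnt is present; mount | grep /var/lib/docker works too):

findmnt /var/lib/docker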

When I add a host in a new environment it takes an enormous amount of time before all services turn green. It’s constantly restarting everything (and I mean everything except the proxy and kubelet services) because of “Timeout getting IP address”. After like 10 minutes everything finally seems to be running.

After that I tried adding a deployment from the UI, which deployed a pod that kept being destroyed / removed / restarted, without ever starting a container with my actual application.

Before all that, I upgraded from 1.1.4 and my existing Kubernetes environment got stuck at 7 of 9 services, with all etcd services stuck in initialization + a ton of red dots on my screen.

Also very annoying: once my environment is up and running it randomly redirects me to the “add your first stack” page now and then, which is by the way a broken page; the “Add from catalog” button is broken…

In my opinion, Rancher 1.2 should not have been released as a stable release yet. We are talking about a piece of software that people use to manage their infrastructure. I’m running a very simple environment with only 3 hosts and no more than 20 containers including k8s stack, yet it got completely wrecked by this update.

To anyone that hasn’t upgraded yet, I would advise waiting for a more stable release.

There is an incredible number of issues with Rancher 1.2.x, and we’ve burned a huge amount of time dealing with them. It is clearly not ready for production. You guys need to revert the :stable tag to something stable. You are screwing more and more people over.

Still happens with rancher/server: 1.5.2 and rancher/agent: 1.2.1.

It is something related to the environment (as in Cattle environment): creating a new environment and adding hosts to it works just fine. And there’s more: once the host has been added, it can be deleted from the fresh environment and it will then work just fine in the ‘broken’ environment.

I reproduced this quite a few times, taking snapshots of the database in the process.

I’ve been seeing the “Timeout getting IP address” issue off and on; some containers schedule, some don’t. After looking at the network-services/metadata logs, I saw that about half of them had the following entries near the end (IPs/hostnames changed):

network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Loading answers"
network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Loaded answers"
network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Applied https://rancher.acme.com/v1/configcontent/metadata-answers?version=15755-8b84aebc80dd7a28cb0d0e082d379dd7"
network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Downloaded in 230.423534ms"
network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Loading answers"
network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Loaded answers"
network-services-metadata-9     | time="2017-01-27T14:08:56Z" level=info msg="Applied https://rancher.acme.com/v1/configcontent/metadata-answers?version=15755-8b84aebc80dd7a28cb0d0e082d379dd7"
network-services-metadata-9     | time="2017-01-27T15:33:03Z" level=info msg="Hit websocket pong timeout. Last websocket ping received at 2017-01-27 15:32:50.015981928 +0000 UTC. Closing connection."
network-services-metadata-9     | time="2017-01-27T15:33:03Z" level=warning msg="websocket closed: websocket: close sent"
network-services-metadata-9     | time="2017-01-27T15:33:04Z" level=warning msg="websocket closed: websocket: close sent"
network-services-metadata-9     | time="2017-01-27T15:33:04Z" level=info msg="Initializing event router" workerCount=3
network-services-metadata-9     | time="2017-01-27T15:33:05Z" level=warning msg="websocket closed: websocket: close sent"
network-services-metadata-9     | time="2017-01-27T15:33:34Z" level=error msg="Failed to download and reload metadata: Get https://rancher.acme.com/v1/configcontent/metadata-answers: dial tcp 10.x.x.x:443: i/o timeout"

After the websocket closes, it doesn’t appear to reopen on its own (why the websocket closed in the first place is less alarming to me). I restarted all the metadata containers that had closed the connection, and the containers receiving the “Timeout getting IP address” error finally started.

Is there a reason why the metadata service isn’t either failing and forcing a new metadata container to start or just attempting to reopen the websocket connection?
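In the meantime, one way to bounce the metadata container on an affected host is with plain docker (a sketch; the actual container name is whatever Rancher generated on your host, so list it first):

# find the metadata container on this host
docker ps --format '{{.ID}} {{.Names}}' | grep -i metadata
# restart it, substituting the ID from the previous command
docker restart <container-id>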

Hi, I found a solution to this problem that works on ubuntu server, and desktop as well.

The cause is that Ubuntu ships by default with the DNS server set to the local address 127.0.0.1, and this is by design. The problem is that Docker containers can’t do lookups against 127.0.0.1.

To diagnose, you can follow this procedure:

Look at your /etc/resolv.conf file, then perform a test:

docker run -it ubuntu bash
apt update
apt install dnsutils
# This will not respond
dig @127.0.0.1 your.hostname.com

Note: ping WILL work fine, and could trick you into thinking that name resolution is working. Dig is the proper way to test it.
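A quicker check that doesn't require installing anything is to compare what the host has in resolv.conf with what a fresh container gets (a rough sketch):

# nameservers the host uses
cat /etc/resolv.conf
# nameservers a plain container gets from docker
docker run --rm ubuntu cat /etc/resolv.conf
# if the host only lists a loopback address such as 127.0.0.1, check whether
# the container is left with anything it can actually reach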

You can also use the Rancher CLI to get a hint about the problem:

mkdir -p support
rancher hosts -a > support/hosts
rancher logs --tail=-1 ipsec/ipsec > support/ipsec 2>&1
rancher logs --tail=-1 network-services/metadata > support/metadata 2>&1
rancher logs --tail=-1 network-services/network-manager > support/network-manager 2>&1

The error will look like:

network-services-metadata-1 | time="2016-12-26T23:58:21Z" level=fatal msg="Failed to subscribeGet http://xx.xxx.xxx:xxx/v2-beta: dial tcp: lookup xxx.xxx.xxx.xxx on 127.0.0.1:53: read udp 127.0.0.1:44285->127.0.0.1:53: read: connection refused"

There are two solutions:

1. Configure Ubuntu to use another nameserver, such as Google public DNS (8.8.8.8, 8.8.4.4). I tried this one, and it is by far too complicated for such a simple change; as I said, Ubuntu uses the local resolver by design.
2. Change the Docker DNS server. This worked fine for me. Edit or create the file /etc/docker/daemon.json and add:

{
  "dns": ["8.8.8.8", "8.8.4.4"]
}

Stop the containers and restart the daemon:

docker stop $(docker ps -q)
docker stop $(docker ps -q) # yes, twice :-) rancher will try to restart your dying containers
systemctl restart docker

Oh boy, after days on this problem, I will drink a cold beer 😃

Thank you @deniseschannon, your procedure of using the Rancher CLI to gather logs was what gave me the hint I needed to find the cause.

Question: Is there any problem with using an external DNS with Rancher?

And finally, for those who want solution 1, I’ll leave some links to get started (none of those solutions worked for me 😦):

http://askubuntu.com/questions/143819/how-do-i-configure-my-static-dns-in-interfaces
http://askubuntu.com/questions/327532/why-would-127-0-0-1-in-resolv-conf-cause-problems-in-dns-resolution
http://askubuntu.com/questions/627899/nameserver-127-0-1-1-in-resolv-conf-wont-go-away
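For reference, the usual shape of solution 1 on Ubuntu desktop installs is to stop NetworkManager from spawning its local dnsmasq resolver (a sketch only, based on the links above; it applies to NetworkManager-based setups rather than servers, and this is the host's NetworkManager, not Rancher's network-manager container):

# comment out the dns=dnsmasq line in NetworkManager's config
sed -i 's/^dns=dnsmasq/#dns=dnsmasq/' /etc/NetworkManager/NetworkManager.conf
service network-manager restart
# /etc/resolv.conf should now list real upstream nameservers instead of a loopback address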

I just recreated the full environment (120 hosts) and rancher is still crapping out on the infrastructure services (ipsec, network-services, scheduler) with a ton of errors like this:

2016-12-01 22:32:32,978 ERROR [902d8b92-a493-4a32-b12c-6ff1de043257:4900] [instance:346] [instance.start->(InstanceStart)] [] [cutorService-18] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [346]

As of now, both my sites have been down for 7 hours, and clients are mad. This is NOT what a stable upgrade should look like. After I finish fixing this mess, I’ll look for more stable and serious alternatives

What the f…g problem. It still happens every time after a system upgrade or a Rancher upgrade. Sometimes it happens after rebooting a host. I have used Rancher since the beta versions… it was 0.6, and it is the best open-source GUI for Docker, Kubernetes and other DevOps services. But why can’t you, Rancher team, resolve this problem with your Cattle stack? I don’t want to remove and recreate a new environment on all hosts every time my home server reboots. I want to sleep well and not think about possible failure scenarios, or pray to the Digital God that services and containers start successfully after a power interruption. Can you research this problem and describe all the possible ways to work around this global problem? I hope you decide to do it right.

For anybody else not running b2d, can you please run the following commands with the CLI and share the files?

mkdir -p support
rancher hosts -a > support/hosts
rancher logs --tail=-1 ipsec/ipsec > support/ipsec 2>&1
rancher logs --tail=-1 network-services/metadata > support/metadata 2>&1
rancher logs --tail=-1 network-services/network-manager > support/network-manager 2>&1

Please be advised that the files will contain all of your hosts and IPs.

It appears b2d does not work. To confirm that this is your issue, run ln -s / /mnt/sda1 in the network-manager container. After a bit the ipsec container will come up. We will work on a fix for this.
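If you'd rather run that from the host than from a shell inside the container, something along these lines should work (a sketch; it assumes the generated container name contains "network-manager"):

docker exec $(docker ps -q -f name=network-manager) ln -s / /mnt/sda1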

Just want to add that I have the same issues and that the latest version, 1.2.1, does not solve it for me. I tried it with various servers on Digital Ocean (Ubuntu, CoreOS, etc.) and tried the hints mentioned above. (Sent my logs too.) It did not work for me and I had no choice but to downgrade to 1.1.4. 1.2 is really not a resilient/stable version for me.

Anyone using boot2docker hosts must be running network services v0.0.2, which is running rancher/network-manager:v0.2.13. This will be the new default for new installs, but if you already have Rancher deployed and there is an “Upgrade Available” next to your “Network Services” stack, please upgrade to get your networking working.

For other people having issues, v1.2.0 only works with your docker bridge as docker0 (https://github.com/rancher/rancher/issues/6896) and docker installed in /var/lib/docker (https://github.com/rancher/rancher/issues/6897). We are looking to make this more flexible in our next release.

@tbohnen the ln fix only solves the issue on boot2Docker

@ibuildthecloud ln -s / /mnt/sda1 works! All infrastructure stacks come online now.

I am experiencing the same issue. I can not get the ipsec or scheduler container to start properly. I keep seeing the Timeout error mentioned above.

I am using boot2docker hosts with docker 1.12.1

Has anyone actually been able to get Rancher 1.2.0 to work correctly? I have re-installed everything from scratch and rebuilt my server with a new database, removed all containers from the hosts, etc… basically a clean slate.

I have tried adding anywhere from 1 to 10 hosts. All with the same result…

Theory: it takes around one minute to download and apply the metadata. So in some cases it can take over two minutes until an IP is visible, which runs into the agent’s 120-second timeout.

Some logs:

31.1.2017 20:18:40 time="2017-01-31T19:18:40Z" level=info msg="Applied https://xxx/v1/configcontent/metadata-answers?version=44878-8b84aebc80dd7a28cb0d0e082d379dd7"
31.1.2017 20:19:00 time="2017-01-31T19:19:00Z" level=info msg="Downloaded in 19.787896964s"
31.1.2017 20:19:20 time="2017-01-31T19:19:20Z" level=info msg="Loading answers"
31.1.2017 20:19:21 time="2017-01-31T19:19:21Z" level=info msg="Loaded answers"
31.1.2017 20:19:21 time="2017-01-31T19:19:21Z" level=info msg="Applied https://xxx/v1/configcontent/metadata-answers?version=44922-8b84aebc80dd7a28cb0d0e082d379dd7"
31.1.2017 20:19:45 time="2017-01-31T19:19:45Z" level=info msg="Downloaded in 23.568940433s"
31.1.2017 20:20:02 time="2017-01-31T19:20:02Z" level=info msg="Loading answers"
31.1.2017 20:20:03 time="2017-01-31T19:20:03Z" level=info msg="Loaded answers"
31.1.2017 20:20:03 time="2017-01-31T19:20:03Z" level=info msg="Applied https://xxx/v1/configcontent/metadata-answers?version=44958-8b84aebc80dd7a28cb0d0e082d379dd7"
31.1.2017 20:20:22 time="2017-01-31T19:20:22Z" level=info msg="Downloaded in 18.835192201s"
31.1.2017 20:20:42 time="2017-01-31T19:20:42Z" level=info msg="Loading answers"
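If you suspect the same thing on your hosts, the download timings are easy to pull straight out of the metadata container's logs (a rough check; locate the container with docker ps first, since its generated name varies):

# how long recent metadata downloads have been taking
docker logs --tail 500 <metadata-container> 2>&1 | grep "Downloaded in"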

This has happened to me about once every two days for the last couple of weeks. Restarting the node helps, but I found out that restarting network-manager and dns also helps.

@DomiStyle I’ve been having the exact same problem with Rancher 1.3.0 and 1.3.3. I finally tried your approach and started looking at DNS traffic with tcpdump, and my failing host did indeed look up rancher-metadata in all the search domains of the host, until it got to the public company domain, which has a * record that resolves to an external website.

My solution was to remove the internet-only domain from /etc/resolv.conf and that caused ipsec to start; now I just have healthcheck stuck in initializing.

I really wish the internal names looked up by services such as ipsec would be changed to use the FQDN ending in .internal. so we can avoid both spamming the external DNS with useless requests and this problem in particular.
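For anyone who wants to reproduce that observation, watching DNS traffic on the host while the ipsec container starts is enough (a sketch; requires tcpdump on the host):

# watch outgoing DNS queries and look for rancher-metadata being retried
# against each of the host's search domains
tcpdump -i any -n port 53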

I tested with a clean DB (because of #7245) and no errors happen.

@jwhitcraft Thanks, I also realize that while -v /etc/resolv.conf:/etc/resolv.conf has a really bad behavior in Docker 1.12, it won’t exactly cause the issue you are seeing.

@jwhitcraft Looking at your answers.json I can see that rancher-dns is in fact recursing to itself “169.254.169.250”. That is basically what is causing your issue. Now why it’s configured as such, no clue. Let me come up with some theories.

@jwhitcraft

mkdir -p support
rancher hosts -a > support/hosts
rancher logs --tail=-1 ipsec/ipsec > support/ipsec 2>&1
rancher logs --tail=-1 network-services/metadata > support/metadata 2>&1
rancher logs --tail=-1 network-services/network-manager > support/network-manager 2>&1

This issue kept several of our bare-metal Ubuntu 16.04 servers off network for over a week.

A work-around for me was:

  1. Remove any reference to rancher (kill and remove containers), and umount /var/lib/rancher/volumes
  2. rm -rf /var/lib/rancher
  3. apt-get purge docker-engine
  4. rm -rf /var/lib/docker
  5. apt-get install docker-engine
  6. install rancher-agent (however you choose).
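Roughly, as shell (a sketch of the steps above only; it assumes an Ubuntu host using the docker-engine package, and you still need to re-run the rancher-agent registration command from the Rancher UI at the end):

# steps 1-2: remove rancher containers and state
docker ps -a --format '{{.ID}} {{.Names}}' | grep -i rancher   # identify the rancher containers first
docker rm -f <rancher-container-ids>
umount /var/lib/rancher/volumes
rm -rf /var/lib/rancher
# steps 3-5: reinstall docker with a fresh /var/lib/docker
apt-get purge -y docker-engine
rm -rf /var/lib/docker
apt-get install -y docker-engine
# step 6: re-register the host with the rancher-agent command from the UI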


@cjellick I’ll try and sort out the logs this morning. Something that may help you diagnose the issue is that I just read issue #6897 regarding changing the docker location from the default of /var/lib/docker. On the machine that is seeing the error I have symlinked that location so we can run it on another drive without changing too much configuration. I’m assuming this could be the reason why I’m seeing this issue in particular, but it may not be the answer for others.

Unfortunately, we don’t have much of an option other than to change the location of docker, since we’re using Azure and want to use a secondary SSD-backed drive for the containers.

@ibuildthecloud

ipsec log turned out empty.

metadata is pretty much only logging this over and over again:

network-services-metadata-1     | time="2016-12-01T22:35:12Z" level=info msg="Downloaded in 543.187019ms"
network-services-metadata-1     | time="2016-12-01T22:35:12Z" level=info msg="Loading answers"
network-services-metadata-1     | time="2016-12-01T22:35:12Z" level=info msg="Loaded answers"
network-services-metadata-1     | time="2016-12-01T22:35:12Z" level=info msg="Applied https://ranchermasters.ops.example.com/v1/configcontent/metadata-answers?version=2649-eb849cc452eab2e95f5d30ac7decee95"

network-manager, however, is logging a bunch of errors, mainly about problems reading resolv.conf and nsenter: cannot open /proc/6899/ns/ipc: No such file or directory

network-manager.txt