rancher: 'Timeout getting IP address' error on healthcheck and ipsec containers in v1.2.0
Rancher Version: 1.2.0
Docker Version: 1.12.1
OS and where are the hosts located? (cloud, bare metal, etc): cloud
Setup Details: single node rancher, external DB
Environment Type: Cattle
Upgraded to Rancher 1.2.0 today after seeing it was in general release. One server has upgraded perfectly well, but a second server is continually struggling to start up a 'healthcheck' container as well as an 'ipsec' container. It additionally fails to start up a customized 'logrotate' container that we use. All three failing containers result in the same error:
Timeout getting IP address
Nothing particularly useful coming from the rancher-server logs:
2016-12-01 16:32:44,719 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195->instanceHostMap:29323] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-29] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=854]: Timeout getting IP address
2016-12-01 16:32:44,719 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195] [instance.start->(InstanceStart)] [] [cutorService-29] [i.c.p.process.instance.InstanceStart] Failed [1/2] to Starting for instance [66195]
2016-12-01 16:32:44,784 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195->instanceHostMap:29323] [instance.start->(InstanceStart)->instancehostmap.activate] [] [cutorService-29] [c.p.e.p.i.DefaultProcessInstanceImpl] Agent error for [compute.instance.activate.reply;agent=854]: Timeout getting IP address
2016-12-01 16:32:44,784 ERROR [7949843f-82d7-486d-a347-97224a8d0415:1411369] [instance:66195] [instance.start->(InstanceStart)] [] [cutorService-29] [i.c.p.process.instance.InstanceStart] Failed [2/2] to Starting for instance [66195]
2016-12-01 16:32:46,323 ERROR [:] [] [] [] [cutorService-15] [o.a.c.m.context.NoExceptionRunnable ] Expected state running but got stopped: Timeout getting IP address
Could you please advise on anything we can do to fix the issue or further diagnose?
Many thanks in advance.
NOTE: FOR THOSE RUNNING BOOT2DOCKER If you are running the network-services stack with plugin-manager:v0.2.12, please upgrade to the newest version of network services; there will be an "Upgrade Available" button to upgrade to network-manager:v0.2.13.
NOTE: v1.2.0 only works with the docker bridge named docker0 (https://github.com/rancher/rancher/issues/6896) and with docker installed in /var/lib/docker (https://github.com/rancher/rancher/issues/6897).
We will be making it more flexible in the next release.
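A quick way to check both constraints on a host (a rough sketch; the docker info output wording may vary slightly between Docker versions):
# confirm the default bridge exists and is named docker0
ip link show docker0
# confirm Docker's data root is /var/lib/docker and is not a symlink elsewhere
docker info 2>/dev/null | grep -i "root dir"
ls -ld /var/lib/docker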
NOTE: If neither of the above solutions fixes your issue, can you please run the following commands with the CLI and share the files?
mkdir -p support
rancher hosts -a > support/hosts
rancher logs --tail=-1 ipsec/ipsec > support/ipsec 2>&1
rancher logs --tail=-1 network-services/metadata > support/metadata 2>&1
rancher logs --tail=-1 network-services/network-manager > support/network-manager 2>&1
Please be advised the files will have all your hosts and IPs in them.
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 26
- Comments: 117 (11 by maintainers)
Same issue for me. It’s happening on multiple servers and it brought down two of my sites. A general release should not bring downtime like this…
Update
A lot of people are experiencing issues with Rancher in setups where a caching DNS resolver is used on the host. @giovannicandido pointed out this is the case with Ubuntu. The root cause of this is basically what I believe is a bug in Docker, in which Docker does not set the DNS nameserver properly when only the DnsSearch option is used. For further reference I've created this bug in Docker: https://github.com/docker/docker/issues/29815

In Rancher 1.3 (due out this week) we are putting in a workaround to deal with this behavior in Docker. For Rancher 1.2.x the best approach is to set --dns on the docker daemon command or to set the below in /etc/docker/daemon.json (see the sketch just below this comment).

Sorry for the troubles here. We do test quite extensively on Ubuntu, but not with a configuration that has a local dnsmasq resolver, and as such we didn't catch this one.
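For reference, a minimal sketch of that daemon.json (the 8.8.8.8/8.8.4.4 addresses are just an example of a public upstream resolver; the equivalent daemon flag would be --dns 8.8.8.8 --dns 8.8.4.4):
# write an explicit upstream resolver into the daemon config
cat > /etc/docker/daemon.json <<'EOF'
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF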
@samattridge A workaround I found to solve the issue related to /var/lib/docker is to use a mount point instead of a symlink. For example:
# stop the docker service
rm -f /var/lib/docker && mkdir -p /var/lib/docker
echo "</path/src> /var/lib/docker auto rw,bind 0 0" >> /etc/fstab
mount -a
# start the docker service
Hope that helps !
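If you go that route, you can verify the bind mount took effect with something like (a sketch):
# /var/lib/docker should now show up as a bind mount point, not a symlink
findmnt /var/lib/docker
mount | grep /var/lib/docker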
When I add a host in a new environment, it takes an enormous amount of time before all services turn green. It's constantly restarting everything (and I mean everything except the proxy and kubelet services) because of "Timeout getting IP address". After about 10 minutes everything finally seems to be running.
After that I tried adding a deployment from the UI, which deployed a pod that kept being destroyed / removed / restarted, without ever starting a container with my actual application.
Before all that, I upgraded from 1.1.4 and my existing Kubernetes environment got stuck at 7 of 9 services, with all etcd services stuck in initialization + a ton of red dots on my screen.
Also very annoying: once my environment is up and running, it redirects me now and then, randomly, to the "add your first stack" page, which is by the way a broken page; the "Add from catalog" button is broken…
In my opinion, Rancher 1.2 should not have been released as a stable release yet. We are talking about a piece of software that people use to manage their infrastructure. I’m running a very simple environment with only 3 hosts and no more than 20 containers including k8s stack, yet it got completely wrecked by this update.
To anyone that hasn't upgraded yet, I would advise waiting for a more stable release.
There are an incredible number of issues with Rancher 1.2.x; we've burned a huge amount of time dealing with this. It is clearly not ready for production. You guys need to revert the :stable tag to something stable. You are screwing more and more people over.
Still happens with rancher/server: 1.5.2 and rancher/agent: 1.2.1.
It is something related to the environment (as in Cattle environment): creating a new environment and adding hosts to it works just fine. And there's more: once the host has been added, it can be deleted from the fresh environment and it then works just fine in the 'broken' environment.
I reproduced this quite a few times, taking snapshots of the database in the process.
I've been seeing the "Timeout getting IP address" issue off and on; some containers schedule, some don't. After looking at the network-services/metadata logs, I saw that about half of them had the following entries near the end (IPs/hostnames changed):

After the websocket closes, it doesn't appear to reopen on its own (why the websocket closed is less concerning to me). I restarted all the metadata containers that had closed the connection, and the containers receiving the "Timeout getting IP address" error finally started.
Is there a reason why the metadata service isn’t either failing and forcing a new metadata container to start or just attempting to reopen the websocket connection?
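For anyone wanting to apply the same manual remediation, it amounts to roughly the following on each affected host (a sketch using plain Docker; the exact name Rancher gives the metadata container may differ):
# find the rancher metadata container whose websocket has gone quiet, then restart it
docker ps --filter name=metadata
docker restart $(docker ps -q --filter name=metadata)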
Hi, I found a solution to this problem that works on Ubuntu server, and desktop as well.
The cause is that Ubuntu ships by default with the DNS server set to the local address 127.0.0.1, and this is by design. The problem is that Docker containers can't do lookups against 127.0.0.1.
To diagnose you can do the follow procedure:
Look at your /etc/resolv.conf file, then perform a test:
Note: ping WILL work fine and could trick you into thinking that name resolution is working; dig is the proper way of checking.
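A sketch of that kind of test (rancher.com is just an example name; any domain that should resolve from your network will do):
# on the host: see which nameservers are configured
cat /etc/resolv.conf
# dig against the default resolver and against an explicit public resolver
dig rancher.com
dig @8.8.8.8 rancher.com
# from inside a container, the lookup fails when the only nameserver is 127.0.0.1
docker run --rm alpine nslookup rancher.com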
You can also use the rancher CLI to get a hint of the problem:
The error will look something like:
network-services-metadata-1 | time="2016-12-26T23:58:21Z" level=fatal msg="Failed to subscribe Get http://xx.xxx.xxx:xxx/v2-beta: dial tcp: lookup xxx.xxx.xxx.xxx on 127.0.0.1:53: read udp 127.0.0.1:44285->127.0.0.1:53: read: connection refused"

There are two solutions:
1 - Configure Ubuntu to use another nameserver such as Google's public DNS (8.8.8.8, 8.8.4.4). I tried this one, and it is by far too complicated for a simple change; as I said, Ubuntu uses the local resolver by design.
2 - Change the Docker DNS server. This worked fine for me. Edit or create the file /etc/docker/daemon.json and put in the line (see the sketch below):
Stop the containers and restart the daemon:
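Putting the two steps together, roughly (the dns entry mirrors the daemon.json snippet shown earlier in the thread; the restart command assumes a systemd-based host):
# add an explicit upstream resolver to the daemon config
cat > /etc/docker/daemon.json <<'EOF'
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
EOF
# stop the running containers, then restart the daemon so it re-reads daemon.json
docker stop $(docker ps -q)
systemctl restart docker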
Oh boy, after days on this problem, I will drink a cold beer 😃
Thank you @deniseschannon, your procedure of using the rancher CLI to generate logs was what gave me the hint necessary to find the cause.
Question: is there some problem with using an external DNS with Rancher?
And finally, for those who want solution 1, here are some links to get started (none of these worked for me 😦):
http://askubuntu.com/questions/143819/how-do-i-configure-my-static-dns-in-interfaces http://askubuntu.com/questions/327532/why-would-127-0-0-1-in-resolv-conf-cause-problems-in-dns-resolution http://askubuntu.com/questions/627899/nameserver-127-0-1-1-in-resolv-conf-wont-go-away
I just recreated the full environment (120 hosts) and rancher is still crapping out on the infrastructure services (ipsec, network-services, scheduler) with a ton of errors like this:
2016-12-01 22:32:32,978 ERROR [902d8b92-a493-4a32-b12c-6ff1de043257:4900] [instance:346] [instance.start->(InstanceStart)] [] [cutorService-18] [i.c.p.process.instance.InstanceStart] Failed to Waiting for deployment unit instances to create for instance [346]

As of now, both my sites have been down for 7 hours, and clients are mad. This is NOT what a stable upgrade should look like. After I finish fixing this mess, I'll look for more stable and serious alternatives.
What the f…g problem. It still happens every time after a system upgrade or a Rancher upgrade, and sometimes after rebooting a host. I have used Rancher since the beta versions… it was 0.6, and it is the best open-source GUI for Docker, Kubernetes and other DevOps services. But why can't you, Rancher team, resolve this problem with your Cattle stack? I don't want to remove and recreate a new environment on all hosts every time my home server reboots. I would like to sleep well and not worry about these situations, or pray to the Digital God for services and containers to start successfully after a power interruption. Can you research this problem and describe all the possible ways to deal with it? I hope that you'll decide to do it right.
For anybody else not running b2d, can you please run the following commands with the CLI and share the files?
Please be advised the files will have all your hosts and IPs in them.
It appears b2d does not work. To confirm that this is your issue, run ln -s / /mnt/sda1 in the network-manager container. After a bit the ipsec container will come up. We will work on a fix for this.

Just want to add that I have the same issues and that the latest 1.2.1 version does not solve it for me. I tried it with various servers on Digital Ocean (Ubuntu, CoreOS, etc.) and tried the hints mentioned above (sent my logs too). It did not work for me and I had no choice but to downgrade to 1.1.4. 1.2 is really not a resilient/stable version for me.
Anyone using boot2docker hosts must be running network services v0.0.2, which is running rancher/network-manager:v0.2.13. This will be the new default for new installs, but if you already have Rancher deployed and there is an “Upgrade Available” next to your “Network Services” stack, please upgrade to get your networking working.
For other people having issues, v1.2.0 only works with your docker bridge as docker0 (https://github.com/rancher/rancher/issues/6896) and with docker installed in /var/lib/docker (https://github.com/rancher/rancher/issues/6897). We are looking to make this more flexible in our next release.

@tbohnen the ln fix only solves the issue on boot2docker.
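For anyone applying that check/fix by hand on a boot2docker host, it amounts to roughly this (a sketch; the name filter is an assumption about how the network-manager container is named on your host):
# create the symlink inside the running network-manager container
docker exec $(docker ps -q --filter name=network-manager) ln -s / /mnt/sda1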
@ibuildthecloud ln -s / /mnt/sda1 works! All infrastructure stacks come online now.

I am experiencing the same issue. I cannot get the ipsec or scheduler container to start properly. I keep seeing the Timeout error mentioned above.
I am using boot2docker hosts with docker 1.12.1
Has anyone actually been able to get Rancher 1.2.0 to work correctly? I have re-installed everything from scratch and rebuilt my server with a new database, removed all containers from hosts, etc… Basically a clean slate.
I have tried adding anywhere from 1 to 10 hosts. All with the same result…
Theory: it takes around one minute to download and apply the metadata, so in some cases it can take over 2 minutes until an IP is visible. This runs into the 120-second timeout of the agent.
Some logs:
31.1.2017 20:18:40 time="2017-01-31T19:18:40Z" level=info msg="Applied https://xxx/v1/configcontent/metadata-answers?version=44878-8b84aebc80dd7a28cb0d0e082d379dd7"
31.1.2017 20:19:00 time="2017-01-31T19:19:00Z" level=info msg="Downloaded in 19.787896964s"
31.1.2017 20:19:20 time="2017-01-31T19:19:20Z" level=info msg="Loading answers"
31.1.2017 20:19:21 time="2017-01-31T19:19:21Z" level=info msg="Loaded answers"
31.1.2017 20:19:21 time="2017-01-31T19:19:21Z" level=info msg="Applied https://xxx/v1/configcontent/metadata-answers?version=44922-8b84aebc80dd7a28cb0d0e082d379dd7"
31.1.2017 20:19:45 time="2017-01-31T19:19:45Z" level=info msg="Downloaded in 23.568940433s"
31.1.2017 20:20:02 time="2017-01-31T19:20:02Z" level=info msg="Loading answers"
31.1.2017 20:20:03 time="2017-01-31T19:20:03Z" level=info msg="Loaded answers"
31.1.2017 20:20:03 time="2017-01-31T19:20:03Z" level=info msg="Applied https://xxx/v1/configcontent/metadata-answers?version=44958-8b84aebc80dd7a28cb0d0e082d379dd7"
31.1.2017 20:20:22 time="2017-01-31T19:20:22Z" level=info msg="Downloaded in 18.835192201s"
31.1.2017 20:20:42 time="2017-01-31T19:20:42Z" level=info msg="Loading answers"

This has happened to me about once every two days for the last couple of weeks. A restart of the node helps, but I found that restarting network-manager and dns also helps.
@DomiStyle I've been having the exact same problem with rancher 1.3.0 and 1.3.3. I finally tried your approach and started looking at DNS traffic with tcpdump, and my failing host did indeed look up rancher-metadata in all the search domains of the host, until it got to the public company domain, which has a * record that resolves to an external website.
My solution was to remove the internet-only domain from /etc/resolv.conf and that caused ipsec to start, now I just have healthcheck stuck in initializing.
I really wish the internal names looked up by services such as ipsec would be changed to use the FQDN ending in .internal, so we can avoid both spamming the external DNS with useless requests and this problem in particular.
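For anyone wanting to reproduce that diagnosis, the rough procedure was (a sketch; domain names are placeholders):
# watch DNS queries leaving the host while the ipsec/healthcheck containers start
tcpdump -i any -n port 53
# if rancher-metadata is being looked up under an external search domain that
# wildcard-resolves, drop that domain from the "search" line in /etc/resolv.conf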
I tested with a clean DB (because of #7245) and no errors happen.
@jwhitcraft Thanks. I also realize that while -v /etc/resolv.conf:/etc/resolv.conf has really bad behavior in Docker 1.12, it won't exactly cause the issue you are seeing.

@jwhitcraft Looking at your answers.json, I can see that rancher-dns is in fact recursing to itself (169.254.169.250). That is basically what is causing your issue. Now why it's configured as such, no clue. Let me come up with some theories.
@jwhitcraft
This issue kept several of our bare-metal Ubuntu 16.04 servers off the network for over a week.
A work-around for me was:
@cjellick I'll try and sort out the logs this morning. Something that may help you diagnose the issue is that I just read issue #6897 regarding changing the docker location from the default of /var/lib/docker. On the machine that is seeing the error I have symbolically linked that location so we can run it on another drive without changing too much configuration. I'm assuming this could be the reason why I'm seeing this issue in particular, but it may not be the answer for others.
Unfortunately, we don't have much of a choice other than to change the location of docker, since we're using Azure and want to use a secondary SSD-backed drive for the containers.
@ibuildthecloud
The ipsec log turned out to be empty.
metadata is pretty much only logging this over and over again:
network-manager, however, is logging a bunch of errors, mainly about problems reading resolv.conf and:

nsenter: cannot open /proc/6899/ns/ipc: No such file or directory

network-manager.txt