rancher: Expected state running but got error: Timeout getting IP address
Rancher versions:
rancher/server: v1.5.6
rancher/agent: v1.2.2
Infrastructure Stack versions:
healthcheck: v0.2.3
ipsec: net:holder
network-services: network-manager service v0.7.4, metadata service v0.9.2
scheduler: v0.7.5
Docker version: (docker version, docker info preferred) 1.12.6
Operating system and kernel: (cat /etc/os-release, uname -r preferred) CoreOS 1298.x, kernel 4.9.9/4.9.16
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) AWS
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) HA Rancher, external DB.
Environment Template: (Cattle/Kubernetes/Swarm/Mesos) Cattle
Steps to Reproduce:
Spin up a new host and bootstrap it with a shell script so that it registers using a VPC-internal IP.
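For context, such a bootstrap typically comes down to running the agent registration command with the agent IP pinned to the instance's VPC address. The sketch below is only an illustration of what that script might look like; the server URL, registration token, and the use of the EC2 metadata endpoint are placeholders and assumptions, not details from this report.

```sh
#!/bin/sh
# Hypothetical bootstrap sketch: register the host with Rancher using the
# VPC-internal address. Server URL and registration token are placeholders.
INTERNAL_IP="$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)"

sudo docker run --rm --privileged \
  -e CATTLE_AGENT_IP="$INTERNAL_IP" \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.2 \
  http://rancher-server:8080/v1/scripts/<registration-token>
```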
Results:
Seeing "Expected state running but got error: Timeout getting IP address" errors on the rancher server (master) for the new host that gets spun up. There is more info and logs; I will dump that in an update.
About this issue
- State: closed
- Created 7 years ago
- Comments: 18
This error can happen if the infrastructure stack version is not compatible with the current version of rancher/server. One way to confirm that is to navigate to Stacks > Infrastructure and check for the message next to “network-services” (Template version not found). Example:
Steps to fix this:
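For reference, the same check can be made from the API instead of the UI by comparing the version the catalog currently serves with the version the stack is pinned to. The sketch below assumes a Rancher v1.6-era server; the endpoint paths, project id, credentials, and the library:infra*network-services catalog id are assumptions, not details from this issue.

```sh
# Rough sketch only -- endpoints, project id (1a5), and catalog id are assumptions.
RANCHER="http://rancher-server:8080"
AUTH="${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}"

# Version the catalog currently considers the default for network-services
curl -s -u "$AUTH" \
  "$RANCHER/v1-catalog/templates/library:infra*network-services" | jq -r '.defaultTemplateVersionId'

# Version the running stack is actually pinned to (the trailing number of
# externalId, e.g. catalog://library:infra*network-services:20)
curl -s -u "$AUTH" \
  "$RANCHER/v2-beta/projects/1a5/stacks?name=network-services" | jq -r '.data[0].externalId'
```

If the two numbers differ (as in the comment further down, 17 vs. 20), the stack is stuck on a template version the catalog no longer serves.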
Thank you for the information @leodotcloud, your reply helped me solve this issue. I think another option for fixing the network-services version problem without touching the API would be:
It worked for me, and you don’t have to make any modifications through the API.
By the way, I’m using Rancher 1.6.7.
In v1.5.0-v1.6.0, the automatic upgrade manager had some faulty logic for deciding when to upgrade network-services. We were required to create this automatic upgrade due to major metadata changes that went into v1.5.0. We corrected the automatic upgrade logic as of v1.6.1. At the same time, in v1.6.1, we moved from using the master branch of the Rancher catalog to a new v1.6-release branch in order to be able to keep updating the network-services stack in Rancher and hopefully prevent any further unintended automatic upgrades on the v1.5.0-v1.6.0 versions.

Unfortunately, last week we accidentally merged a PR into master for 30 minutes. The automatic upgrade of network services only occurs once an hour, so if your upgrade thread was triggered in that half-hour window while the new template was available, you would have ended up automatically upgrading your network-services, then getting automatically downgraded (when we reverted the PR) and stuck in a situation where Rancher doesn’t know which template version you are on (i.e. the grey button on your stack).

Initially, our method of preventing automatic upgrades was to separate the branch used in the Rancher catalog. After the accidental merge, we’ve put additional protection on the master branch, so that any PRs would automatically look like this and it would be much more apparent that they shouldn’t be merged.

Again, we apologize for this accident, but hope the workaround has solved it. If you upgrade to v1.6.2+, you will not incur this problem, as the automatic upgrade was fixed.
So we ran into the same thing on our Rancher 1.5 cluster, and those 5 steps fixed it for us too. defaultTemplateVersionId was @ 17 and our network-services stack had externalId @ 20. We manually updated the network-services externalId down to 17 and “upgraded” the stack to the current version. It actually put the networking container images down a few versions, and it took a really long time over 50+ workers, but was ultimately successful.

@leodotcloud Any idea how a network-services stack can get into this state and how it can be avoided in future? Were some backwards-incompatible changes made to the catalogs / docker image tags or something like that?
It’s confusing, as most of the workers were running fine on those higher versions. New workers, as of about a week or two ago, started failing to bring up any containers (infrastructure or otherwise) with the timeout error.
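For anyone who has to make the same edit, the sketch below shows roughly what that API-side change might look like on a Rancher v1.6-era server. The endpoint path, project and stack ids, and the assumption that externalId can be changed with a plain update are placeholders, not details from this thread; the version number mirrors the one mentioned above.

```sh
# Rough sketch only -- URL, project id (1a5), stack id (1st42), and the
# editability of externalId via a plain PUT are assumptions.
RANCHER="http://rancher-server:8080"
AUTH="${CATTLE_ACCESS_KEY}:${CATTLE_SECRET_KEY}"
STACK="$RANCHER/v2-beta/projects/1a5/stacks/1st42"

# Point the stack back at the template version the catalog actually serves
# (17 in this thread), then run the stack upgrade from the UI or API.
curl -s -u "$AUTH" -X PUT "$STACK" \
  -H 'Content-Type: application/json' \
  -d '{"externalId": "catalog://library:infra*network-services:17"}'
```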