rancher: Expected state running but got error: Timeout getting IP address

Rancher versions: rancher/server: v1.5.6 rancher/agent: v1.2.2

Infrastructure Stack versions: healthcheck: v0.2.3, ipsec: net:holder, network-services: network manager v0.7.4 / metadata v0.9.2, scheduler: v0.7.5

Docker version: (docker version, docker info preferred) 1.12.6

Operating system and kernel: (cat /etc/os-release, uname -r preferred) CoreOS 1298.x, Kernel: 4.9.9/4.9.16

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) AWS

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB) HA Rancher, external DB.

Environment Template: (Cattle/Kubernetes/Swarm/Mesos) Cattle

Steps to Reproduce:

Spin up a new host and bootstrap it with a shell script so that it registers with a VPC-internal IP.
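For context, a minimal sketch of what such a bootstrap script might look like, assuming the standard rancher/agent registration command with CATTLE_AGENT_IP used to pin the host to its VPC-internal address; the server URL and registration token are placeholders, not values from this report.

```
#!/bin/bash
# Sketch of a host bootstrap script (assumed, not the reporter's actual script).
# CATTLE_AGENT_IP makes rancher/agent register the host with a specific IP,
# here the instance's VPC-internal address from the EC2 metadata service.
INTERNAL_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

# <rancher_server:port> and <registration_token> are placeholders; use the exact
# command shown under Infrastructure > Hosts > Add Host in your own setup.
sudo docker run -d --privileged \
  -e CATTLE_AGENT_IP="${INTERNAL_IP}" \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/rancher:/var/lib/rancher \
  rancher/agent:v1.2.2 \
  "http://<rancher_server:port>/v1/scripts/<registration_token>"
```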

Results:

Seeing "Expected state running but got error: Timeout getting IP address" errors on the Rancher server (master) for the new host that gets spun up. There is more info and logs; I will be dumping that in an update.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 18

Most upvoted comments

This error can happen if the infrastructure stack version is not compatible with the current version of rancher/server. One way to confirm this is to navigate to Stacks > Infrastructure and check for the message "Template version not found" next to network-services.

Example: (screenshot omitted)

Steps to fix this:

Step 1:
    - Navigate to http(s)://<rancher_server:port>/v1-catalog/templates

Step 2:
    - Search for network-services and take note of defaultTemplateVersionId.
      Example: "defaultTemplateVersionId": "library:infra*network-services:20"

Step 3:
    - Navigate to http(s)://<rancher_server:port>/
    - Open the affected Environment.
    - Click on Stacks > Infrastructure
    - Click on the menu of the network services stack and click 'View in API'
    - You should see that 'externalId' does not match the 'defaultTemplateVersionId' noted in the previous step.

Step 4:
    - Edit the stack via the API browser, set externalId to the correct template version, then click 'Show Request' followed by 'Send Request' (see the curl sketch below for the equivalent API calls).

Step 5:
    - The button next to the network-services stack should show 'Up to date'.
    - Click on the 'Up to date' version, select the current version and click Upgrade.
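For reference, here is a rough shell sketch of Steps 2 and 4, assuming you have a Rancher API key pair and jq installed. The /v1-catalog/templates path comes from Step 1, but the response field names, the stack URL (taken from 'View in API'), and the exact externalId format are assumptions, so compare them against what your API browser actually shows before sending the request.

```
# Step 2 (sketch): look up the expected template version for network-services.
RANCHER="https://<rancher_server:port>"          # placeholder (http or https)
AUTH="<access_key>:<secret_key>"                 # placeholder API key pair

# Assumes the usual Rancher collection response shape (a "data" array).
curl -s -u "$AUTH" "$RANCHER/v1-catalog/templates" \
  | jq -r '.data[] | select(.id | test("network-services")) | .defaultTemplateVersionId'

# Step 4 (sketch): write the corrected externalId back to the stack.
# STACK_URL is the link shown by 'View in API' on the network-services stack.
# The externalId value must correspond to the version found above, in whatever
# format your API browser already shows (e.g. catalog://library:infra*network-services:20).
STACK_URL="<url_from_view_in_api>"               # placeholder
curl -s -u "$AUTH" -X PUT -H 'Content-Type: application/json' \
  -d '{"externalId": "catalog://library:infra*network-services:20"}' \
  "$STACK_URL"
```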

Thank you for the information @leodotcloud, your reply helped me solve this issue. I think another option to fix the network-services version problem without touching the API would be:

  1. Delete the entire network-services stack (Infrastructure)
  2. Go to the Rancher Catalog
  3. Search for network-services
  4. Select the latest version and deploy

It worked for me, and you don't have to make any modifications using the API.

By the way, I’m using Rancher 1.6.7

In v1.5.0-v1.6.0, the automatic upgrade manager had some faulty logic about when to upgrade network-services. We were required to create this automatic upgrade due to major metadata changes that went into v1.5.0. We corrected the automatic upgrade logic as of v1.6.1. At the same time, in v1.6.1, we moved from using the master branch of the Rancher catalog to a new v1.6-release branch, in order to keep updating the network-services stack in Rancher and hopefully prevent any further unintended automatic upgrades on the v1.5.0-v1.6.0 versions.

Unfortunately, last week, we accidentally merged a PR into master for 30 minutes. The automatic upgrade of network services only runs once an hour, so if that upgrade thread was triggered in the half-hour window when a new template was available, you would have ended up automatically upgrading your network-services, then getting automatically downgraded (when we reverted the PR), and stuck in a situation where Rancher doesn't know what template version you are on (i.e. the grey button on your stack).

Initially, our first method of preventing automatic upgrades was to separate the branch used in the Rancher catalog. But after the accidental merge, we've added extra protection to the master branch so that any such PR automatically looks like this, making it much more apparent that it shouldn't be merged.

(screenshot omitted)

Again, we apologize for this accident, but we hope the workaround has solved it. If you upgrade to v1.6.2+, you will not run into this problem, as the automatic upgrade was fixed.

So we ran into the same thing on our Rancher 1.5 cluster, and those 5 steps fixed it for us too.

defaultTemplateVersionId was at 17 and our network-services stack had externalId at 20. We manually updated the network-services externalId down to 17 and "upgraded" the stack to the current version. It actually moved the networking container images down a few versions, and it took a really long time across 50+ workers, but it was ultimately successful.

@leodotcloud Any idea how a network-services stack can get into this state and how it can be avoided in the future? Were some backwards-incompatible changes made to the catalogs / docker image tags, or something like that?

It’s confusing, as most of the workers were running fine on those higher versions. New workers, as of about a week or two ago, started failing to bring up any containers (infrastructure or otherwise) with the timeout error.