terraform-provider-rancher2: rancher2_cluster_sync failed with [file-deployer] is still running

Hi,

We have a problem where the rancher2_cluster_sync resource fails with the error shown below.

Versions:

terraform 1.0.0
rancher2 = {
  source  = "rancher/rancher2"
  version = "1.15.1"
}
rke = {
  source  = "rancher/rke"
  version = "1.2.1"
}

Resource:

resource "rancher2_cluster_sync" "wait-for-cluster-is-ready" {
  depends_on = [
    null_resource.init_cluster,
  ]
  cluster_id      = rancher2_cluster.rancher-cluster.id
  state_confirm   = 2
  wait_alerting   = false
  wait_catalogs   = false
  wait_monitoring = false
}

Error: Cluster ID c-5srfq: Container [file-deployer] is still running on host [10.106.84.13]: stderr: [], stdout: []

  with module.ahoj-cluster.rancher2_cluster_sync.wait-for-cluster-is-ready,
  on modules/rancher-worker/rancher-cluster.tf line 190, in resource "rancher2_cluster_sync" "wait-for-cluster-is-ready":
 190: resource "rancher2_cluster_sync" "wait-for-cluster-is-ready" {

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 25 (11 by maintainers)

Most upvoted comments

It seems the issue is caused by cluster state flapping.

I submitted PR https://github.com/rancher/terraform-provider-rancher2/pull/732 to fix the issue. When the condition is false, the provider now makes an extra check before returning an error: it only fails if more than 120s have passed since Condition.LastUpdateTime.

@gohumble, unfortunately this is expected, since 1.17.1 doesn't include any other fix for this.

The issue seems to be a race condition occurring in Rancher, not in the tf provider. The condition should be unknown until the transition is finished; otherwise there is no way to check the cluster state reliably from the API. I'm investigating along these lines.