terraform-provider-rancher2: Imports of EKS clusters intermittently never finish

Versions

  • Rancher version: 2.6.8
  • Rancher Terraform provider: 1.24.0
  • Terraform: 1.2.2

Information about the Cluster

  • Kubernetes version: 1.21
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Hosted EKS

Describe the bug

Sometimes importing an EKS cluster never completes (Terraform reports “Still creating…” for 30 minutes and then times out), even though the cluster is active in the Rancher instance. Other times it finishes in seconds.

To Reproduce

The following code is used to import the cluster. The aws-auth ConfigMap has already been updated with the user referenced by the cloud credential.

resource "rancher2_cloud_credential" "this" {
  name        = var.name_prefix
  description = "Credentials used for managing ${var.name_prefix}"
  amazonec2_credential_config {
    access_key = aws_iam_access_key.rancher.id
    secret_key = aws_iam_access_key.rancher.secret
  }
}

resource "rancher2_cluster" "imported_eks_cluster" {
  name        = var.cluster_id
  description = "Terraform EKS cluster"
  eks_config_v2 {
    cloud_credential_id = rancher2_cloud_credential.this.id
    name                = var.cluster_id
    region              = var.region
    imported            = true
  }
}

Result

Sometimes the following output repeats until the timeout, even though the cluster is active in Rancher:

module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [10m40s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [10m50s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [11m0s elapsed]
module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster: Still creating... [11m10s elapsed]
...
Error: [ERROR] waiting for cluster (c-xfbkg) to be created: timeout while waiting for state to become 'pending' (last state: 'active', timeout: 30m0s)
│ 
│   with module.import_to_rancher[0].rancher2_cluster.imported_eks_cluster,
│   on .terraform/modules/import_to_rancher/main.tf line 27, in resource "rancher2_cluster" "imported_eks_cluster":
│   27: resource "rancher2_cluster" "imported_eks_cluster" {

Expected Result

The cluster is consistently imported in a few seconds.


About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 17 (6 by maintainers)

Most upvoted comments

We also sometimes ran into the problem mentioned at the beginning: the expectedState in Terraform (“pending”) and the status of the Rancher import (“active”) did not match.

Our previous workarounds tried to avoid the “active” state by, for example, setting up the authorisation or the network connection at a later time. In the end, however, this only left the cluster in the “waiting” status in Rancher, which also did not match the “pending” state the Terraform provider expects.

In my opinion, “active” should definitely be included in the expectedStates. Whether “waiting” should also be part of them is a topic for discussion: it depends on whether we are checking for a successful import, or only for a successfully created import “resource”. The latter would also include “waiting”, since as soon as all prerequisites are met, the import continues and hopefully jumps to the “active” state.
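The suggestion above can be sketched as a small acceptance check. This is a stdlib-only illustration, not the provider's actual code; the helper name `isExpected` and the state lists are hypothetical:

```go
package main

import "fmt"

// isExpected reports whether an observed cluster state counts as a
// successful creation. Accepting a set of states instead of the single
// "pending" state is what the comment above proposes.
func isExpected(state string, expected []string) bool {
	for _, s := range expected {
		if s == state {
			return true
		}
	}
	return false
}

func main() {
	// "waiting" is the debatable third candidate discussed above.
	expected := []string{"pending", "active"}
	fmt.Println(isExpected("active", expected)) // a fast import jumps straight to "active"
	fmt.Println(isExpected("waiting", expected))
}
```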

Currently our solution is to use the implemented fix from PR https://github.com/rancher/terraform-provider-rancher2/pull/1114 and we can confirm that it works fine.

Versions used:

  • Rancher version: 2.7.4/2.6.10
  • Rancher Terraform provider: 3.0.0
  • Terraform: 1.4.5
  • EKS downstream K8s: v1.24.x/v1.23.x
  • Rancher K8s: v1.24.13/v1.23.17

For QA, post the test steps here just to be clear on how to verify the intermittent import issue is resolved.

@cpinjani - please link your test plan here once you start working on rancher/eks-operator#84

Hey,

I have been testing this locally and was not able to reproduce the issue after multiple applies (maybe I was lucky).

All tests were done with the following versions:

Test 1:

  • Rancher version: 2.6.8
  • Rancher Terraform provider: 1.24.0
  • Terraform: 1.2.2
  • Kubernetes version: 1.22 (this was the oldest available version in EKS)
  • Cluster Type (Local/Downstream): Downstream Hosted EKS
  • Local Rancher cluster v1.24.4+k3s1


Test 2:

  • Rancher version: 2.6.8
  • Rancher Terraform provider: 1.24.1
  • Terraform: 1.3.2
  • Kubernetes version: 1.24
  • Cluster Type (Local/Downstream): Downstream Hosted EKS
  • Local Rancher cluster v1.24.4+k3s1


Also, I submitted PR https://github.com/rancher/terraform-provider-rancher2/pull/1114, which attempts to fix this issue.

As a workaround for those who think they need to destroy their entire state to re-import, I was able to get away with just removing rancher2_cluster from the state via

terraform state rm rancher2_cluster.mycluster

and then import it via

terraform import rancher2_cluster.mycluster c-abcd

That way I didn’t need to destroy everything Terraform had managed to provision so far. It seemed to work.

Ran into this today. From the provider source: https://github.com/rancher/terraform-provider-rancher2/blob/master/rancher2/resource_rancher2_cluster.go#L135

expectedState := "active"

if cluster.Driver == clusterDriverImported || (cluster.Driver == clusterDriverEKSV2 && cluster.EKSConfig.Imported) {
	expectedState = "pending"
}

it appears the provider expects the state to become "pending" first. However, if the Rancher side is faster than the provider's polling loop, the cluster may become "active" so quickly that the provider never observes "pending". From my limited understanding of Go, it would actually be possible to wait for multiple targets in

stateConf := &resource.StateChangeConf{
	Pending:    []string{},
	Target:     []string{expectedState},
	Refresh:    clusterStateRefreshFunc(client, newCluster.ID),
	Timeout:    d.Timeout(schema.TimeoutCreate),
	Delay:      1 * time.Second,
	MinTimeout: 3 * time.Second,
}
_, waitErr := stateConf.WaitForState()
if waitErr != nil {
	return fmt.Errorf("[ERROR] waiting for cluster (%s) to be created: %s", newCluster.ID, waitErr)
}

If, for EKS, the provider were allowed to wait for both the "pending" and "active" targets, this could probably be fixed.

I have been seeing this as well: on successful runs it takes seconds, but occasionally it hangs.