terraform-provider-kubernetes: v2.0.1 Authentication failures with token retrieved via aws_eks_cluster_auth

Terraform Version, Provider Version and Kubernetes Version

Terraform version: 0.12.24
Kubernetes provider version: 2.0.1
Kubernetes version: v1.16.15-eks-ad4801

Affected Resource(s)

Terraform Configuration Files

data "aws_eks_cluster" "c" {
  name = var.k8s_name
}

data "aws_eks_cluster_auth" "c" {
  name = var.k8s_name
}

provider "kubernetes" {
  host = data.aws_eks_cluster.c.endpoint

  cluster_ca_certificate = base64decode(data.aws_eks_cluster.c.certificate_authority.0.data)

  token = data.aws_eks_cluster_auth.c.token
}

Debug Output

Panic Output

Steps to Reproduce

Expected Behavior

What should have happened? Resources should have been created/modified/deleted.

Actual Behavior

What actually happened?

Error: the server has asked for the client to provide credentials
Error: Failed to update daemonset: Unauthorized
Error: Failed to update deployment: Unauthorized
Error: Failed to update deployment: Unauthorized
Error: Failed to update service account: Unauthorized
Error: Failed to update service account: Unauthorized
Error: Failed to delete Job! API error: Unauthorized
Error: Failed to update service account: Unauthorized
Error: the server has asked for the client to provide credentials
Error: the server has asked for the client to provide credentials
Error: Failed to update deployment: Unauthorized
Error: Failed to update service account: Unauthorized
Error: the server has asked for the client to provide credentials
Error: Failed to delete Job! API error: Unauthorized
Error: Failed to update daemonset: Unauthorized

Important Factoids

No, we’re just using EKS.

References

  • GH-1234

Community Note

  • Please vote on this issue by adding a +1 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 39
  • Comments: 37 (10 by maintainers)

Most upvoted comments

Using exec is not a viable solution when running in Terraform Cloud with remote execution. Our current thinking is to implement a workaround that essentially taints the aws_eks_cluster_auth data source so it gets refreshed for every plan. It would be ideal if the Kubernetes provider had native support for obtaining and refreshing managed Kubernetes service authentication tokens/credentials, in order to support environments in which the only guaranteed tooling is Terraform itself.

Can you try running terraform refresh to see if that pulls in a new token? The token generated by aws_eks_cluster_auth is only valid for 15 minutes. For this reason, we recommend using an exec plugin to keep the token up to date automatically. Here’s an example of that configuration:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

Alternatively, running the Kubernetes provider in a separate terraform apply from the EKS cluster creation should work every time. (I’m not sure offhand whether your EKS cluster is being created in the same apply, but I’m guessing it is, since that’s a common configuration.)

There’s also a working EKS example you can compare with your configs. There are some improvements coming soon for the example, since we’re working on related authentication issues.
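
For illustration, the second configuration in such a split would contain only data sources, the provider, and the Kubernetes resources, something like the sketch below (var.cluster_name and the example namespace are placeholders, not taken from this thread):

# Root module applied separately, after the EKS cluster already exists.
data "aws_eks_cluster" "this" {
  name = var.cluster_name
}

data "aws_eks_cluster_auth" "this" {
  name = var.cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}

resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
  }
}

Since nothing in this configuration waits on cluster creation, the token is read fresh during each plan/apply run.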

This issue, at the very least, should prompt a review of all of the official documentation, since you cannot actually use the provider in its documented state.

I’m just using local-exec to deploy the few Kubernetes resources I want to “manage” with Terraform. At the moment I don’t want to split my rather small Terraform state into at least two layers just to be able to use the Kubernetes provider properly with an AWS EKS cluster 💁‍♀️
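
For reference, a minimal sketch of that kind of local-exec deployment, assuming the AWS CLI and kubectl are on the PATH, the null provider is available, and using a hypothetical manifests/ directory plus the var.k8s_name variable from the original report:

# Apply plain manifests with kubectl instead of the kubernetes provider.
resource "null_resource" "kubectl_apply" {
  # Re-run whenever any manifest file changes.
  triggers = {
    manifests_hash = join(",", [for f in fileset("${path.module}/manifests", "*.yaml") : filesha1("${path.module}/manifests/${f}")])
  }

  provisioner "local-exec" {
    command = "aws eks update-kubeconfig --name ${var.k8s_name} && kubectl apply -f ${path.module}/manifests/"
  }
}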

@dak1n1 This config worked for me. Thanks!

data "aws_eks_cluster" "default" {
  name       = module.eks.name
  depends_on = [module.eks]
}

data "aws_eks_cluster_auth" "default" {
  name = module.eks.name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}

We run into this issue with virtually every apply now that we use Atlantis:

  1. PR is opened with some changes to the TF config
  2. Atlantis runs terraform plan and comments the output on the PR
  3. Someone looks at the plan and approves it
  4. PR opener comments atlantis apply (which causes Atlantis to run terraform apply)
  5. Apply fails with Unauthorized if there are any kubernetes_* resources

This happens whenever the time between step 2 and step 4 is more than 15 minutes.

The workaround of calling aws eks get-token from the provider configuration would only work if we add the AWS CLI to the Atlantis container image. We can do that, but it seems like a bit of a hack.

Is it a limitation of Terraform that this provider cannot refresh the token during apply? Is there a related Terraform issue?

I ran into the same problem in TFC. The cause was that I used an assumed IAM role in the AWS provider:

provider "aws" {
  assume_role {
    role_arn = var.assume_role_arn
  }
}

I solved this problem by explicitly specifying the IAM role when getting the token, like so:

  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    args = ["eks", "get-token", "--cluster-name", module.eks.name, "--role-arn", var.assume_role]
  }

Also, you may have to pass your AWS region explicitly.
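
Combined, the exec block might look like this (a sketch; var.aws_region and var.assume_role are assumed to be defined elsewhere in the configuration):

  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    # --region is passed globally so the token is requested from the correct regional endpoint.
    args = [
      "--region", var.aws_region,
      "eks", "get-token",
      "--cluster-name", module.eks.name,
      "--role-arn", var.assume_role,
    ]
  }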

I would not call the exec solution a hack. It’s the default and preferred mechanism for credentials access on both EKS and GKE and it’s what the official tooling from both cloud providers uses by default.

Have a look at the contents of a kubeconfig file produced by the AWS CLI:

➤ aws eks update-kubeconfig --name k8s-dev
Added new context arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev to /Users/alex/.kube/config

➤ kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://XXXXXXXXXXXXXXXXXXXXXXXX.gr7.eu-central-1.eks.amazonaws.com
  name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
contexts:
- context:
    cluster: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
    user: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
  name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
current-context: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
kind: Config
preferences: {}
users:
- name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      args:
      - --region
      - eu-central-1
      - eks
      - get-token
      - --cluster-name
      - k8s-dev
      command: aws
      env: null
      interactiveMode: IfAvailable
      provideClusterInfo: false

The same happens on GKE, and for good reason.

Most IAM systems advise using short-lived credentials obtained via some sort of dynamic role impersonation. EKS doesn’t allow setting the lifespan of the token for the same reason: they want users to adopt role impersonation, which is the least risky way to handle credentials. This really isn’t a hack.
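
For comparison with the EKS exec example above, a minimal sketch of an equivalent GKE provider configuration, assuming a google_container_cluster data source named "gke" and the gke-gcloud-auth-plugin binary on the PATH:

provider "kubernetes" {
  host                   = "https://${data.google_container_cluster.gke.endpoint}"
  cluster_ca_certificate = base64decode(data.google_container_cluster.gke.master_auth[0].cluster_ca_certificate)
  exec {
    # The plugin obtains and refreshes short-lived tokens on demand.
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "gke-gcloud-auth-plugin"
  }
}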

Back on the topic of Terraform, there is a solid reason why the data source is not refreshed before apply in your scenario. Since Atlantis supplies a pre-generated plan to the terraform apply command, the contract implies that those should be the only changes enacted by Terraform during the apply. If it were to refresh data sources, that could propagate new values through the plan and incur changes to resources after the plan had been reviewed and approved, negating the value of that process.

In conclusion, there really isn’t any better way of handling these short-lived credentials other than auth plugins.

Did some further digging and we may be barking in the wrong place: https://github.com/hashicorp/terraform-provider-aws/issues/10269#issuecomment-777906069

You can get around this with Kubernetes Service Account Tokens. The code snippet would look something like this:

# create service account
resource "kubernetes_service_account_v1" "terraform_admin" {
  metadata {
    name      = "terraform-admin"
    namespace = "kube-system"
    labels    = local.labels
  }
}

# grant privileges to the service account
module "terraform_admin" {
  source  = "aidanmelen/kubernetes/rbac"
  version = "v0.1.1"

  labels = local.labels

  cluster_roles = {
    "cluster-admin" = {
      create_cluster_role       = false
      cluster_role_binding_name = "terraform-admin-global"
      cluster_role_binding_subjects = [
        {
          kind = "ServiceAccount"
          name = kubernetes_service_account_v1.terraform_admin.metadata[0].name
        }
      ]
    }
  }
}

# retrieve the service account token from its secret
data "kubernetes_secret" "terraform_admin" {
  metadata {
    name      = kubernetes_service_account_v1.terraform_admin.metadata[0].name
    namespace = kubernetes_service_account_v1.terraform_admin.metadata[0].namespace
  }
}

# call provider with long-lived service account token
provider "kubernetes" {
  alias                  = "terraform-admin"
  host                   = "https://kubernetes.docker.internal:6443"
  cluster_ca_certificate = data.kubernetes_secret.terraform_admin.data["ca.crt"]
  token                  = data.kubernetes_secret.terraform_admin.data["token"]
}

Please see the authn-authz example from the aidanmelen/kubernetes/rbac module for more information.

⚠️ This comes with a security trade-off, since the token will need to be rotated manually.
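
Note that on Kubernetes 1.24+ token Secrets are no longer created automatically for ServiceAccounts, so one would have to be created explicitly. A rough sketch, reusing the names from the snippet above (the wait_for_service_account_token argument may require a recent provider version):

# create a long-lived token secret for the service account (needed on Kubernetes 1.24+)
resource "kubernetes_secret_v1" "terraform_admin" {
  metadata {
    name      = "terraform-admin-token"
    namespace = "kube-system"
    annotations = {
      "kubernetes.io/service-account.name" = kubernetes_service_account_v1.terraform_admin.metadata[0].name
    }
  }

  type                           = "kubernetes.io/service-account-token"
  wait_for_service_account_token = true
}

The kubernetes_secret data source above would then read this secret by its name ("terraform-admin-token") rather than the service account name.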

A related issue is that this provider seems to update the state with the changes it attempted to apply, as if the apply had succeeded, even though authentication failed due to expired credentials.

So if you plan a change, wait 15 minutes, and then try to apply the plan, you will get an error like “Error: the server has asked for the client to provide credentials”. If you then plan again with -refresh=false, you get “No changes. Your infrastructure matches the configuration”. On large states this increases the pain considerably, as it forces repeated refreshes of the state, which can take tens of minutes or more.

TFC allows you to run custom agents as Docker containers, and it should be easy to add the auth plugins to those. It does imply managing your own worker pool, which not everyone may want to do. The TFC development team is aware of this limitation, but they may not be aware of the number of users affected. It may help to add weight to the issue by letting them know about it through their support channels.

Since Atlantis supplies a pre-generated plan to the terraform apply command, the contract implies that those should be the only changes enacted by Terraform during the apply. If it were to refresh data sources, that could propagate new values through the plan and incur changes to resources after the plan had been reviewed and approved, negating the value of that process.

Thanks, this is the key insight I was missing: it is indeed not possible for the data source to be refreshed at apply time.

It’s unfortunate, though, that this means Terraform Cloud users are out of luck. We can build the AWS CLI into our Atlantis image and set up processes for keeping it up to date; that’s an inconvenience but not that bad. On some platforms, however, there is no similar solution that would allow the exec approach to be used.

@jbg without logs and samples of your configuration, there isn’t a lot to go on in your report. There are also no Terraform, provider, or cluster versions listed. Please help us help you.

@dak1n1 I am considering this as a temporary workaround.