kops: Cannot use terraform and gossip-based cluster at the same time

If you create a cluster with both the terraform and gossip options enabled, all kubectl commands will fail.


How to reproduce the error

My environment

$ uname -a
Darwin *****.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64

$ kops version
Version 1.6.2

$ terraform version
Terraform v0.9.11

$ aws --version
aws-cli/1.11.117 Python/2.7.10 Darwin/16.6.0 botocore/1.5.80

Setting up the cluster

# Create RSA key
ssh-keygen -f shared_rsa -N ""

# Create S3 bucket
aws s3api create-bucket \
  --bucket=kops-temp \
  --region=ap-northeast-1 \
  --create-bucket-configuration LocationConstraint=ap-northeast-1

# Generate Terraform code; some resources,
# including *certificates*, will be stored in S3
kops create cluster \
  --name=kops-temp.k8s.local \
  --state=s3://kops-temp \
  --zones=ap-northeast-1a,ap-northeast-1c \
  --ssh-public-key=./shared_rsa.pub \
  --out=. \
  --target=terraform

# Create cluster
terraform init
terraform plan -out ./create-cluster.plan
terraform show ./create-cluster.plan | less -R # final review
terraform apply ./create-cluster.plan # fire

# Done

Spoiler Alert: Creating the self-signed certificate before creating the actual Kubernetes cluster is the root cause of this issue. Please continue reading to see why.
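You can see this for yourself: right after kops create cluster, and before any terraform apply, the state store already contains issued certificates. The listing below is my own quick check, not part of the original steps, and the pki/ layout is an assumption that may differ between kops versions.

# Certificates are already issued into the state store at this point
# (exact path layout may vary by kops version)
aws s3 ls --recursive s3://kops-temp/kops-temp.k8s.local/pki/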

Scenario 1. Looking up non-existent domain

$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.kops-temp.k8s.local on 8.8.8.8:53: no such host
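You can confirm that the name really has no public record (a quick check using the hostname from this example, not part of the original report):

# Query Google DNS directly; expect no answer (NXDOMAIN)
dig +short api.kops-temp.k8s.local @8.8.8.8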

This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform and gossip options enabled, you’ll get a wrong ~/.kube/config file.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api.kops-temp.k8s.local
            # !!!! There's no such domain named "api.kops-temp.k8s.local"
  name: kops-temp.k8s.local
# ...

Let’s manually correct that file. Alternatively, you’ll get a correct config file if you explicitly export the configuration once again.

kops export kubecfg kops-temp.k8s.local --state s3://kops-temp

Then the non-existent domain will be replaced with the DNS name of the master nodes’ ELB.

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ABCABCABCABC...
    server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
  name: kops-temp.k8s.local
# ...
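As a quick sanity check (my addition, not from the original steps), you can print the server endpoint that kubectl will actually use:

# Should now print the ELB DNS name instead of api.kops-temp.k8s.local
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'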

And you’ll end up in Scenario 2 when you retry.

Scenario 2. Invalid certificate

$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for api.internal.kops-temp.k8s.local, api.kops-temp.k8s.local, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, not api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com

This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform target enabled. If you create the cluster with only the gossip option, without using the terraform target, the self-signed certificate properly contains the DNS name of the ELB.

[Screenshot: the list of DNS alternative names in the certificate (sorry for the Korean UI)]
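If you’d rather not rely on a screenshot, a rough way to inspect the SANs yourself is to ask the API ELB for its certificate (the ELB hostname below is the placeholder from this report):

# List the DNS alternative names in the certificate served by the API ELB
echo | openssl s_client -connect api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com:443 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'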

The only way to work around this problem is to force "api.kops-temp.k8s.local" to resolve to the proper IP address by manually editing /etc/hosts, which is undesirable for many people.

# Recover ~/.kube/config
perl -i -pe \
    's|api-kops-temp-k8s-local-nrvnqsr-666666\.ap-northeast-1\.elb\.amazonaws\.com|api.kops-temp.k8s.local|g' \
    ~/.kube/config

# Hack /etc/hosts
host api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com |
    perl -pe 's|^.* address (.*)$|\1\tapi.kops-temp.k8s.local|g' |
    sudo tee -a /etc/hosts

# This will succeed
kubectl get nodes

I’m not very familiar with kops internals, but I expect a fairly large change is needed to properly fix this issue. Maybe using AWS Certificate Manager could be a solution (#834). Any ideas?

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 17
  • Comments: 64 (17 by maintainers)

Most upvoted comments

Another workaround that does not require waiting to roll the master(s) is to create the ELB first, then update the cluster, and then apply the rest of the Terraform plan. The steps are (see the consolidated sketch after this list):

  • Create cluster as usual
  • Create internet gateway, or ELB will fail to deploy: terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
  • Create ELB: terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
  • Update cluster (which will catch the DNS name for the ELB and issue a new master cert, as well as export a new kubecfg): kops update cluster --out=. --target=terraform
  • Create everything else: terraform apply
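A consolidated sketch of that sequence, reusing the cluster name and state bucket from the original report; the exact Terraform resource names (aws_internet_gateway.*, aws_elb.api-*) depend on your cluster name and kops version:

kops create cluster \
  --name=kops-temp.k8s.local \
  --state=s3://kops-temp \
  --zones=ap-northeast-1a,ap-northeast-1c \
  --ssh-public-key=./shared_rsa.pub \
  --out=. --target=terraform
terraform init
terraform apply -target aws_internet_gateway.kops-temp-k8s-local  # gateway first, or the ELB fails to deploy
terraform apply -target aws_elb.api-kops-temp-k8s-local           # create the API ELB
kops update cluster kops-temp.k8s.local --state=s3://kops-temp --out=. --target=terraform
terraform apply                                                    # everything else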

If you run kops update cluster $NAME --target=terraform after the terraform apply, it’s actually going to generate a new certificate. Run kops export kubecfg $NAME after that and you’ve got a working setup. Although, I know, it’s not exactly straightforward.
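Spelled out with the names from the original report (a sketch of what this comment describes; the commented-out rolling-update at the end comes from later comments in this thread, not from this one):

terraform apply                                   # initial apply of the generated config
kops update cluster kops-temp.k8s.local --state=s3://kops-temp --target=terraform --out=.
kops export kubecfg kops-temp.k8s.local --state=s3://kops-temp
# Later comments report a rolling update of the masters may also be required:
# kops rolling-update cluster kops-temp.k8s.local --state=s3://kops-temp --cloudonly --force --yes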

@sybeck2k, we have also experienced this issue as of a few hours ago.

You will need to run kops rolling-update cluster --cloudonly --force --yes to force an update. This can take a while depending on the size of the cluster, but we have found that trying to manually set the --master-interval or --node-interval can prevent nodes from reaching a Ready state. I suggest just grabbing some ☕️ and letting the default interval do its thing.

It is still a workaround solution atm, but we have found it to be repeatably successful.

As above, this is still broken in: Version 1.8.1 (git-94ef202)

Generally, as I understand it, the workaround flow is:

  • kops create cluster $NAME --target=terraform --out=.
  • terraform apply
  • kops rolling-update cluster $NAME --cloudonly --force --yes (around 20 minutes with 3 masters and 3 nodes)

After that it should work, but I also had to re-export the kops config with kops export kubecfg $NAME, and now it works for both kops and kubectl. Are there any ideas on how to resolve this? I was also wondering whether, in general, the gossip-based approach is inferior to the DNS approach.

I am having the same problem here (see version info below). The workaround does indeed work, but it takes way too long to complete - it would be great if this could be resolved.

kops version

Version 1.8.0 (git-5099bc5)      

@chrislovecnm I can still reproduce this in 1.8.0-beta.1. Both the steps are still required:

  • kops update cluster $NAME --target=terraform --out=.
  • kops rolling-update cluster --cloudonly --force --yes

@mbolek the issue indeed persists with kops Version 1.10.0.

I reproduced this in 1.8.0 after kops create cluster ... --target=terraform and terraform apply

I can confirm that running the following fixed it:

  • kops update cluster $NAME --target=terraform
  • kops rolling-update cluster $NAME --cloudonly --force --yes

This issue still persists… kops version = 1.11.1

kops validate cluster

Using cluster from kubectl context: milkyway.k8s.local
Validating cluster milkyway.k8s.local

unexpected error during validation: error listing nodes: Get https://api.milkyway.k8s.local/api/v1/nodes: dial tcp: lookup api.milkyway.k8s.local on 192.168.88.1:53: no such host

The configuration generated by kops and terraform continues to treat the API endpoint as the .k8s.local DNS name rather than the ELB’s DNS name.

FYI: Still broken in 1.9.0

This should be fixed in master, if someone wants to test master or wait for the 1.8 beta release.

# kops version
Version 1.9.2 (git-cb54c6a52)

Ok… so I thought I had something, but it seems the issue persists. You need to export the config to fix the API server endpoint, and you need to roll the masters to fix the SSL cert.

The fix using rolling-update did not work for me.

Version 1.9.0 (git-cccd71e67)