kops: Cannot use terraform and gossip-based cluster at the same time
If you create a cluster with both the terraform and gossip options enabled, all kubectl commands will fail.
How to reproduce the error
My environment
$ uname -a
Darwin *****.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64
$ kops version
Version 1.6.2
$ terraform version
Terraform v0.9.11
$ aws --version
aws-cli/1.11.117 Python/2.7.10 Darwin/16.6.0 botocore/1.5.80
Setting up the cluster
# Create RSA key
ssh-keygen -f shared_rsa -N ""
# Create S3 bucket
aws s3api create-bucket \
--bucket=kops-temp \
--region=ap-northeast-1 \
--create-bucket-configuration LocationConstraint=ap-northeast-1
# Generate the Terraform code; some resources,
# including *certificates*, will be stored in S3
kops create cluster \
--name=kops-temp.k8s.local \
--state=s3://kops-temp \
--zones=ap-northeast-1a,ap-northeast-1c \
--ssh-public-key=./shared_rsa.pub \
--out=. \
--target=terraform
# Create cluster
terraform init
terraform plan -out ./create-cluster.plan
terraform show ./create-cluster.plan | less -R # final review
terraform apply ./create-cluster.plan # fire
# Done
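At this point the PKI material mentioned above already sits in the state store: the self-signed certificate was generated at kops create cluster time, before the ELB ever existed. If you want to confirm this yourself, a quick listing of the bucket should show it (bucket and cluster names are the ones from this example; the exact key layout may vary by kops version):
# List the PKI objects kops wrote to the state store
aws s3 ls s3://kops-temp/kops-temp.k8s.local/ --recursive | grep -i pki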
Spoiler alert: creating the self-signed certificate before creating the actual Kubernetes cluster is the root cause of this issue. Read on to see why.
Scenario 1. Looking up non-existent domain
$ kubectl get nodes
Unable to connect to the server: dial tcp: lookup api.kops-temp.k8s.local on 8.8.8.8:53: no such host
This is basically because of an erroneous ~/.kube/config file. If you run kops create cluster with both the terraform and gossip options enabled, you’ll get a wrong ~/.kube/config file.
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: ABCABCABCABC...
server: https://api.kops-temp.k8s.local
# !!!! There's no such domain named "api.kops-temp.k8s.local"
name: kops-temp.k8s.local
# ...
Let’s manually correct that file. Alternatively, you’ll get a correct config file if you explicitly export the configuration once again:
kops export kubecfg kops-temp.k8s.local --state s3://kops-temp
Then the non-existent domain will be replaced with the DNS name of the master nodes’ ELB.
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: ABCABCABCABC...
server: https://api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
name: kops-temp.k8s.local
# ...
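As a quick sanity check (not part of the original steps), you can print the endpoint kubectl will actually talk to:
# Show the API server endpoint of the current kubectl context
kubectl config view --minify --output 'jsonpath={.clusters[0].cluster.server}'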
And you’ll end up in scenario 2 when you retry.
Scenario 2. Invalid certificate
$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for api.internal.kops-temp.k8s.local, api.kops-temp.k8s.local, kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, not api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com
This is simply because the DNS name of the ELB is not included in the certificate. This scenario occurs only when you create the cluster with the terraform option enabled. If you create the cluster with only the gossip option, without the terraform target, the self-signed certificate properly contains the DNS name of the ELB.
(Screenshot omitted; it showed the certificate’s list of DNS subject alternative names, with some Korean UI text.)
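If you want to inspect the SANs yourself, something like the following works against the live endpoint (the ELB hostname is the placeholder from this example):
# Dump the subject alternative names of the certificate the API server presents
echo | openssl s_client -connect api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com:443 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'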
The only way to work around this problem is to force “api.kops-temp.k8s.local” to point to the proper IP address by manually editing /etc/hosts, which is undesirable for many people.
# Recover ~/.kube/config
perl -i -pe \
's|api-kops-temp-k8s-local-nrvnqsr-666666\.ap-northeast-1\.elb\.amazonaws\.com|api.kops-temp.k8s.local|g' \
~/.kube/config
# Hack /etc/hosts
host api-kops-temp-k8s-local-nrvnqsr-666666.ap-northeast-1.elb.amazonaws.com |
perl -pe 's|^.* address (.*)$|\1\tapi.kops-temp.k8s.local|g' |
sudo tee -a /etc/hosts
# This will succeed
kubectl get nodes
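When you no longer need the hack (or after tearing the cluster down), remember to remove the entry again; on macOS a BSD-sed one-liner like this should do it:
# Remove the temporary /etc/hosts entry (BSD/macOS sed syntax)
sudo sed -i '' '/api\.kops-temp\.k8s\.local/d' /etc/hosts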
I’m not very familiar with kops internals, but I expect a huge change would be needed to properly fix this issue. Maybe using AWS Certificate Manager could be a solution (#834). Any ideas?
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 17
- Comments: 64 (17 by maintainers)
Commits related to this issue
- Workaround kops+gossip+terraform issue. Apply the fix from https://github.com/kubernetes/kops/issues/2990#issuecomment-417333014 to ensure the gossip-based cluster is reachable when using terraform o... — committed to mattermost/mattermost-cloud by lieut-data 5 years ago
- Workaround kops+gossip+terraform issue. (#7) Apply the fix from https://github.com/kubernetes/kops/issues/2990#issuecomment-417333014 to ensure the gossip-based cluster is reachable when using terr... — committed to mattermost/mattermost-cloud by lieut-data 5 years ago
Another workaround that does not require waiting to roll the master(s) is to create the ELB, then update the cluster and then do the rest of the terraform apply. Steps are:
terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
kops update cluster --out=. --target=terraform
terraform apply
If you run kops update cluster $NAME --target=terraform after the terraform apply, it’s actually going to generate a new certificate. Run kops export kubecfg $NAME after that and you’ve got a working setup. Although, I know, it’s not very straightforward.
@sybeck2k, we have also experienced this issue as of a few hours ago.
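Putting the two comments above together, a consolidated sketch of that workaround looks roughly like this (CLUSTERNAME and the resource names are the ones kops emits for a cluster called CLUSTERNAME.k8s.local; it assumes KOPS_STATE_STORE is exported, otherwise add --state=s3://... to each kops command):
NAME=CLUSTERNAME.k8s.local
# Create just the internet gateway and the API ELB first
terraform apply -target aws_internet_gateway.CLUSTERNAME-k8s-local
terraform apply -target aws_elb.api-CLUSTERNAME-k8s-local
# Regenerate the terraform code and certificates now that the ELB exists
kops update cluster $NAME --out=. --target=terraform
# Create everything else
terraform apply
# Refresh ~/.kube/config with the ELB endpoint
kops export kubecfg $NAME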
You will need to run
kops rolling-update cluster --cloudonly --force --yes
to force an update. This can take a while depending on the size of the cluster, but we have found that manually setting --master-interval or --node-interval can prevent nodes from reaching a Ready state. I suggest just grabbing some ☕️ and letting the default interval do its thing. It is still a workaround for now, but we have found it to be repeatably successful.
As above, this is still broken in:
Version 1.8.1 (git-94ef202)
Generally, as I understand it, the workaround flow is:
kops create cluster $NAME --target=terraform --out=.
terraform apply
kops rolling-update cluster $NAME --cloudonly --force --yes
(around 20 minutes with 3 masters and 3 nodes) and then it should work, but I had to re-export the kops config with kops export kubecfg $NAME, and now it works for both kops and kubectl. Are there any ideas on how to resolve this? I was also wondering whether the gossip-based approach is, in general, inferior to the DNS approach.
I am having the same problem here (see version info below); the workaround does indeed work, but it takes way too long to complete. It would be great if this could be resolved.
kops version
@chrislovecnm I can still reproduce this in 1.8.0-beta.1. Both steps are still required:
kops update cluster $NAME --target=terraform --out=.
kops rolling-update cluster --cloudonly --force --yes
@mbolek the issue indeed persists; kops version reports Version 1.10.0
I reproduced this in 1.8.0 after kops create cluster ... --target=terraform and terraform apply. I can confirm that running the following fixed it:
kops update cluster $NAME --target=terraform
kops rolling-update cluster $NAME --cloudonly --force --yes
This issue still persists… kops version = 1.11.1
kops validate cluster
Using cluster from kubectl context: milkyway.k8s.local
Validating cluster milkyway.k8s.local
unexpected error during validation: error listing nodes: Get https://api.milkyway.k8s.local/api/v1/nodes: dial tcp: lookup api.milkyway.k8s.local on 192.168.88.1:53: no such host
The configuration generated by kops and terraform continues to treat the API endpoint as the .k8s.local DNS name rather than the ELB DNS name.
Fyi: Still broken in 1.9.0
This should be fixed in master, if someone wants to test master, or wait for the 1.8 beta release.
Ok… so I thought I had something, but it seems the issue persists. You need to export the config to fix the API server endpoint, and you need to roll the master to fix the SSL cert.
The fix using rolling-update did not work for me.
Version 1.9.0 (git-cccd71e67)