kops: AWS authenticator stops master from joining cluster
1. What kops version are you running? The command kops version will display this information.
$ kops version
Version 1.10.0
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
$ kops edit cluster
# .. added authentication section
$ kops rolling-update cluster --instance-group-roles=Master --force --yes
The authentication section added:
authentication:
  aws: {}
as described in: ./docs/authentication.md#aws-iam-authenticator
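For context, the standard kops workflow for a spec change is edit, then update, then rolling-update. A minimal sketch of the full sequence (flags taken from the commands above; the placement of the block under spec is assumed from the docs):
# Add the authentication block under spec in the cluster manifest:
#   spec:
#     authentication:
#       aws: {}
$ kops edit cluster
# Push the changed spec to the state store and cloud resources
$ kops update cluster --yes
# Roll the masters so they pick up the new configuration
$ kops rolling-update cluster --instance-group-roles=Master --force --yes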
5. What happened after the commands executed?
kops drained and stopped the first master node. A new EC2 instance was created by the AWS Auto Scaling group, but that instance was never able to join the cluster. The cluster validation step therefore failed.
$ kops rolling-update cluster --instance-group-roles=Master --force --yes
I1204 11:13:18.976387 85319 instancegroups.go:157] Draining the node: "ip-172-31-26-101.x.y.z".
node "ip-172-31-26-101.x.y.z" cordoned
node "ip-172-31-26-101.x.y.z" cordoned
node "ip-172-31-26-101.x.y.z" drained
I1204 11:13:34.924664 85319 instancegroups.go:338] Waiting for 1m30s for pods to stabilize after draining.
I1204 11:15:04.911031 85319 instancegroups.go:278] Stopping instance "i-0d7d0c586f9a7", node "ip-172-31-26-101.x.y.z", in group "master-x.masters.k8s.y.z" (this may take a while).
I1204 11:20:06.535982 85319 instancegroups.go:188] Validating the cluster.
I1204 11:20:14.653181 85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:20:48.953718 85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:21:18.639329 85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:21:48.260554 85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:22:18.447799 85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duation of \"5m0s\""
Seeing this in the master node's logs:
main.go:142] got error running nodeup (will retry in 30s): error building loader: certificate "aws-iam-authenticator" not found
s3fs.go:219] Reading file "s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator/keyset.yaml"
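A quick way to confirm whether the keypair nodeup is looking for ever made it into the state store (a debugging sketch, reusing the bucket and cluster name from the log line above):
# List the secrets/keypairs kops knows about for the cluster
$ kops get secrets
# Check whether the issued keyset actually exists in the state store
$ aws s3 ls s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator/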
6. What did you expect to happen?
All the master nodes to be updated with aws-iam-authenticator enabled and ready for action.
/cc @rdrgmnzs
About this issue
- State: closed
- Created 6 years ago
- Comments: 49 (20 by maintainers)
Commits related to this issue
- Update docs on authentication Address #6154 — committed to flands/kops by flands 5 years ago
I’m hitting a similar issue. What I experienced was this:
Added authentication: aws: {} to an existing cluster. At this point, I'm getting the following message, though:
And I confirmed the flag is not passed to the api-server pod.
@phillipj I think I kinda found a workaround; it seems to be working so far, but it has increased fresh cluster creation time by at least 20 minutes. I think we need a permanent solution though.
Edit the kops manifest and save it with:
then apply the commands below.
I then checked the results with kubectl get pods -n kube-system | grep iam.
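For anyone following along, checking the deployed authenticator as in that last step would look roughly like this (the DaemonSet name is an assumption based on the authenticator's example deployment):
# Check that the authenticator DaemonSet exists and its pods are running on the masters
$ kubectl get daemonset aws-iam-authenticator -n kube-system
$ kubectl get pods -n kube-system -o wide | grep aws-iam-authenticator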
Same problem here.
Hi guys, I’m still working on this while also dealing with work and real life. Unfortunately I still have not been able to replicate this issue on either new or existing clusters. If you are able to share pastebins of your kops-configuration logs, protokube logs, kubelet logs and kops configs, that may help me identify what is causing the issue here. If any of you is able to identify the cause, PRs are always welcome as well.
In the meantime, if this is a blocker for you, remember that you can still manually deploy the authenticator without relying on kops to do so. Chris Hein has a great blog post on how to do so here.
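For completeness, the manual route also involves generating the authenticator's certificate/key and adding the webhook flags to kube-apiserver; one of its steps is applying the example manifest from the upstream project. A rough sketch of that step (the file location within the repo is an assumption and may differ between releases):
# Clone the authenticator project and apply its example DaemonSet/ConfigMap
$ git clone https://github.com/kubernetes-sigs/aws-iam-authenticator
$ kubectl apply -f aws-iam-authenticator/deploy/example.yaml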
Thanks a lot for confirming it’s not only me! 😄
I’m still awaiting further rollout to production clusters because of this. I fully understand the original contributor doesn’t have time to dig further into this; at the same time, I don’t know who else to ping…
Thanks for all the info and debugging. The certs are pulled during the host provisioning process with protokube, which is why I’m baffled by this issue. If you happen to hit this again, could you please take a look at the protokube logs and check whether it is failing to pull the certs for some reason?
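For anyone hitting this, a rough sketch of where to look on an affected master (in kops 1.10 protokube runs as a container on the host, so the exact commands are assumptions and may vary with your setup):
# nodeup / kops-configuration logs
$ journalctl -u kops-configuration --no-pager | tail -n 100
# protokube logs (find the container first)
$ docker ps | grep protokube
$ docker logs <protokube-container-id> 2>&1 | grep -iE 'aws-iam-authenticator|keyset|certificate'
# kubelet logs
$ journalctl -u kubelet --no-pager | tail -n 100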