kops: AWS authenticator stops master from joining cluster

1. What kops version are you running? The command kops version will display this information.

$ kops version
Version 1.10.0

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

$ kops edit cluster
# .. added authentication section
$ kops rolling-update cluster --instance-group-roles=Master --force --yes

The authentication section added:

authentication:
  aws: {}

as described in: ./docs/authentication.md#aws-iam-authenticator
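
For context, in the manifest opened by kops edit cluster this block sits under spec; a minimal sketch with a placeholder cluster name:

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: k8s.example.com   # placeholder cluster name
spec:
  authentication:
    aws: {}
  # ...rest of the existing spec unchanged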

5. What happened after the commands executed?

kops drained and stopped the first master node. A new EC2 instance was created by the AWS Auto Scaling group, but that instance was never able to join the cluster, so the cluster validation step failed.

$ kops rolling-update cluster --instance-group-roles=Master --force --yes

I1204 11:13:18.976387   85319 instancegroups.go:157] Draining the node: "ip-172-31-26-101.x.y.z".
node "ip-172-31-26-101.x.y.z" cordoned
node "ip-172-31-26-101.x.y.z" cordoned
node "ip-172-31-26-101.x.y.z" drained
I1204 11:13:34.924664   85319 instancegroups.go:338] Waiting for 1m30s for pods to stabilize after draining.
I1204 11:15:04.911031   85319 instancegroups.go:278] Stopping instance "i-0d7d0c586f9a7", node "ip-172-31-26-101.x.y.z", in group "master-x.masters.k8s.y.z" (this may take a while).
I1204 11:20:06.535982   85319 instancegroups.go:188] Validating the cluster.
I1204 11:20:14.653181   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:20:48.953718   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:21:18.639329   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:21:48.260554   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.
I1204 11:22:18.447799   85319 instancegroups.go:251] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-04ace14bd333d8" has not yet joined cluster.

master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duation of \"5m0s\""

Seeing this in the master node's logs:

main.go:142] got error running nodeup (will retry in 30s): error building loader: certificate "aws-iam-authenticator" not found
s3fs.go:219] Reading file "s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator/keyset.yaml"
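
These lines suggest nodeup is looking for an aws-iam-authenticator keypair that was never issued into the state store. A quick sketch of how to check (the bucket and cluster names mirror the redacted ones in the log above):

aws s3 ls s3://x-kops-state/k8s.x.y.z/pki/issued/aws-iam-authenticator/
# an empty or missing prefix here would be consistent with the "certificate not found" error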

6. What did you expect to happen?

All the master nodes to be updated with the aws-iam-authenticator enabled & ready for action.

/cc @rdrgmnzs

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 49 (20 by maintainers)

Most upvoted comments

I’m hitting a similar issue. What I experienced was this:

  • Added authentication: aws: {} to the existing cluster
  • The DaemonSet was created automatically (no kops update cluster or rolling-update needed)
  • Pods failed to start because the ConfigMap was missing
  • The ConfigMap was created (see the sketch after this list); the pods then started crashing due to cert problems
  • A rolling update of the masters was performed
  • The pods became healthy
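
For reference, the ConfigMap in question is the one described in ./docs/authentication.md#aws-iam-authenticator; a minimal sketch (cluster ID and role ARN are placeholders) looks roughly like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-iam-authenticator
  namespace: kube-system
  labels:
    k8s-app: aws-iam-authenticator
data:
  config.yaml: |
    # placeholder values
    clusterID: k8s.example.com
    server:
      mapRoles:
        - roleARN: arn:aws:iam::000000000000:role/KubernetesAdmin
          username: kubernetes-admin
          groups:
            - system:masters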

At this point, I’m getting the following message, though:

time="2019-06-10T17:46:40Z" level=info msg="reconfigure your apiserver with `--authentication-token-webhook-config-file=/etc/kubernetes/heptio-authenticator-aws/kubeconfig.yaml` to enable (assuming default hostPath mounts)"

And I confirmed the flag is not being passed to the kube-apiserver pod.
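
If kops is not wiring that flag up automatically, one possible workaround (an assumption on my part, not verified against this kops version) is to set it explicitly in the cluster spec, provided the path matches wherever the authenticator's generated kubeconfig is actually mounted on the masters:

spec:
  kubeAPIServer:
    # path taken from the authenticator log line above; adjust to the real hostPath mount
    authenticationTokenWebhookConfigFile: /etc/kubernetes/heptio-authenticator-aws/kubeconfig.yaml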

@phillipj I think I kind of found a workaround. It seems to be working so far, but it has increased fresh cluster creation time by at least 20 minutes. I think we need a permanent solution, though.

    # apply config for iam authenticator
    kubectl apply -f iam-config-map.yaml

Edit the kops manifest and save it with:

  authentication:
    aws: {}
  authorization:
    rbac: {}

then run the commands below:

    kops update cluster $NAME --yes
    kops rolling-update cluster ${NAME} --instance-group-roles=Master  --cloudonly --force --yes
    kops validate cluster
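
As an extra check, the rollout can also be watched directly (assuming the DaemonSet is named aws-iam-authenticator, as the pod names below suggest):

    kubectl -n kube-system rollout status daemonset/aws-iam-authenticator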

The results from kubectl get pods -n kube-system | grep iam were:

aws-iam-authenticator-4dtsh                                            1/1     Running   0          21m
aws-iam-authenticator-6zwt7                                            1/1     Running   0          30m
aws-iam-authenticator-rhr46                                            1/1     Running   0          26m

Same problem here.

Hi guys, I’m still working on this while also dealing with work and real life. Unfortunately I still have not been able to replicate this issue on either new or existing clusters. If you are able to share pastebins of your kops-configuration logs, protokube logs, kubelet logs, and kops configs, that may help me identify what is causing the issue here. If any of you are able to identify the cause, PRs are always welcome as well.
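
For anyone gathering those, a rough sketch of where they live on a kops-provisioned master (assuming the default kops Debian image, where nodeup runs under the kops-configuration systemd unit and protokube runs as a Docker container):

# run on the master; unit names and the protokube container are assumptions for the default image
journalctl -u kops-configuration --no-pager | tail -n 200   # nodeup / kops-configuration logs
journalctl -u kubelet --no-pager | tail -n 200              # kubelet logs
docker ps | grep protokube                                  # find the protokube container
docker logs <protokube-container-id> 2>&1 | tail -n 200     # protokube logs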

In the meantime, if this is a blocker for you, remember that you can still manually deploy the authenticator without relying on kops to do so. Chris Hein has a great blog post on how to do so here.

Thanks a lot for confirming it’s not only me! 😄

I’m still holding off on further rollout to production clusters because of this. I fully understand the original contributor doesn’t have time to dig further into it; at the same time, I don’t know who else to ping…

Thanks for all the info and debugging. The certs are pulled during the host provisioning process with protokube, which is why I’m baffled by this issue. If you happen to hit this again, could you please take a look at the protokube logs and check whether it is failing to pull the certs for some reason?