kops: Couldn't find key etcd_endpoints in ConfigMap kube-system/calico-config

**1. What kops version are you running? The command kops version will display this information.** kops 1.12.1

**2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.** Upgrading from v1.11.10 to 1.12.8

**3. What cloud provider are you using?** AWS

**4. What commands did you run? What is the simplest way to reproduce this issue?**

kops rolling-update cluster --cloudonly --master-interval=1s --node-interval=1s --yes

**5. What happened after the commands executed?**

master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of \"5m0s\""

**6. What did you expect to happen?**

Validation to complete successfully

**9. Anything else we need to know?**

I clearly messed up the upgrade from v1.11.10 to 1.12.8

I originally ran

   kops update...
   kops rolling-update cluster --yes

The above failed on the first master with: master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of \"5m0s\""

Validation was failing because of a pending pod: kube-system pod "calico-complete-upgrade-v331-mz6z9" is pending

The pod's events show: Warning Failed XXXXX Error: Couldn't find key etcd_endpoints in ConfigMap kube-system/calico-config
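To double-check that the key really is missing, the ConfigMap and the pending pod can be inspected directly (standard kubectl commands; the pod name is the one from the event above):

   kubectl -n kube-system get configmap calico-config -o yaml
   kubectl -n kube-system describe pod calico-complete-upgrade-v331-mz6z9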

I then ran the following as per the official docs:

kops rolling-update cluster --cloudonly --master-interval=1s --node-interval=1s --yes

This upgraded all of the nodes, but validation is still failing with the error above.
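To see which pods are still failing, validation can be re-run on its own (standard kops command):

   kops validate cluster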

Can I terminate the master which originally failed?

Any help is appreciated

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 15 (3 by maintainers)

Most upvoted comments

@cn-5p1ke I still don't think the errors are 100% resolved for me, but I was able to validate the cluster in that given time. PS: I didn't have RBAC enabled on this test cluster.

In my case, I manually added the etcd_endpoints key to the ConfigMap and also changed the last-applied-configuration annotation, which was a pain. That got the calico-kube-controllers pod working again (it had been in a failed/pending state, which was why my cluster was not validating).
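For reference, a minimal sketch of that manual edit; the endpoint below is only a placeholder, and the real value has to match your own cluster's etcd client endpoints (check the calico-config of a working cluster for the correct hosts and port):

   kubectl -n kube-system patch configmap calico-config --type merge \
     -p '{"data":{"etcd_endpoints":"https://etcd-a.internal.your-cluster.example.com:4001"}}'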

However, I still see differences between this cluster and my other clusters. For example:

  • Pods - etcd-manager-events-ip in the test cluster
  • Pods - etcd-server-events-ip in the other clusters
  • Moreover, I don't see etcd-server-ip pods in the test cluster at all (see the command sketched below)
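A quick way to compare is simply listing the etcd-related pods in each cluster (plain kubectl, nothing cluster-specific):

   kubectl -n kube-system get pods -o wide | grep etcd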

The cluster seems to be running okay for now (inter-pod communication works), but I believe I will have to troubleshoot something real soon.

It's because of this here: the etcd2-to-etcd3 migration is disruptive to masters (I am on etcd 3). I will try to upgrade to 1.13 to see if this resolves the issue (since it is now stable).