kops: Master failing to join - connection refused

1. What kops version are you running? The command kops version will display this information.

error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

kops upgrade cluster
kops update cluster --yes
kops rolling-update cluster --yes

5. What happened after the commands executed?

kops rolling-update cluster --yes
Using cluster from kubectl context: spaceti.co

NAME			STATUS		NEEDUPDATE	READY	MIN	MAX	NODES
bastions		Ready		0		1	1	1	0
c4xlNodes		NeedsUpdate	3		0	3	5	3
master-eu-central-1a	NeedsUpdate	1		0	1	1	0
master-eu-central-1b	NeedsUpdate	1		0	1	1	1
master-eu-central-1c	NeedsUpdate	1		0	1	1	0
nodes			NeedsUpdate	7		0	5	10	7
W0311 14:49:27.065301   15975 instancegroups.go:175] Skipping drain of instance "i-058c7724d9149892a", because it is not registered in kubernetes
W0311 14:49:27.065355   15975 instancegroups.go:183] no kubernetes Node associated with i-058c7724d9149892a, skipping node deletion
I0311 14:49:27.065375   15975 instancegroups.go:301] Stopping instance "i-058c7724d9149892a", in group "master-eu-central-1a.masters.spaceti.co" (this may take a while).
I0311 14:49:27.222670   15975 instancegroups.go:198] waiting for 5m0s after terminating instance
I0311 14:54:27.223004   15975 instancegroups.go:209] Validating the cluster.
I0311 14:54:28.726834   15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
I0311 14:55:00.005545   15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
I0311 14:55:30.000604   15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
I0311 14:55:59.560848   15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
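
When validation stalls like this, it helps to look directly at the instance that never joined. A minimal sketch, assuming SSH access through the bastion; the inspect_master helper name is mine, and the admin user, bastion name, and IP are illustrative (taken from this issue's manifest and logs), so substitute the real values:

```shell
# Sketch: jump through the bastion to the master that failed to join and
# inspect kubelet plus the control-plane containers.
# bastion.cluster.co and the admin user are assumptions from this manifest.
inspect_master() {
  local master_ip="$1"
  ssh -J "admin@bastion.cluster.co" "admin@$master_ip" \
    'sudo journalctl -u kubelet --no-pager -n 50; sudo docker ps -a'
}
# Example: inspect_master 172.20.32.79
```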

6. What did you expect to happen?

I expected the master nodes to join the cluster.

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-09-06T17:05:29Z
  name: cluster.co
spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"},
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::595924049331:role/k8s-*"
          ]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://private-state-store/cluster.co
  dnsZone: cluster.co
  encryptionConfig: true
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.7
  masterInternalName: api.internal.cluster.co
  masterPublicName: api.cluster.co
  networkCIDR: 172.20.0.0/16
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Private
    zone: eu-central-1a
  - cidr: 172.20.64.0/19
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: 172.20.96.0/19
    name: eu-central-1c
    type: Private
    zone: eu-central-1c
  - cidr: 172.20.0.0/22
    name: utility-eu-central-1a
    type: Utility
    zone: eu-central-1a
  - cidr: 172.20.4.0/22
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: 172.20.8.0/22
    name: utility-eu-central-1c
    type: Utility
    zone: eu-central-1c
  topology:
    bastion:
      bastionPublicName: bastion.cluster.co
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:31Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-eu-central-1a
  - utility-eu-central-1b
  - utility-eu-central-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-12-04T12:44:41Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: c4xlNodes
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/cluster.co: ""
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: c4.xlarge
  maxSize: 5
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: c4xlNodes
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c
  taints:
  - dedicated=apiProd:NoSchedule

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:29Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: master-eu-central-1a
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/cluster.co: owned
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1a
  role: Master
  subnets:
  - eu-central-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:30Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: master-eu-central-1b
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1b
  role: Master
  subnets:
  - eu-central-1b

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:31Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: master-eu-central-1c
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1c
  role: Master
  subnets:
  - eu-central-1c

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:31Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: nodes
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/cluster.co: ""
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 10
  minSize: 5
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c

8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

When I try to do so, kops picks my only remaining working master, and I'm afraid of being left without any master online.
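
One way to collect the -v 10 output without risking the healthy master is to restrict the rolling update to a single instance group via kops' --instance-group flag. A hedged sketch (the roll_one_group wrapper is mine; the cluster and group names come from the manifest above):

```shell
# Sketch: roll only the broken master's instance group, capturing verbose
# logs to a file so they can be attached to the issue.
roll_one_group() {
  local cluster="$1" group="$2"
  kops rolling-update cluster --name "$cluster" \
    --instance-group "$group" --yes -v 10 2>&1 | tee "rolling-update-$group.log"
}
# Example: roll_one_group cluster.co master-eu-central-1a
```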

9. Anything else do we need to know?

journalctl -u kubelet:

Mar 11 13:53:19 ip-172-20-32-79 kubelet[2894]: W0311 13:53:19.427747    2894 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Mar 11 13:53:19 ip-172-20-32-79 kubelet[2894]: E0311 13:53:19.428413    2894 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 11 13:53:20 ip-172-20-32-79 kubelet[2894]: E0311 13:53:20.277241    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:20 ip-172-20-32-79 kubelet[2894]: E0311 13:53:20.278324    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:20 ip-172-20-32-79 kubelet[2894]: E0311 13:53:20.279649    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:21 ip-172-20-32-79 kubelet[2894]: E0311 13:53:21.277794    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:21 ip-172-20-32-79 kubelet[2894]: E0311 13:53:21.278782    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:21 ip-172-20-32-79 kubelet[2894]: E0311 13:53:21.279968    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:22 ip-172-20-32-79 kubelet[2894]: E0311 13:53:22.278333    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:22 ip-172-20-32-79 kubelet[2894]: E0311 13:53:22.279275    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:22 ip-172-20-32-79 kubelet[2894]: E0311 13:53:22.280368    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:23 ip-172-20-32-79 kubelet[2894]: E0311 13:53:23.278862    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:23 ip-172-20-32-79 kubelet[2894]: E0311 13:53:23.280011    2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
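
These "connection refused" lines mean nothing is listening on 127.0.0.1:443 on that node, i.e. kube-apiserver itself never came up, so the next thing to check is the apiserver rather than kubelet. A small sketch of that check using bash's /dev/tcp redirection (the check_port helper name is mine; run it on the master itself):

```shell
# Sketch: report whether anything is accepting connections on a given port.
# On a healthy kops master, 127.0.0.1:443 should print "listening"; "refused"
# points at kube-apiserver (check /var/log/kube-apiserver.log and
# "docker ps -a" for an exited apiserver or etcd container).
check_port() {
  local host="${1:-127.0.0.1}" port="${2:-443}"
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "listening"
  else
    echo "refused"
  fi
}
# Example: check_port 127.0.0.1 443
```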

Am I missing anything?

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 30 (8 by maintainers)

Most upvoted comments

fyi I’ve tried to document a restore process here: https://www.hindenes.com/2019-08-09-Kops-Restore/

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

Hi everyone! I had the same issue when performing an upgrade from 1.12.10 to 1.14.8. The errors I saw in the logs were about "CNI not ready"; indeed, when I ran ifconfig there were no interfaces other than lo/eth0, and docker ps -a showed that the flannel container had exited some time ago. There were also, unsurprisingly, no config files in /etc/cni/net.d.

I went ahead with restoring the etcd volumes from snapshots and put all the needed tags on them (in AWS). However, I found that even if you put an incorrect value into the KubernetesCluster tag key, the volume still gets mounted! So I ended up with master nodes whose volumes were partially old and partially restored from backup; after that, all master nodes were hitting 90% CPU and it was nearly impossible to SSH into them. Inside a node, I saw enormous RAM consumption by etcd. After that I scaled down all masters and renamed the tags (both keys and values) on the old volumes to prevent them from being mounted. I scaled back up, and all the volumes restored from snapshots were mounted as expected.
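
The tag renaming described above can be scripted with the AWS CLI. A sketch under the assumption that the relevant key is KubernetesCluster; the retag_volume helper and the -old key suffix are mine, not anything kops defines:

```shell
# Sketch: move the KubernetesCluster tag aside on an old etcd volume so the
# master no longer auto-mounts it, leaving the snapshot-restored volume as
# the only candidate. The volume ID and cluster name are illustrative.
retag_volume() {
  local volume_id="$1" cluster="$2"
  aws ec2 delete-tags --resources "$volume_id" --tags Key=KubernetesCluster
  aws ec2 create-tags --resources "$volume_id" \
    --tags "Key=KubernetesCluster-old,Value=$cluster"
}
# Example: retag_volume vol-0123456789abcdef0 cluster.co
```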

Heads up: even if you're seeing network-related errors in the logs after your master nodes are running, make sure you give the cluster enough time to start. What I mean is: don't rush to check the cluster's health. I tested multiple times and the result was consistent for me: it took around 10-11 minutes for the master nodes to become Ready. When I rushed to check the logs within 2-5 minutes of startup, I would see those network errors.
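
Rather than eyeballing the 10-11 minute window, the waiting can be automated by polling validation. A minimal sketch (the wait_for_validation helper is mine; it simply reruns kops validate cluster until it passes or a deadline expires):

```shell
# Sketch: poll "kops validate cluster" until it succeeds or the timeout
# (in seconds) expires. Returns 0 on success, 1 if the deadline passes.
wait_for_validation() {
  local timeout="${1:-900}" interval="${2:-30}" deadline
  deadline=$(( $(date +%s) + timeout ))
  until kops validate cluster; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}
# Example: wait_for_validation 900 30
```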

Summary

  1. Check K8s release notes to make sure there are no breaking changes during the upgrade;
  2. Give your cluster some time to become ready, don’t rush;
  3. Check that you use a compatible kops image;
  4. Troubleshooting hints: kubelet process - systemctl status kubelet; kubelet logs - journalctl -u kubelet; API server logs - cat /var/log/kube-apiserver.log; Docker - docker ps -a to check for exited containers, then docker logs <container_name> against those containers.

Useful article: https://itnext.io/kubernetes-master-nodes-backup-for-kops-on-aws-a-step-by-step-guide-4d73a5cd2008