kops: Master failing to join - connection refused
1. What kops version are you running? The command kops version will display
this information.
error querying kubernetes version: Get https://127.0.0.1/version?timeout=32s: dial tcp 127.0.0.1:443: connect: connection refused
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running, or provide the Kubernetes version specified as
a kops flag.
kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.4", GitCommit:"c27b913fddd1a6c480c229191a087698aa92f0b1", GitTreeState:"clean", BuildDate:"2019-02-28T13:37:52Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
kops upgrade cluster
kops update cluster --yes
kops rolling-update cluster --yes
5. What happened after the commands executed?
kops rolling-update cluster --yes
Using cluster from kubectl context: spaceti.co
NAME                  STATUS       NEEDUPDATE  READY  MIN  MAX  NODES
bastions              Ready        0           1      1    1    0
c4xlNodes             NeedsUpdate  3           0      3    5    3
master-eu-central-1a  NeedsUpdate  1           0      1    1    0
master-eu-central-1b  NeedsUpdate  1           0      1    1    1
master-eu-central-1c  NeedsUpdate  1           0      1    1    0
nodes                 NeedsUpdate  7           0      5    10   7
W0311 14:49:27.065301 15975 instancegroups.go:175] Skipping drain of instance "i-058c7724d9149892a", because it is not registered in kubernetes
W0311 14:49:27.065355 15975 instancegroups.go:183] no kubernetes Node associated with i-058c7724d9149892a, skipping node deletion
I0311 14:49:27.065375 15975 instancegroups.go:301] Stopping instance "i-058c7724d9149892a", in group "master-eu-central-1a.masters.spaceti.co" (this may take a while).
I0311 14:49:27.222670 15975 instancegroups.go:198] waiting for 5m0s after terminating instance
I0311 14:54:27.223004 15975 instancegroups.go:209] Validating the cluster.
I0311 14:54:28.726834 15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
I0311 14:55:00.005545 15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
I0311 14:55:30.000604 15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
I0311 14:55:59.560848 15975 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "i-001a36920064a091c" has not yet joined cluster.
6. What did you expect to happen?
I expected the master nodes to join the cluster.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2018-09-06T17:05:29Z
  name: cluster.co
spec:
  additionalPolicies:
    node: |
      [
        {"Effect":"Allow","Action":["autoscaling:DescribeAutoScalingGroups","autoscaling:DescribeAutoScalingInstances","autoscaling:DescribeLaunchConfigurations","autoscaling:DescribeTags","autoscaling:SetDesiredCapacity","autoscaling:TerminateInstanceInAutoScalingGroup"],"Resource":"*"},
        {
          "Effect": "Allow",
          "Action": [
            "sts:AssumeRole"
          ],
          "Resource": [
            "arn:aws:iam::595924049331:role/k8s-*"
          ]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://private-state-store/cluster.co
  dnsZone: cluster.co
  encryptionConfig: true
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.7
  masterInternalName: api.internal.cluster.co
  masterPublicName: api.cluster.co
  networkCIDR: 172.20.0.0/16
  networking:
    calico: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.32.0/19
    name: eu-central-1a
    type: Private
    zone: eu-central-1a
  - cidr: 172.20.64.0/19
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: 172.20.96.0/19
    name: eu-central-1c
    type: Private
    zone: eu-central-1c
  - cidr: 172.20.0.0/22
    name: utility-eu-central-1a
    type: Utility
    zone: eu-central-1a
  - cidr: 172.20.4.0/22
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: 172.20.8.0/22
    name: utility-eu-central-1c
    type: Utility
    zone: eu-central-1c
  topology:
    bastion:
      bastionPublicName: bastion.cluster.co
    dns:
      type: Public
    masters: private
    nodes: private
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:31Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: bastions
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-eu-central-1a
  - utility-eu-central-1b
  - utility-eu-central-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-12-04T12:44:41Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: c4xlNodes
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/cluster.co: ""
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: c4.xlarge
  maxSize: 5
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: c4xlNodes
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c
  taints:
  - dedicated=apiProd:NoSchedule
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:29Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: master-eu-central-1a
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: ""
    kubernetes.io/cluster/cluster.co: owned
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1a
  role: Master
  subnets:
  - eu-central-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:30Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: master-eu-central-1b
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1b
  role: Master
  subnets:
  - eu-central-1b
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:31Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: master-eu-central-1c
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-central-1c
  role: Master
  subnets:
  - eu-central-1c
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-09-06T17:05:31Z
  labels:
    kops.k8s.io/cluster: cluster.co
  name: nodes
spec:
  cloudLabels:
    k8s.io/cluster-autoscaler/enabled: "true"
    kubernetes.io/cluster/cluster.co: ""
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: m4.large
  maxSize: 10
  minSize: 5
  role: Node
  subnets:
  - eu-central-1a
  - eu-central-1b
  - eu-central-1c
8. Please run the commands with the most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or into a gist and provide the gist link here.
When I try to do so, kops picks my remaining working master, and I'm afraid of being left with no master online.
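For what it's worth, verbose logs can be gathered without risking the last master: kops rolling-update cluster only previews changes until --yes is passed, and kops validate cluster is read-only. A minimal sketch (the log file names are just illustrative; klog output goes to stderr):

# dry-run preview of the rolling update at maximum verbosity -- nothing is terminated without --yes
kops rolling-update cluster -v 10 2> rolling-update.log

# read-only cluster validation at the same verbosity
kops validate cluster -v 10 2> validate.log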
9. Anything else we need to know?
journalctl -u kubelet:
Mar 11 13:53:19 ip-172-20-32-79 kubelet[2894]: W0311 13:53:19.427747 2894 cni.go:172] Unable to update cni config: No networks found in /etc/cni/net.d/
Mar 11 13:53:19 ip-172-20-32-79 kubelet[2894]: E0311 13:53:19.428413 2894 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Mar 11 13:53:20 ip-172-20-32-79 kubelet[2894]: E0311 13:53:20.277241 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:20 ip-172-20-32-79 kubelet[2894]: E0311 13:53:20.278324 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:20 ip-172-20-32-79 kubelet[2894]: E0311 13:53:20.279649 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:21 ip-172-20-32-79 kubelet[2894]: E0311 13:53:21.277794 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:21 ip-172-20-32-79 kubelet[2894]: E0311 13:53:21.278782 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:21 ip-172-20-32-79 kubelet[2894]: E0311 13:53:21.279968 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:22 ip-172-20-32-79 kubelet[2894]: E0311 13:53:22.278333 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:22 ip-172-20-32-79 kubelet[2894]: E0311 13:53:22.279275 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:22 ip-172-20-32-79 kubelet[2894]: E0311 13:53:22.280368 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://127.0.0.1/api/v1/pods?fieldSelector=spec.nodeName%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:23 ip-172-20-32-79 kubelet[2894]: E0311 13:53:23.278862 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://127.0.0.1/api/v1/nodes?fieldSelector=metadata.name%3Dip-172-20-32-79.eu-central-1.compute.internal&limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
Mar 11 13:53:23 ip-172-20-32-79 kubelet[2894]: E0311 13:53:23.280011 2894 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://127.0.0.1/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: connect: connection refused
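The connection-refused pattern above suggests the kubelet itself is running but nothing is listening on 127.0.0.1:443, i.e. the kube-apiserver container never came up on this master. A few checks that may narrow it down, assuming the standard kops master layout (static pod manifests in /etc/kubernetes/manifests, component logs under /var/log):

# is the apiserver or etcd container present / exited?
docker ps -a | grep -E 'apiserver|etcd'

# static pod manifests the kubelet should be running
ls /etc/kubernetes/manifests/

# component logs written by the static pods
tail -n 50 /var/log/kube-apiserver.log
tail -n 50 /var/log/etcd.log

# nodeup/bootstrap output, in case provisioning itself failed
journalctl -u kops-configuration --no-pager | tail -n 50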
Am I missing anything?
About this issue
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 30 (8 by maintainers)
FYI, I've tried to document a restore process here: https://www.hindenes.com/2019-08-09-Kops-Restore/
@fejta-bot: Closing this issue.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Hi everyone! I had the same issue when performing an upgrade from 1.12.10 to 1.14.8. The errors I saw in the logs were about "CNI not ready"; indeed, when I ran ifconfig there were no interfaces other than lo/eth0, and docker ps -a showed that a flannel container had exited some time ago. There were also no config files at all in /etc/cni/net.d.
I went ahead with restoring the etcd volumes from snapshots and put all the needed tags on them (in AWS). However, I found that even if you put an incorrect value into the KubernetesCluster tag key, the volume still gets mounted! So I ended up with master nodes whose volumes were partially old and partially restored from backup. After that, all master nodes were hitting 90% CPU and it was nearly impossible to SSH into them; inside a node, I saw enormous RAM consumption by etcd. I then scaled down all masters and renamed the tags (both keys and values) on the old volumes to keep them from being mounted. After scaling back up, all the volumes from snapshots were mounted as expected.
Heads up on this: even if you see network-related errors in the logs once your master nodes are running, make sure you give your cluster enough time to start. What I mean is: do not rush to check the cluster's health. I tested multiple times and the result was consistent: it took around 10-11 minutes for the master nodes to become Ready. When I checked the logs within the first 2-5 minutes after startup, I was still seeing those network errors.
Summary of what to check (a combined sketch follows at the end of this comment):
- kubelet status: systemctl status kubelet
- kubelet logs: journalctl -u kubelet
- API server logs: cat /var/log/kube-apiserver.log
- docker: docker ps -a to find exited containers, then docker logs <container_name> against those containers
Useful article: https://itnext.io/kubernetes-master-nodes-backup-for-kops-on-aws-a-step-by-step-guide-4d73a5cd2008
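A minimal script tying those checks together (the log path and container filter are assumptions based on a standard kops master):

#!/bin/bash
# quick control-plane triage on a kops master
set -u
echo '--- kubelet ---'
systemctl --no-pager status kubelet | head -n 5
echo '--- exited containers ---'
docker ps -a --filter status=exited --format '{{.Names}}\t{{.Status}}'
echo '--- apiserver log tail ---'
tail -n 20 /var/log/kube-apiserver.log 2>/dev/null || echo 'no kube-apiserver log yet'
echo '--- CNI config ---'
ls /etc/cni/net.d/ 2>/dev/null || echo 'no CNI config yet'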