kubeadm: coredns fails with invalid kube-api endpoint

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:51:33Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:54:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-24T06:43:59Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: OpenStack

  • OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="18.04.1 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.1 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic

  • Kernel (e.g. uname -a): Linux kube-apiserver-1 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Others:

What happened?

$ kubectl -n kube-system get pods

NAME                                       READY     STATUS    RESTARTS   AGE
coredns-576cbf47c7-2vh8c                   1/1       Running   167        13h
coredns-576cbf47c7-q88fm                   1/1       Running   167        13h
kube-apiserver-kube-apiserver-1            1/1       Running   0          13h
kube-controller-manager-kube-apiserver-1   1/1       Running   2          13h
kube-flannel-ds-amd64-bmvs9                1/1       Running   0          13h
kube-proxy-dkkqs                           1/1       Running   0          13h
kube-scheduler-kube-apiserver-1            1/1       Running   2          13h

$ kubectl -n kube-system logs coredns-576cbf47c7-2vh8c

E1120 16:31:29.672203       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:31:29.672382       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:31:29.673053       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.672931       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:355: Failed to list *v1.Namespace: Get https://10.96.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.681605       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:348: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
E1120 16:32:00.682868       1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:350: Failed to list *v1.Endpoints: Get https://10.96.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
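
(Note: 10.96.0.1 is the ClusterIP of the default kubernetes Service, so these timeouts mean the CoreDNS pods cannot reach the API server over the service network. A quick way to narrow this down, using the pod names from the listing above, is to check that the Service and its endpoints exist and to look at kube-proxy on the node running CoreDNS:)

$ kubectl get service kubernetes -n default
$ kubectl get endpoints kubernetes -n default
$ kubectl -n kube-system logs kube-proxy-dkkqs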

What you expected to happen?

Healthy CoreDNS pods after installing the pod network add-on.

How to reproduce it (as minimally and precisely as possible)?

  1. Install Kubernetes using kubeadm, following the instructions at https://kubernetes.io/docs/setup/independent/high-availability/ (a minimal kubeadm config sketch is included after these steps).
  2. Download the flannel pod network add-on from https://raw.githubusercontent.com/coreos/flannel/bc79dd1505b0c8681ece4de4c0d86c5cd2643275/Documentation/kube-flannel.yml
  3. Add the following environment variables to kube-flannel.yml:

…
        - name: KUBERNETES_SERVICE_HOST
          value: "kube-apiserver.config-service.com"   # address of the host where the kube-apiserver is running
        - name: KUBERNETES_SERVICE_PORT
          value: "6443"
…

  4. Apply the modified kube-flannel.yml to the cluster: kubectl apply -f kube-flannel.yml
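
For reference, the HA guide linked in step 1 sets the load balancer endpoint through the kubeadm configuration rather than through flannel environment variables. A minimal sketch of such a kubeadm-config.yaml for v1.12 (field names per the v1alpha3 config API; the pod CIDR shown is flannel's default and an assumption about this setup):

apiVersion: kubeadm.k8s.io/v1alpha3
kind: ClusterConfiguration
kubernetesVersion: v1.12.2
apiServerCertSANs:
- "kube-apiserver.config-service.com"
controlPlaneEndpoint: "kube-apiserver.config-service.com:6443"
networking:
  podSubnet: "10.244.0.0/16"   # flannel's default pod network CIDR

It would then be passed to kubeadm init --config kubeadm-config.yaml on the first master.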

Anything else we need to know?

The external load balancer endpoint is kube-apiserver.config-service.com. It is configured as a TCP pass-through for port 6443, which works well for the three master nodes.

$ kubectl get nodes

NAME               STATUS    ROLES     AGE       VERSION
kube-apiserver-1   Ready     master    13h       v1.12.2
kube-apiserver-2   Ready     master    2m        v1.12.2
kube-apiserver-3   Ready     master    14s       v1.12.2
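
A hedged way to confirm the pass-through itself is to hit the apiserver through the load balancer from one of the nodes; even a 403 Forbidden response shows the connection reaches the apiserver:

$ curl -k https://kube-apiserver.config-service.com:6443/healthz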

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 27 (6 by maintainers)

Most upvoted comments

OK, for those of you that get to this point and are frustrated, TRUST ME when I tell you that it is easier to destroy the etcd cluster and start from scratch than to try to figure out this network issue. After three days I got tired of debugging and troubleshooting and decided to start from scratch, which I had already done a few times with the masters and the nodes; however, I had left the etcd cluster intact.

I decided to break down and destroy EVERYTHING including the etcd cluster and guess what? Now I have a fully working cluster:

$ kubectl -n kube-system get pods

NAME                                       READY     STATUS    RESTARTS   AGE
coredns-576cbf47c7-5tlwd                   1/1       Running   0          37m
coredns-576cbf47c7-vsj2z                   1/1       Running   0          37m
kube-apiserver-kube-apiserver-1            1/1       Running   0          37m
kube-apiserver-kube-apiserver-2            1/1       Running   0          15m
kube-apiserver-kube-apiserver-3            1/1       Running   0          14m
kube-controller-manager-kube-apiserver-1   1/1       Running   0          37m
kube-controller-manager-kube-apiserver-2   1/1       Running   0          15m
kube-controller-manager-kube-apiserver-3   1/1       Running   0          14m
kube-flannel-ds-amd64-2dtln                1/1       Running   0          8m
kube-flannel-ds-amd64-75bgw                1/1       Running   0          10m
kube-flannel-ds-amd64-cpcjv                1/1       Running   0          35m
kube-flannel-ds-amd64-dlwww                1/1       Running   0          8m
kube-flannel-ds-amd64-dwkjb                1/1       Running   1          15m
kube-flannel-ds-amd64-msx9l                1/1       Running   0          14m
kube-flannel-ds-amd64-smhfj                1/1       Running   0          9m
kube-proxy-5rdk7                           1/1       Running   0          10m
kube-proxy-8gfd7                           1/1       Running   0          9m
kube-proxy-9kfxv                           1/1       Running   0          37m
kube-proxy-c22dl                           1/1       Running   0          8m
kube-proxy-gkvz5                           1/1       Running   0          14m
kube-proxy-pxlrp                           1/1       Running   0          15m
kube-proxy-vmp5h                           1/1       Running   0          8m
kube-scheduler-kube-apiserver-1            1/1       Running   0          37m
kube-scheduler-kube-apiserver-2            1/1       Running   0          15m
kube-scheduler-kube-apiserver-3            1/1       Running   0          14m

It would have been a lot easier to start from scratch and avoid ALL these hours of troubleshooting than to try to figure out this network problem. I just wish there were a script/method to reset the etcd DB so we did not have to rebuild from scratch. That would be an awesome tool, something like: kubeadm reset --etcd=https://etcd-cluster.control-service.com:2379

@neolit123 What do you think about an etcd reset option for an external etcd cluster?
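
Until something like that exists, one blunt workaround, sketched here assuming the etcd v3 API and TLS client certificates (the endpoint is the one above; the certificate paths are placeholders), is to delete the whole keyspace of the external cluster with etcdctl. Note that this wipes all Kubernetes state stored in etcd:

ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-cluster.control-service.com:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/client.crt \
  --key=/etc/etcd/pki/client.key \
  del "" --prefix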

haha! I know why the problem occurred!

kube-system   calico-etcd-h8h46                          1/1     Running   1          17h
kube-system   calico-kube-controllers-85cf9c8b79-q78b2   1/1     Running   2          17h
kube-system   calico-node-5pvsw                          2/2     Running   2          17h
kube-system   calico-node-7xvn9                          2/2     Running   2          17h
kube-system   calico-node-85j5x                          2/2     Running   3          17h
kube-system   coredns-576cbf47c7-cw8lr                   1/1     Running   1          17h
kube-system   coredns-576cbf47c7-hvt7z                   1/1     Running   1          17h
kube-system   etcd-k8s-node131                           1/1     Running   1          17h
kube-system   kube-apiserver-k8s-node131                 1/1     Running   1          17h
kube-system   kube-controller-manager-k8s-node131        1/1     Running   1          17h
kube-system   kube-proxy-458vk                           1/1     Running   1          17h
kube-system   kube-proxy-n852v                           1/1     Running   1          17h
kube-system   kube-proxy-p5d5g                           1/1     Running   1          17h
kube-system   kube-scheduler-k8s-node131                 1/1     Running   1          17h
kube-system   traefik-ingress-controller-fkhwk           1/1     Running   0          18m
kube-system   traefik-ingress-controller-kxr6v           1/1     Running   0          18m

Now the cluster is healthy. The reason: when deploying a cluster, the order of the steps should be:

  1. kubeadm init
  2. deploy the network plugin
  3. kubeadm join

I had mixed up step 2 and step 3! If someone has the same problem, you can try this order (rough sketch below).
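
In other words, the order that worked, as a rough sketch (the join parameters are placeholders; use the exact command printed by kubeadm init):

# on the first master
kubeadm init
# install the pod network add-on before joining the other nodes
kubectl apply -f <pod-network-manifest>.yaml   # calico in the cluster above, flannel in the original report
# then, on each additional node
kubeadm join <endpoint>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>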

have the same problem here … on 1.15.2

sorry for your troubles. i think your database got in a corrupted state for some reason and keeping it around was probably not a good idea. also this is hard to debug…

there are already ways to reset etcd (but on local nodes): https://groups.google.com/forum/#!topic/coreos-user/qcwLNqou4qQ

also sig-cluster-lifecycle (the maintainers of kubeadm) are working on a tool called etcdadm that will most likely have this functionality.

@AlexMorreale we ended up bouncing the coredns pods to another node and it worked. One other thing we changed (not sure whether it made an impact): we found out our apiserver pod was on hostNetwork, but its dnsPolicy was not ClusterFirstWithHostNet.
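
(For reference: "bouncing" the CoreDNS pods just means deleting them so the Deployment reschedules them, selecting by the standard kubeadm label, and the host-network DNS setting mentioned above is the pod-spec dnsPolicy field:)

$ kubectl -n kube-system delete pod -l k8s-app=kube-dns   # the Deployment recreates the pods

spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # needed for cluster DNS resolution when hostNetwork is true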