kubernetes: Suddenly getting TLS Handshake timeout on most requests to the api server

I’m unable to use kubectl because of TLS handshake timeout.

kubectl get pods
error: couldn't read version from server: Get https://master-ip/api: net/http: TLS handshake timeout

edit: After several tries I got one to go through, so it’s not happening 100% of the time. But the error has been recorded over 4k times in the logs.

I’ve also noticed the error showing up often in /var/log/kube-apiserver.log, for requests coming from the minions.

I0831 12:35:19.950945       8 logs.go:41] http: TLS handshake error from 172.20.0.65:58143: EOF
I0831 12:35:20.599641       8 logs.go:41] http: TLS handshake error from 172.20.0.148:53774: EOF
I0831 12:35:20.601809       8 logs.go:41] http: TLS handshake error from 172.20.0.240:41027: EOF
I0831 12:35:22.272985       8 logs.go:41] http: TLS handshake error from 172.20.0.25:42939: EOF
I0831 12:35:23.210921       8 logs.go:41] http: TLS handshake error from 172.20.0.240:41034: EOF
I0831 12:35:24.520112       8 logs.go:41] http: TLS handshake error from 172.20.0.65:58172: EOF

I also noticed a lot of dial tcp 127.0.0.1:8080: connection refused errors in the logs, for various API endpoints.

E0831 12:36:54.588879       5 reflector.go:136] Failed to list *api.ResourceQuota: Get http://127.0.0.1:8080/api/v1/resourcequotas: dial tcp 127.0.0.1:8080: connection refused
E0831 12:36:54.589161       5 reflector.go:136] Failed to list *api.Secret: Get http://127.0.0.1:8080/api/v1/secrets?fieldSelector=type%3Dkubernetes.io%2Fservice-account-token: dial tcp 127.0.0.1:8080: connection refused
E0831 12:36:54.628359       5 reflector.go:136] Failed to list *api.ServiceAccount: Get http://127.0.0.1:8080/api/v1/serviceaccounts: dial tcp 127.0.0.1:8080: connection refused
E0831 12:36:54.628471       5 reflector.go:136] Failed to list *api.LimitRange: Get http://127.0.0.1:8080/api/v1/limitranges: dial tcp 127.0.0.1:8080: connection refused
E0831 12:36:54.628608       5 reflector.go:136] Failed to list *api.Namespace: Get http://127.0.0.1:8080/api/v1/namespaces: dial tcp 127.0.0.1:8080: connection refused
E0831 12:36:54.628669       5 reflector.go:136] Failed to list *api.Namespace: Get http://127.0.0.1:8080/api/v1/namespaces: dial tcp 127.0.0.1:8080: connection refused

If it’s relevant, the ephemeral filesystem on /mnt/ephemeral/kubernetes is 99% full on one of my minions. Most of my kube-system pods (kube-ui, kube-dns, elasticsearch, etc.) are running on that minion. It’s full because of the elasticsearch and heapster emptyDir volumes, and the mount is only 3.75GB. This filesystem being full caused other problems this weekend, including containers from the kube-dns pod shutting down, which in turn brought down all of my other production pods.
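
A quick way to confirm which emptyDir volumes are consuming the space on the affected minion (a sketch only; the mount path is the one from the setup above, and the pods/*/volumes layout assumes the kubelet’s default directory structure):

df -h /mnt/ephemeral/kubernetes
# largest emptyDir volumes under the kubelet's pod directories
sudo du -sh /mnt/ephemeral/kubernetes/pods/*/volumes/kubernetes.io~empty-dir/* | sort -h | tail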

About this issue

  • Original URL
  • State: closed
  • Created 9 years ago
  • Reactions: 45
  • Comments: 72 (30 by maintainers)

Most upvoted comments

Has anyone been able to solve it? We are also hitting the same issue.

Please re-post your question to stackoverflow.

We are trying to consolidate the channels to which questions for help/support are posted so that we can improve our efficiency in responding to your requests, and to make it easier for you to find answers to frequently asked questions and how to address common use cases.

We regularly see messages posted in multiple forums, with the full response thread only in one place or, worse, spread across multiple forums. Also, the large volume of support issues on github is making it difficult for us to use issues to identify real bugs.

The Kubernetes team scans stackoverflow on a regular basis, and will try to ensure your questions don’t go unanswered.

Before posting a new question, please search stackoverflow for answers to similar questions, and also familiarize yourself with:

Again, thanks for using Kubernetes.

The Kubernetes Team

@lavalamp we’re running into the same problem; unfortunately, it doesn’t seem that any SO post has been created for this issue.

We’re running on GKE without any manual modifications, should it still be considered a “question” on SO, or is this an issue related to Kubernetes?

In our case, we had a script that used a Kubernetes service account to access the API via kubectl, but we suddenly started seeing failures because the command couldn’t reach the API server:

$ kubectl get pod
error: couldn't read version from server: Get https://10.135.240.1:443/api: net/http: TLS handshake timeout

Setting https_proxy caused this issue for me.
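
If a proxy is the suspect, a quick sanity check is to see which proxy variables kubectl is inheriting and, if needed, exempt the master address (a sketch; master-ip is the placeholder used above, and unsetting https_proxy entirely is the bluntest fix):

env | grep -i _proxy
export no_proxy="$no_proxy,master-ip"
kubectl get pods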

@rroopreddy Your master ran out of memory. Force reboot it via your cluster.

On the master, after it comes back up:

sudo apt-get update
sudo apt-get install swapspace

This will automatically scale swap so that the master never runs out of memory and locks you out.
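
To confirm afterwards that swap is actually in place (a minimal check; on older distros swapon --show may not exist, in which case swapon -s is the fallback):

sudo swapon --show
free -m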

Tip: Ensure your proxy settings aren’t messed up.

Verify the path to the master, e.g.:

$ kubectl cluster-info
Kubernetes master is running at https://10.196.10.229:6443
...

$ curl -v https://10.196.10.229:6443/

The output may point you to the source of the issue.

@freehan this has been a P1 for a year. I mentioned it in my post too. Any plans 😃 ?

Just started seeing this again @roberthbailey

Oct 21 06:31:43 ip-172-20-0-115 kubelet[25950]: E1021 06:31:43.245829   25950 kubelet.go:2259] Error updating node status, will retry: Put https://172.20.0.9/api/v1/nodes/ip-172-20-0-115.us-west-2.compute.internal/status: net/http: TLS handshake timeout
Oct 21 06:31:46 ip-172-20-0-115 kubelet[25950]: E1021 06:31:46.303538   25950 reflector.go:206] pkg/kubelet/kubelet.go:211: Failed to watch *api.Service: Get https://172.20.0.9/api/v1/watch/services?resourceVersion=1904: net/http: TLS handshake timeout

Any ideas?

“TLS handshake timeout” is an extremely generic error message that indicates something is wrong in the networking path between your client and the server. Common causes include the following (a few quick checks follow the list):

  • wrong proxy settings
  • apiserver is not running
  • apiserver is not healthy and load balancer is therefore not sending it traffic
  • load balancer did something confusing with your traffic
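
A few quick checks to narrow down which of these you are hitting (a sketch, assuming the master address reported by kubectl cluster-info; substitute your real host and port, and note that /healthz is used here purely as a reachability probe):

# any proxy in the path?
env | grep -i _proxy
# TCP connectivity to the apiserver / load balancer (nc flags vary slightly by implementation)
nc -vz master-ip 443
# does the TLS handshake itself complete?
openssl s_client -connect master-ip:443 </dev/null
# does the apiserver respond at all? (-k skips certificate verification)
curl -vk https://master-ip/healthz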

This issue is 5 years old and is probably not a good place to get help.

Not sure if this helps you guys, but this happens to me only when I’m trying to access the cluster over university, library, or coffee-shop wifi.

It seems fine with office and home ISPs. It appears to be a firewall that some of them are running (especially libraries that monitor traffic).

We hit the same issue. We wrote a custom controller using client-go and deployed it in the k8s cluster. Our custom controller cannot List & Watch from the API server due to the TLS handshake timeout error.

We use the following code snippet to create a clientset.

    // assumes the standard client-go imports:
    //   "k8s.io/client-go/kubernetes"
    //   "k8s.io/client-go/rest"

    // create the in-cluster config (uses the pod's service-account token
    // and the apiserver address injected into the pod's environment)
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err.Error())
    }
    // create the clientset used for the List & Watch calls
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err.Error())
    }

@dchen1107 we are on an OpenStack-based private cloud and face the same issue

We hit the same issue. It puts the nodes into NotReady state quite often.

Hey, I am running Kubernetes on AWS and getting this error continuously.

It was working fine for two days; after that, when I tried a kubectl command, it gave me this error:

error: couldn't read version from server: Get https://54.76.223.85/api: dial tcp 54.76.223.85:443: connection refused

After trying a couple of times, it gave me a different error:

error: couldn't read version from server: Get https://54.76.223.85/api: net/http: TLS handshake timeout

Then, suddenly, after some time Kubernetes worked fine again and started giving me results. But it gave me the same error again after a couple of minutes. This error occurs continuously. Why is this happening? It seems to be working, but I get errors most of the times I execute the command.

I am seeing the exact same errors today on our internal GitLab k8s cluster. All kubectl commands were working just half an hour ago; it has been erroring out for 5 minutes at a time and then coming back. Other than network I/O plummeting to the floor according to the Google Cloud dashboard (see below), no significant events happened in the past half hour to the best of my knowledge. Could this be caused by a networking I/O issue?

[screenshot: Google Cloud dashboard showing network I/O dropping, 2016-07-27 6:30 PM]

$ kubectl --context gke_internal-gitlab_us-central1-a_gitlab-k8s get svc
Unable to connect to the server: net/http: TLS handshake timeout
$ kubectl --context gke_internal-gitlab_us-central1-a_gitlab-k8s get svc
Unable to connect to the server: net/http: TLS handshake timeout
$ kubectl --context gke_internal-gitlab_us-central1-a_gitlab-k8s get ep
Unable to connect to the server: net/http: TLS handshake timeout
$ kubectl --context gke_internal-gitlab_us-central1-a_gitlab-k8s get pod
The connection to the server 104.197.xxx.xxx was refused - did you specify the right host or port?
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.5", GitCommit:"25eb53b54e08877d3789455964b3e97bdd3f3bce", GitTreeState:"clean"}
Server Version: version.Info{Major:"1", Minor:"2", GitVersion:"v1.2.5", GitCommit:"25eb53b54e08877d3789455964b3e97bdd3f3bce", GitTreeState:"clean"}
$ gcloud version
Google Cloud SDK 118.0.0

bq 2.0.24
bq-nix 2.0.24
core 2016.07.18
core-nix 2016.03.28
gcloud 
gsutil 4.19
gsutil-nix 4.18
kubectl 
kubectl-darwin-x86_64 1.2.5

Running k8s on GCP. kubectl is installed via gcloud components install kubectl

More updates:

So the TLS handshake timeout error was intermittent for most of yesterday. Today I am getting different erroneous behavior on the EXACT SAME k8s cluster. This time, it seems like kubectl is stuck in an infinite loop even though the underlying curl calls are getting HTTP 200 from the API endpoint, and I had to ctrl-c out of that loop. Without the debugging flag, the kubectl delete job command simply hangs indefinitely.

$ kubectl --context gke_internal-gitlab_us-central1-a_gitlab-k8s delete job letsencrypt --v=100
I0728 14:04:16.028036   40684 loader.go:229] Config loaded from file /Users/ye/.kube/config
I0728 14:04:16.181131   40684 round_trippers.go:267] curl -k -v -XGET  -H "User-Agent: kubectl/v1.2.5 (darwin/amd64) kubernetes/25eb53b" -H "Authorization: Basic Zm9vOmJhcg==" -H "Accept: application/json, */*" https://104.197.xxx.xxx/api
I0728 14:04:16.367478   40684 round_trippers.go:286] GET https://104.197.xxx.xxx/api 200 OK in 186 milliseconds
I0728 14:04:16.367527   40684 round_trippers.go:292] Response Headers:
I0728 14:04:16.367544   40684 round_trippers.go:295]     Content-Type: application/json
I0728 14:04:16.367626   40684 round_trippers.go:295]     Date: Thu, 28 Jul 2016 18:04:16 GMT
I0728 14:04:16.367670   40684 round_trippers.go:295]     Content-Length: 132
I0728 14:04:16.367888   40684 request.go:870] Response Body: {"kind":"APIVersions","versions":["v1"],"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"104.197.xxx.xxx"}]}
I0728 14:04:16.368470   40684 round_trippers.go:267] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.2.5 (darwin/amd64) kubernetes/25eb53b" -H "Authorization: Basic Zm9vOmJhcg==" https://104.197.xxx.xxx/apis
I0728 14:04:16.421439   40684 round_trippers.go:286] GET https://104.197.xxx.xxx/apis 200 OK in 52 milliseconds
I0728 14:04:16.421469   40684 round_trippers.go:292] Response Headers:
I0728 14:04:16.421475   40684 round_trippers.go:295]     Content-Type: application/json
I0728 14:04:16.421481   40684 round_trippers.go:295]     Date: Thu, 28 Jul 2016 18:04:16 GMT
I0728 14:04:16.421489   40684 round_trippers.go:295]     Content-Length: 766
I0728 14:04:16.421525   40684 request.go:870] Response Body: {"kind":"APIGroupList","groups":[{"name":"autoscaling","versions":[{"groupVersion":"autoscaling/v1","version":"v1"}],"preferredVersion":{"groupVersion":"autoscaling/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"104.197.xxx.xxx"}]},{"name":"batch","versions":[{"groupVersion":"batch/v1","version":"v1"}],"preferredVersion":{"groupVersion":"batch/v1","version":"v1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"104.197.xxx.xxx"}]},{"name":"extensions","versions":[{"groupVersion":"extensions/v1beta1","version":"v1beta1"}],"preferredVersion":{"groupVersion":"extensions/v1beta1","version":"v1beta1"},"serverAddressByClientCIDRs":[{"clientCIDR":"0.0.0.0/0","serverAddress":"104.197.xxx.xxx"}]}]}
I0728 14:04:16.423637   40684 round_trippers.go:267] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.2.5 (darwin/amd64) kubernetes/25eb53b" -H "Authorization: Basic Zm9vOmJhcg==" https://104.197.xxx.xxx/apis/extensions/v1beta1/namespaces/default/jobs/letsencrypt
I0728 14:04:16.495678   40684 round_trippers.go:286] GET https://104.197.xxx.xxx/apis/extensions/v1beta1/namespaces/default/jobs/letsencrypt 200 OK in 72 milliseconds
I0728 14:04:16.495702   40684 round_trippers.go:292] Response Headers:
I0728 14:04:16.495708   40684 round_trippers.go:295]     Content-Type: application/json
I0728 14:04:16.495715   40684 round_trippers.go:295]     Date: Thu, 28 Jul 2016 18:04:16 GMT
I0728 14:04:16.495762   40684 request.go:870] Response Body: {"kind":"Job","apiVersion":"extensions/v1beta1","metadata":{"name":"letsencrypt","namespace":"default","selfLink":"/apis/extensions/v1beta1/namespaces/default/jobs/letsencrypt","uid":"964e65af-54ec-11e6-975a-42010af0000f","resourceVersion":"2227489","creationTimestamp":"2016-07-28T17:56:19Z","labels":{"controller-uid":"964e65af-54ec-11e6-975a-42010af0000f","job-name":"letsencrypt","name":"gitlab"},"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"kind\":\"Job\",\"apiVersion\":\"batch/v1\",\"metadata\":{\"name\":\"letsencrypt\",\"creationTimestamp\":null},\"spec\":{\"template\":{\"metadata\":{\"name\":\"gitlab\",\"creationTimestamp\":null,\"labels\":{\"name\":\"gitlab\"}},\"spec\":{\"volumes\":[{\"name\":\"le-webroot\",\"emptyDir\":{\"medium\":\"Memory\"}},{\"name\":\"le-certificates\",\"emptyDir\":{\"medium\":\"Memory\"}},{\"name\":\"le-logs\",\"emptyDir\":{\"medium\":\"Memory\"}}],\"containers\":[{\"name\":\"le-nginx\",\"image\":\"gcr.io/internal-gitlab/nginx:ye-1469482840\",\"command\":[\"nginx\",\"-c\",\"/etc/nginx/conf.d/letsencrypt.conf\"],\"ports\":[{\"name\":\"http\",\"containerPort\":80,\"protocol\":\"TCP\"}],\"resources\":{\"requests\":{\"cpu\":\"10m\"}},\"volumeMounts\":[{\"name\":\"le-webroot\",\"mountPath\":\"/var/lib/letsencrypt\"},{\"name\":\"le-certificates\",\"mountPath\":\"/etc/letsencrypt\"},{\"name\":\"le-logs\",\"mountPath\":\"/var/log/letsencrypt\"}]},{\"name\":\"le-certbot\",\"image\":\"gcr.io/internal-gitlab/letsencrypt:ye-1469726125\",\"command\":[\"./letsencrypt.sh\"],\"env\":[{\"name\":\"LE_WWW_DOMAIN\",\"value\":\"gitlab.mvnctl.net\"},{\"name\":\"MY_POD_IP\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"status.podIP\"}}},{\"name\":\"CRT_SECRET_KEY\",\"value\":\"gitlab.crt\"},{\"name\":\"KEY_SECRET_KEY\",\"value\":\"gitlab.key\"}],\"resources\":{\"requests\":{\"cpu\":\"15m\"}},\"volumeMounts\":[{\"name\":\"le-webroot\",\"mountPath\":\"/var/lib/letsencrypt\"},{\"name\":\"le-certificates\",\"mountPath\":\"/etc/letsencrypt\"},{\"name\":\"le-logs\",\"mountPath\":\"/var/log/letsencrypt\"}]}],\"restartPolicy\":\"Never\"}}},\"status\":{}}"}},"spec":{"parallelism":0,"completions":1,"selector":{"matchLabels":{"controller-uid":"964e65af-54ec-11e6-975a-42010af0000f"}},"autoSelector":true,"template":{"metadata":{"name":"gitlab","creationTimestamp":null,"labels":{"controller-uid":"964e65af-54ec-11e6-975a-42010af0000f","job-name":"letsencrypt","name":"gitlab"}},"spec":{"volumes":[{"name":"le-webroot","emptyDir":{"medium":"Memory"}},{"name":"le-certificates","emptyDir":{"medium":"Memory"}},{"name":"le-logs","emptyDir":{"medium":"Memory"}}],"containers":[{"name":"le-nginx","image":"gcr.io/internal-gitlab/nginx:ye-1469482840","command":["nginx","-c","/etc/nginx/conf.d/letsencrypt.conf"],"ports":[{"name":"http","containerPort":80,"protocol":"TCP"}],"resources":{"requests":{"cpu":"10m"}},"volumeMounts":[{"name":"le-webroot","mountPath":"/var/lib/letsencrypt"},{"name":"le-certificates","mountPath":"/etc/letsencrypt"},{"name":"le-logs","mountPath":"/var/log/letsencrypt"}],"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"IfNotPresent"},{"name":"le-certbot","image":"gcr.io/internal-gitlab/letsencrypt:ye-1469726125","command":["./letsencrypt.sh"],"env":[{"name":"LE_WWW_DOMAIN","value":"gitlab.mvnctl.net"},{"name":"MY_POD_IP","valueFrom":{"fieldRef":{"apiVersion":"v1","fieldPath":"status.podIP"}}},{"name":"CRT_SECRET_KEY","value":"gitlab.crt"},{"name":"KEY_SECRET_KEY","value":"gitlab.key"}],"r
esources":{"requests":{"cpu":"15m"}},"volumeMounts":[{"name":"le-webroot","mountPath":"/var/lib/letsencrypt"},{"name":"le-certificates","mountPath":"/etc/letsencrypt"},{"name":"le-logs","mountPath":"/var/log/letsencrypt"}],"terminationMessagePath":"/dev/termination-log","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","securityContext":{}}}},"status":{"conditions":[{"type":"Complete","status":"True","lastProbeTime":"2016-07-28T18:00:02Z","lastTransitionTime":"2016-07-28T18:00:02Z"}],"startTime":"2016-07-28T17:56:19Z","completionTime":"2016-07-28T18:00:02Z","active":1,"succeeded":1}}
^C

A longer log capture is here

@roberthbailey If I restart the kubelet, it works perfectly fine. But after 2 or 3 days of continuous running, it goes back into the NotReady state.

I am guessing it’s something related to a connection-pool issue somewhere. Any thoughts?

I saw issue https://github.com/kubernetes/kubernetes/issues/17641 and tried checking for CLOSE_WAIT sockets on the kubelet node, but it returned nothing.
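
For anyone wanting to run the same check, something along these lines (a sketch; ss ships with iproute2, and netstat is the older fallback):

sudo ss -tnp state close-wait
sudo netstat -tanp | grep CLOSE_WAIT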

Once we set up the cluster, a minion goes into NotReady state after 2 or 3 days with the following error: Error updating node status, error getting node “{node-name}”: Get https://{ip}:6443/api/v1/nodes/{node-name}: dial tcp {ip}:6443: network is unreachable

To see if it’s a network connectivity issue, I SSHed into the minion and curled the exact URL above, and it works. So network connectivity is ruled out.

Any pointers on why the node goes into NotReady state and is unable to post its status to the API server?
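
Worth noting: a plain curl from the minion only proves TCP/TLS reachability, not that the kubelet’s own client credentials still work. A check closer to what the kubelet does (a sketch; the kubeconfig path is an assumption and varies by how the cluster was deployed):

# reachability only; -k skips certificate verification
curl -vk https://{ip}:6443/healthz
# exercise the apiserver with the kubelet's client credentials (path is an assumption)
sudo kubectl --kubeconfig /var/lib/kubelet/kubeconfig get nodes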

@roberthbailey So, on Google Cloud it has been working fine for 5 days. And last Friday I started on AWS as well, this time with the client on an AWS instance. Still the same error.

error: couldn't read version from server: Get https://52.19.115.139/api: dial tcp 52.19.115.139:443: connection refused

Can you please tell me how I can SSH to the master node on AWS so that I can attach the kube-apiserver.log for you?

Hi,

I used Vagrant on a Mac and I’m getting the error below. Can someone help me?

Waiting for each minion to be registered with cloud provider
error: couldn’t read version from server: Get https://10.245.1.2/api: net/http: TLS handshake timeout

@roberthbailey I haven’t been looking for it but I haven’t seen it either. I’ve also upgraded the cluster to HEAD multiple times.

I think the issue is caused by the master running out of memory and no longer responding to requests. That has been fixed in my cluster.
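
If you suspect the same thing on your master, the kernel OOM killer usually leaves a trace in the logs (a quick check; exact log locations vary by distro):

sudo dmesg -T | grep -iE 'out of memory|oom-killer' | tail
sudo journalctl -k | grep -i oom | tail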