kubernetes: kubectl does not retry after TLS handshake timeout

What happened:

One of our three control plane IPs is unresponsive. On my local machine, kubectl sporadically lags for about 10 seconds but otherwise works fine. This is because the Go standard library divides the 30 second dial timeout across the 3 IPs, so when the dial to the first IP times out it falls back to the second one.

Further testing shows that if the entire TCP dial times out, then kubectl itself will retry.

However, our build server is behind a firewall. There, the TCP dial succeeds but the TLS handshake times out after 10 seconds. When this happens, kubectl treats the error as fatal and does not retry.

What you expected to happen:

kubectl should retry if the TLS handshake times out. (It should start over with a fresh TCP dial.)

How to reproduce it (as minimally and precisely as possible):

I don’t know how to force this issue to reproduce.

Anything else we need to know?:

Environment:

  • Kubernetes client and server versions (use kubectl version): v1.21.13 (client), v1.22.12 (server)
  • Cloud provider or hardware configuration: AWS EKS
  • OS (e.g: cat /etc/os-release): macOS 12.5.1

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 4
  • Comments: 22 (13 by maintainers)

Most upvoted comments

@brianpursley: Those labels are not set on the issue: triage/(so, triage/it, triage/can, triage/be, triage/re-triaged, triage/by, triage/api, triage/machinery)

In response to this:

/remove-sig cli
/remove-triage accepted (so it can be re-triaged by API Machinery)

Would be a very good feature. I’m getting frequent TLS timeouts from my k8s operator.