kubernetes: "don't require a load balancer between cluster and control plane and still be HA"
Can I connect kube-proxy to multiple api servers like:
--master=http://master1:8080,http://master2:8080,http://master3:8080
?
About this issue
- State: open
- Created 9 years ago
- Reactions: 66
- Comments: 121 (84 by maintainers)
@thockin I got that, but some (many) cloud providers do not have an internal load balancer solution. In those cases one has to set up their own using something like haproxy or nginx, which adds a layer that would not be necessary if we could specify the API servers as comma-separated options.
@mikedanese Note this is not a nice-to-have feature. It affects the high availability of a cluster since the kubelet and the kube-proxy are not able to talk to the replicated masters in an HA fashion. Hacks like the one suggested by @klausenbusk involving non-kube components are required to make things work properly.
My frustration stems from the fact that we have a change that attempts to fix this problem (#40674, pending since January 30!), but for whatever reasons nobody is willing to act on it, one way or another. As a contributor I’d be even more frustrated by this non-action.
It would also be great if the components could take a DNS name, and if the DNS name returns multiple A records, they would be treated as the OP described.
So if I had the domain “master” with 3 A records (192.168.0.1, 192.168.0.2, 192.168.0.3), passing in:
--master=http://master:8080
would be the same as
--master=http://192.168.0.1:8080,http://192.168.0.2:8080,http://192.168.0.3:8080
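For illustration only, a minimal Go sketch of the expansion being suggested, using the standard resolver (the hostname, port and function name are hypothetical):

```go
package main

import (
	"fmt"
	"net"
)

// expandMaster resolves a single --master hostname into the equivalent
// explicit list of endpoints, one per A/AAAA record.
func expandMaster(host, port string) ([]string, error) {
	addrs, err := net.LookupHost(host) // returns every A/AAAA record for the name
	if err != nil {
		return nil, err
	}
	endpoints := make([]string, 0, len(addrs))
	for _, a := range addrs {
		endpoints = append(endpoints, "http://"+net.JoinHostPort(a, port))
	}
	return endpoints, nil
}

func main() {
	eps, err := expandMaster("master", "8080")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(eps) // e.g. [http://192.168.0.1:8080 http://192.168.0.2:8080 http://192.168.0.3:8080]
}
```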
@mikedanese I think this is an area where we should enable the simple case of letting all Kubernetes components talk to a list of API Servers. Ideally users can take a different approach, but this feels like low hanging fruit. More advanced solutions, beyond allowing each component to take a list of API servers, should be a much lower priority IMO.
I am currently working on this. I am going to reuse the PR abandoned by @mikedanese, with some changes in the pool implementation.
A prototype of a client-go that supports HA with multiple apiservers: https://github.com/aojea/client-go-multidialer
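The repository above contains the real implementation; purely as a rough sketch of the general idea, a custom dialer can be plugged into client-go’s rest.Config Dial hook and walk a list of endpoints (the endpoint list, port and helper name below are assumptions, and the actual multidialer does considerably more, e.g. re-resolving and tracking healthy backends):

```go
package main

import (
	"context"
	"math/rand"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// multiDial returns a Dial function that tries each apiserver endpoint,
// starting at a random offset so clients spread across apiservers, until
// one accepts the connection.
func multiDial(endpoints []string) func(ctx context.Context, network, addr string) (net.Conn, error) {
	return func(ctx context.Context, network, _ string) (net.Conn, error) {
		d := &net.Dialer{Timeout: 5 * time.Second}
		start := rand.Intn(len(endpoints))
		var lastErr error
		for i := 0; i < len(endpoints); i++ {
			conn, err := d.DialContext(ctx, network, endpoints[(start+i)%len(endpoints)])
			if err == nil {
				return conn, nil
			}
			lastErr = err
		}
		return nil, lastErr
	}
}

func main() {
	endpoints := []string{"10.0.0.1:6443", "10.0.0.2:6443", "10.0.0.3:6443"} // hypothetical masters
	cfg := &rest.Config{
		Host: "https://kubernetes.example:6443", // must still match the serving certificate
		Dial: multiDial(endpoints),
	}
	_, _ = kubernetes.NewForConfig(cfg)
}
```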
It’s the same sort of HA that many LB implementations use, though. Failure is detected via a timeout (client controlled) or a TCP closed peer. All of these are needed anyway (it could be alive but slow or even deadlocked with the socket still open) or the apiserver process could crash mid-operation or the whole apiserver machine could crash.
Is this still the case in January 2021? We are working around this by using nginx per node, but it would simplify the infra if we could remove the nginx.
@mikedanese It really is depressing that 2 years after this bug was opened no resolution exists. The inability to take action on this is just astonishing for an open source project, and I’ve worked in many large projects, including GCC back in the day, and quite a few while I was at Google.
Are you saying this bug is not high enough on the feature tracking list? Even though Kubernetes specifically states it is highly available? After 2 years?
On my bare metal cluster I currently use a hacked up solution in which different clients talk to different servers. If one of them dies for whatever reason, I lose 20% of my cluster because of this stupid problem. This is not a highly available configuration by any means.
Sorry for the rant, I’m starting to lose faith in Google’s engineering mojo.
@ehashman the use-case is to take a load-balancer (which is not always an option in every deployment, or which becomes a single point of failure we want to avoid) out of the equation for kubelet to talk to the apiserver, especially for the masters. There is a very very real use-case in OpenShift land. Talk to networking and api people. We are heavily suffering from customers who bring their own load-balancer and it is not reliable. The cluster becomes unstable and it is very hard to debug because the LB at the customer side is a black-box, and the customer LB team says “the load-balancer is working”.
On an architecture level we have a spectrum: from centralized control of apiserver routing via a load-balancer, to an intelligent client that knows which node IP to talk to, watches the apiserver’s /readyz and switches over when one node IP errors a lot.

Please don’t close an old ticket and pass the blame somewhere else! Rather, open a ticket with client-go and reference this ticket there. I worked on Google’s infrastructure many moons ago; it’s embarrassing nobody there has made this a priority after these many years.
Yes.
Most people access it through a LB because there isn’t a choice to do otherwise…
One does not simply use a load balancer for HA, see:
This seems natural, to rely on the existing components. In turn, though, HA proxies or load balancers require reliable split-brain detection, i.e. the VIP must never be running on several nodes at once, and the heartbeat cannot prevent that corner case on its own. Therefore, here comes Pacemaker with STONITH as well, and once again we have a huge operational overhead in the reference architecture. Please, don’t do that.
It’s an illusion that kubelet supports multiple api-servers. In fact it simply picks the first one to talk to (which is kind of ridiculous to me). Ref #19152
I think this needs to be as simple as possible. Its primary purpose is for things that themselves co-implement the rest of the services: kubelet, kube-proxy, etc.
As such, any attempt to do client-side consensus seems like a step too far. KISS.
Hand-waving: the servers field holds strings which are used as the target of net.Dial, which can be hostnames or IPv4 or IPv6 addresses. At startup, pick one randomly. In case of a disconnect, use the same one or pick a new one. In case of a connection failure, cycle through them one by one until one works or all have failed, at which point you abort or wait-and-retry, whichever makes sense for the application. DNS and dual-stack are handled by the net stack.
Do we really need anything more complicated than that?
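A minimal sketch of that hand-waving in Go, assuming a plain TCP dial and a hard-coded server list (the type and function names are illustrative, not an existing kubelet/kube-proxy API):

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// serverPool holds net.Dial targets ("host:port" strings; hostnames, IPv4 or
// IPv6). Pick one at random at startup; on a connection failure, cycle
// through the rest until one works or all have failed.
type serverPool struct {
	servers []string
	current int
}

// newServerPool assumes at least one server is configured.
func newServerPool(servers []string) *serverPool {
	return &serverPool{servers: servers, current: rand.Intn(len(servers))}
}

func (p *serverPool) dial() (net.Conn, error) {
	var lastErr error
	for i := 0; i < len(p.servers); i++ {
		idx := (p.current + i) % len(p.servers)
		conn, err := net.DialTimeout("tcp", p.servers[idx], 5*time.Second)
		if err == nil {
			p.current = idx // stick with the working server for future dials
			return conn, nil
		}
		lastErr = err
	}
	// All failed: the caller aborts or waits and retries, whichever makes
	// sense for the application.
	return nil, fmt.Errorf("all apiservers unreachable, last error: %w", lastErr)
}

func main() {
	pool := newServerPool([]string{"master1:8080", "master2:8080", "master3:8080"})
	if conn, err := pool.dial(); err == nil {
		defer conn.Close()
		fmt.Println("connected to", conn.RemoteAddr())
	}
}
```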
We should check the different solutions, experiment, compare and decide…
/retitle “don’t require a load balancer between cluster and control plane and still be HA”
Let me give this a shot /assign
Same cluster; I think the request may be better phrased as “don’t require a load balancer between cluster and control plane and still be HA”
“open to” means I’m willing to read one, not that I’m going to write one 😃
(It’s also not a promise to approve such a KEP!)
Is this still the case in August 2020? We are working around this by using haproxy per node, but it would simplify the infra if we could remove the haproxy.
@yongtang using a per-worker-node nginx that is aware of multiple back-ends is an interesting idea. It would have to know of them - which an ELB gives you out of the box on a per-AutoScaling Group basis - but as you pointed out, not all that hard.
Still, it is wasteful (yak-shaving) that I need yet another component to get kube working, a component that, well, everyone else needs. Easily solved if the kubelet and kube-proxy just had it built in.
You’re right that it is not “completed”.
Re-skimming it, it seems the common thread is that kubeconfig could/should support multiple IPs. There are lots of caveats to think through, but “it’s just software, how hard can it be”. That said, it needs an owner.
alt-svc was rejected, but I don’t think there was any real objection to having multiple explicit IPs provided to kube-proxy? That said, kubelet’s --api-servers flag was removed in favor of a KubeConfig file and I don’t think that supports multiple IPs. So if we want to support multi-dial, we should support it for any kubeconfig.

To clarify: this issue is probably a placeholder for something we would solve in client-go rather than in these two special clients. We have the very same topic for kube-controller-manager and kube-scheduler.
/remove-sig cluster-lifecycle /sig node network
I’m sorry your experience has been challenging. Engineers across the community (well beyond Google) have identified loadbalancing built-in to the client as a nice-to-have as there is a well documented and generally available workaround by using a TCP loadbalancer. While possibilities have been discussed, consensus has not formed around any proposed solution. If this is something that is urgent for your team, please come join us to develop it. We are happy to accept contributions.
Google has intentionally opened Kubernetes to community governance which looks very different from a company led project. My previous comment was not meant to imply anything about the priority of the feature, only that it is not currently being tracked by the community established feature process documented here.
I/we hope you come participate.
Looking at the code, and spelunking into the golang HTTP infrastructure, it looks like if we set up a DNS name with multiple A records, golang should resolve them all, and will try to connect to each of them in turn, and only fail if they all fail. So I believe that what @elsonrodriguez suggested is already the case.
For example, net.Dialer https://golang.org/src/net/dial.go#L24 has the following comment:
and the Dial function calls dialSerial: https://golang.org/src/net/dial.go#L236
I also verified experimentally by killing all the apiservers in my cluster, and bringing them up one at a time. kubectl would fail if and only if all were down.
@andrewmichaelsmith would love to hear what you saw that was contrary to this (or if we have any other evidence that this isn’t true!)
Admittedly I only checked go 1.6.2, so this might not always have been the case.
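A rough way to observe this behavior with the standard library (the hostname and port are hypothetical):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The resolver returns every A record for the name...
	addrs, err := net.LookupHost("master")
	fmt.Println("resolved:", addrs, err)

	// ...and Dial walks them serially, failing only if all of them fail.
	conn, err := net.DialTimeout("tcp", "master:8080", 10*time.Second)
	if err != nil {
		fmt.Println("all resolved addresses failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to:", conn.RemoteAddr()) // which backend actually answered
}
```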
I think all the client needs to do is pick an endpoint at random when establishing a new connection, and retry with another one if it fails.
I don’t understand this whole discussion of quorums, timeouts, or connecting to all endpoints. The quorum is handled by the control-plane itself; there is no need to move it to the client. The different endpoints might point to the same apiserver(s) anyway, if you have load-balancers in the middle.
Let’s stick with layer 4 load-balancing, which is how HA deployments work today (albeit with external load-balancers).
FWIW, there’s this idea, which could allow keeping a single field: https://github.com/Jille/grpc-multi-resolver/blob/master/README.md
@lavalamp if you want to read it 😃 https://github.com/kubernetes/enhancements/pull/3034
Sounds reasonable, I’m open to a proposal / KEP / PR.
/remove-good-first-issue

This is not easy; we should not treat it as a good first issue. The goal is to solve the apiserver HA problem from the client side so that, instead of installing a LB in front of the apiserver, the client is able to choose an available apiserver. Connecting to the “kubernetes.default” service is not an option because that service depends on kubelet and kube-proxy. There can be different solutions: an overall solution, consisting in modifying client-go to implement load balancing from the client as grpc does, is clearly a sig-apimachinery decision. Modifying kubelet or kube-proxy to implement load balancing from the client falls to the sigs that own those components, sig-node for kubelet and sig-network for kube-proxy, though it would be ideal that, if there is a solution, both components use the same approach.
We run nginx on every node and let it handle the load balancing, it works very well (idea stolen from kubespray). You should be able to do the same.
https://github.com/kubernetes/kubernetes/issues/54306
Could support for kube-proxy snapshotting the default kubernetes service be a solution? Both kubelet and kube-proxy could point at the svc ip and iptables or ipvs could handle the actual load balancing? I think most of the code is already in place to support it?
It would complicate the initial node bootstrapping a bit, as you would have to place a checkpoint kube-proxy file and checkpoint kube service network description to get a node rolling for the first time. But not a huge lift? It would be self hosting from then on out.
At least with ELBs, they do not seem to handle the long poll very well. I occasionally see kubelets/kube-proxies losing the event stream and not getting updates for a long period of time, followed by streamwatcher errors then a burst of updates. I’m not sure what the right approach is yet…
FYI #30588
I think that relying on the Go resolver is OK to achieve this feature in a minimal manner.
But, ideally, if we document this then we should introspect the connection and tell users via a log line that you are connecting to “api-server.example [172.18.5.1]” vs “api-server.example [172.18.5.2]”. Otherwise, we are going to get into supportability issues and confusion on this one.
Aside: We hit stuff like this all of the time on etcd and have detection code logging the “full identity” of the node we are connecting to.
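A minimal sketch of that kind of introspection using net/http/httptrace (the URL is hypothetical):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
)

func main() {
	req, _ := http.NewRequest("GET", "https://api-server.example:6443/healthz", nil)
	// Log which resolved backend the request actually connected to, e.g.
	// "connecting to api-server.example:6443 [172.18.5.1:6443]".
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("connecting to %s [%s]", req.URL.Host, info.Conn.RemoteAddr())
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(context.Background(), trace))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Printf("request failed: %v", err)
		return
	}
	resp.Body.Close()
}
```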
@justinsb is right, with multiple DNS A records it works just fine. However afaik (tested it a few months ago on version 1.1.2) you’re gonna get lots of error messages and a few problems as soon as one goes down. To work around this, I wrote a small script on the DNS server which checks the availability of each api-server. If one goes down, it removes the corresponding DNS entry. If it comes up again, it creates the entry again. Works really well on 3 API-servers on baremetal 😃
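For reference, a sketch of the health-check half of such a script in Go (the addresses and port are hypothetical; actually adding and removing the DNS records depends on the DNS server in use and is omitted here):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	apiservers := []string{"192.168.0.1", "192.168.0.2", "192.168.0.3"} // hypothetical
	client := &http.Client{
		Timeout: 3 * time.Second,
		// Probe only; do not skip verification for real traffic.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	for _, ip := range apiservers {
		resp, err := client.Get(fmt.Sprintf("https://%s:6443/healthz", ip))
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		action := "remove"
		if healthy {
			action = "keep"
		}
		fmt.Printf("%s healthy=%v -> %s its A record\n", ip, healthy, action)
	}
}
```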