kubernetes: "don't require a load balancer between cluster and control plane and still be HA"
Can I connect kube-proxy to multiple api servers like:
--master=http://master1:8080,http://master2:8080,http://master3:8080
?
About this issue
- State: open
- Created 9 years ago
- Reactions: 66
- Comments: 121 (84 by maintainers)
@thockin I got that, but some (many) cloud providers do not have an internal load balancer solution. In those cases one has to set up their own using something like haproxy or nginx, which adds a layer that would not be necessary if we could specify the API servers as comma-separated options.
@mikedanese Note this is not a nice-to-have feature. It affects the high availability of a cluster since the kubelet and the kube-proxy are not able to talk to the replicated masters in an HA fashion. Hacks like the one suggested by @klausenbusk involving non-kube components are required to make things work properly.
My frustration stems from the fact that we have a change that attempts to fix this problem (#40674, pending since January 30!), but for whatever reasons nobody is willing to act on it, one way or another. As a contributor I’d be even more frustrated by this non-action.
It would also be great if the components could take a DNS name, and if the DNS name returns multiple A records, they would be treated as the OP described.
So if I had the domain “master” with 3 A records (192.168.0.1, 192.168.0.2, 192.168.0.3), passing in:
--master=http://master:8080
would be the same as
--master=http://192.168.0.1:8080,http://192.168.0.2:8080,http://192.168.0.3:8080
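For illustration only, a minimal Go sketch of the expansion being suggested, using the standard resolver (the hostname, port and function name are hypothetical):

```go
package main

import (
	"fmt"
	"net"
)

// expandMaster resolves a single --master hostname into the equivalent
// explicit list of endpoints, one per A/AAAA record.
func expandMaster(host, port string) ([]string, error) {
	addrs, err := net.LookupHost(host) // returns every A/AAAA record for the name
	if err != nil {
		return nil, err
	}
	endpoints := make([]string, 0, len(addrs))
	for _, a := range addrs {
		endpoints = append(endpoints, "http://"+net.JoinHostPort(a, port))
	}
	return endpoints, nil
}

func main() {
	eps, err := expandMaster("master", "8080")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(eps) // e.g. [http://192.168.0.1:8080 http://192.168.0.2:8080 http://192.168.0.3:8080]
}
```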
@mikedanese I think this is an area where we should enable the simple case of letting all Kubernetes components talk to a list of API Servers. Ideally users can take a different approach, but this feels like low hanging fruit. More advanced solutions, beyond allowing each component to take a list of API servers, should be a much lower priority IMO.
I am currently working on this. I am going to reuse the PR abandoned by @mikedanese, with some changes in the pool implementation.
A prototype of a client-go that supports HA with multiple apiservers: https://github.com/aojea/client-go-multidialer
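The repository above contains the real implementation; purely as a rough sketch of the general idea, a custom dialer can be plugged into client-go’s rest.Config Dial hook and walk a list of endpoints (the endpoint list, port and helper name below are assumptions, and the actual multidialer does considerably more, e.g. re-resolving and tracking healthy backends):

```go
package main

import (
	"context"
	"math/rand"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// multiDial returns a Dial function that tries each apiserver endpoint,
// starting at a random offset so clients spread across apiservers, until
// one accepts the connection.
func multiDial(endpoints []string) func(ctx context.Context, network, addr string) (net.Conn, error) {
	return func(ctx context.Context, network, _ string) (net.Conn, error) {
		d := &net.Dialer{Timeout: 5 * time.Second}
		start := rand.Intn(len(endpoints))
		var lastErr error
		for i := 0; i < len(endpoints); i++ {
			conn, err := d.DialContext(ctx, network, endpoints[(start+i)%len(endpoints)])
			if err == nil {
				return conn, nil
			}
			lastErr = err
		}
		return nil, lastErr
	}
}

func main() {
	endpoints := []string{"10.0.0.1:6443", "10.0.0.2:6443", "10.0.0.3:6443"} // hypothetical masters
	cfg := &rest.Config{
		Host: "https://kubernetes.example:6443", // must still match the serving certificate
		Dial: multiDial(endpoints),
	}
	_, _ = kubernetes.NewForConfig(cfg)
}
```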
It’s the same sort of HA that many LB implementations use, though. Failure is detected via a timeout (client controlled) or a TCP closed peer. All of these are needed anyway (it could be alive but slow or even deadlocked with the socket still open) or the apiserver process could crash mid-operation or the whole apiserver machine could crash.
Is this still the case in January 2021? We are working around this by using nginx per node, but it would simplify the infra if we could remove the nginx.
@mikedanese It really is depressing that 2 years after this bug was opened no resolution exists. The inability to take action on this is just astonishing for an open source project, and I’ve worked in many large projects, including GCC back in the day, and quite a few while I was at Google.
Are you saying this bug is not high enough on the feature tracking list? Even though Kubernetes specifically states it is highly available? After 2 years?
On my bare metal cluster I currently use a hacked up solution in which different clients talk to different servers. If one of them dies for whatever reason, I lose 20% of my cluster because of this stupid problem. This is not a highly available configuration by any means.
Sorry for the rant, I’m starting to lose faith in Google’s engineering mojo.
@ehashman the use-case is to take a load-balancer (which is not always an option in every deployment, or which becomes a single point of failure we want to avoid) out of the equation for kubelet to talk to the apiserver, especially for the masters. There is a very very real use-case in OpenShift land. Talk to networking and api people. We are heavily suffering from customers who bring their own load-balancer and it is not reliable. The cluster becomes unstable and it is very hard to debug because the LB at the customer side is a black-box, and the customer LB team says “the load-balancer is working”.
On an architecture level we have a spectrum: from centralized control of apiserver routing via a load-balancer, to an intelligent client that knows which node IP to talk to, watches the apiserver’s /readyz and switches over when one node IP errors a lot.

Please don’t close an old ticket and pass the blame somewhere else! Rather, open a ticket with client-go and reference this ticket there. I worked on Google’s infrastructure many moons ago; it’s embarrassing nobody there has made this a priority after these many years.
Yes.
Most people access it through a LB because there isn’t a choice to do otherwise…
One does not simply use a load balancer for HA, see:
This seems natural, to rely on the existing components. In turn, though, HA proxies or load balancers require reliable split-brain detection, i.e. the VIP must never be running on several nodes at once, and the heartbeat cannot prevent that corner case on its own. Therefore, here comes Pacemaker with STONITH as well, and once again we have a huge operational overhead in the reference architecture. Please, don’t do that.
It’s an illusion that kubelet supports multiple api-servers. In fact it simply picks the first one to talk to (which is kind of ridiculous to me). Ref #19152
I think this needs to be as simple as possible. Its primary purpose is for things that themselves co-implement the rest of the services: kubelet, kube-proxy, etc.
As such, any attempt to do client-side consensus seems like a step too far. KISS.
Hand-waving: the servers field holds strings which are used as the target of net.Dial, which can be hostnames or IPv4 or IPv6 addresses. At startup, pick one randomly. In case of a disconnect, use the same one or pick a new one. In case of a connection failure, cycle through them one by one until one works or all have failed, at which point you abort or wait-and-retry, whichever makes sense for the application. DNS and dual-stack are handled by the net stack.
Do we really need anything more complicated than that?
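A minimal sketch of that hand-waving in Go, assuming a plain TCP dial and a hard-coded server list (the type and function names are illustrative, not an existing kubelet/kube-proxy API):

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// serverPool holds net.Dial targets ("host:port" strings; hostnames, IPv4 or
// IPv6). Pick one at random at startup; on a connection failure, cycle
// through the rest until one works or all have failed.
type serverPool struct {
	servers []string
	current int
}

// newServerPool assumes at least one server is configured.
func newServerPool(servers []string) *serverPool {
	return &serverPool{servers: servers, current: rand.Intn(len(servers))}
}

func (p *serverPool) dial() (net.Conn, error) {
	var lastErr error
	for i := 0; i < len(p.servers); i++ {
		idx := (p.current + i) % len(p.servers)
		conn, err := net.DialTimeout("tcp", p.servers[idx], 5*time.Second)
		if err == nil {
			p.current = idx // stick with the working server for future dials
			return conn, nil
		}
		lastErr = err
	}
	// All failed: the caller aborts or waits and retries, whichever makes
	// sense for the application.
	return nil, fmt.Errorf("all apiservers unreachable, last error: %w", lastErr)
}

func main() {
	pool := newServerPool([]string{"master1:8080", "master2:8080", "master3:8080"})
	if conn, err := pool.dial(); err == nil {
		defer conn.Close()
		fmt.Println("connected to", conn.RemoteAddr())
	}
}
```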
We should check the different solutions, experiment, compare and decide…
/retitle “don’t require a load balancer between cluster and control plane and still be HA”
Let me give this a shot /assign
Same cluster; I think the request may be better phrased as “don’t require a load balancer between cluster and control plane and still be HA”
“open to” means I’m willing to read one, not that I’m going to write one 😃
(It’s also not a promise to approve such a KEP!)
Is this still the case in August 2020? We are working around this by using haproxy per node, but it would simplify the infra if we could remove the haproxy.
@yongtang using a per-worker-node nginx that is aware of multiple back-ends is an interesting idea. It would have to know of them - which an ELB gives you out of the box on a per-AutoScaling Group basis - but as you pointed out, not all that hard.
Still, it is wasteful (yak-shaving) that I need yet another component to get kube working, a component that, well, everyone else needs. Easily solved if the kubelet and kube-proxy just had it built in.
You’re right that it is not “completed”.
Re-skimming it, it seems the common thread is that kubeconfig could/should support multiple IPs. There are lots of caveats to think through, but “it’s just software, how hard can it be”. That said, it needs an owner.
alt-svc was rejected, but I don’t think there was any real objection to having multiple explicit IPs provided to kube-proxy? That said, kubelet’s --api-servers flag was removed in favor of a KubeConfig file and I don’t think that supports multiple IPs. So if we want to support multi-dial, we should support it for any kubeconfig.

To clarify: this issue is probably a placeholder for something we would solve in client-go rather than in these two special clients. We have the very same topic for kube-controller-manager and kube-scheduler.
/remove-sig cluster-lifecycle /sig node network
I’m sorry your experience has been challenging. Engineers across the community (well beyond Google) have identified loadbalancing built-in to the client as a nice-to-have as there is a well documented and generally available workaround by using a TCP loadbalancer. While possibilities have been discussed, consensus has not formed around any proposed solution. If this is something that is urgent for your team, please come join us to develop it. We are happy to accept contributions.
Google has intentionally opened Kubernetes to community governance which looks very different from a company led project. My previous comment was not meant to imply anything about the priority of the feature, only that it is not currently being tracked by the community established feature process documented here.
I/we hope you come participate.
Looking at the code, and spelunking into the golang HTTP infrastructure, it looks like if we set up a DNS name with multiple A records, golang should resolve them all, and will try to connect to each of them in turn, and only fail if they all fail. So I believe that what @elsonrodriguez suggested is already the case.
For example, net.Dialer https://golang.org/src/net/dial.go#L24 has the following comment:
and the Dial function calls dialSerial: https://golang.org/src/net/dial.go#L236
I also verified experimentally by killing all the apiservers in my cluster, and bringing them up one at a time. kubectl would fail if and only if all were down.
@andrewmichaelsmith would love to hear what you saw that was contrary to this (or if we have any other evidence that this isn’t true!)
Admittedly I only checked go 1.6.2, so this might not always have been the case.
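A rough way to observe this behavior with the standard library (the hostname and port are hypothetical):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// The resolver returns every A record for the name...
	addrs, err := net.LookupHost("master")
	fmt.Println("resolved:", addrs, err)

	// ...and Dial walks them serially, failing only if all of them fail.
	conn, err := net.DialTimeout("tcp", "master:8080", 10*time.Second)
	if err != nil {
		fmt.Println("all resolved addresses failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to:", conn.RemoteAddr()) // which backend actually answered
}
```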
I think all the client needs to do is pick an endpoint at random when establishing a new connection, and retry with another one if it fails.
I don’t understand this whole discussion of quorums, timeouts, or connecting to all endpoints. The quorum is handled by the control-plane itself; there is no need to move it to the client. The different endpoints might point to the same apiserver(s) anyway, if you have load-balancers in the middle.
Let’s stick with layer 4 load-balancing, which is how HA deployments work today (albeit with external load-balancers).
FWIW, there’s this idea, which could allow keeping a single field: https://github.com/Jille/grpc-multi-resolver/blob/master/README.md
@lavalamp if you want to read it 😃 https://github.com/kubernetes/enhancements/pull/3034
Sounds reasonable, I’m open to a proposal / KEP / PR.
/remove-good-first-issue

This is not easy; we should not treat it as a good first issue. The goal is to solve the apiserver HA problem from the client side so that, instead of installing a LB in front of the apiserver, the client is able to choose an available apiserver. Connecting to the “kubernetes.default” service is not an option because that service depends on kubelet and kube-proxy. There can be different solutions: an overall solution, consisting in modifying client-go to implement load balancing from the client as grpc does, is clearly a sig-apimachinery decision. Modifying kubelet or kube-proxy to implement load balancing from the client falls to the sigs that own those components, sig-node for kubelet and sig-network for kube-proxy, though it would be ideal that, if there is a solution, both components use the same approach.
We run nginx on every node and let it handle the load balancing, it works very well (idea stolen from kubespray). You should be able to do the same.
https://github.com/kubernetes/kubernetes/issues/54306
Could support for kube-proxy snapshotting the default kubernetes service be a solution? Both kubelet and kube-proxy could point at the svc ip and iptables or ipvs could handle the actual load balancing? I think most of the code is already in place to support it?
It would complicate the initial node bootstrapping a bit, as you would have to place a checkpoint kube-proxy file and checkpoint kube service network description to get a node rolling for the first time. But not a huge lift? It would be self hosting from then on out.
At least with ELBs, they do not seem to handle the long poll very well. I occasionally see kubelets/kube-proxies losing the event stream and not getting updates for a long period of time, followed by streamwatcher errors then a burst of updates. I’m not sure what the right approach is yet…
FYI #30588
I think that relying on the Go resolver is OK to achieve this feature in a minimal manner.
But, ideally, if we document this then we should introspect the connection and tell users via a log line that you are connecting to “api-server.example [172.18.5.1]” vs “api-server.example [172.18.5.2]”. Otherwise, we are going to get into supportability issues and confusion on this one.
Aside: We hit stuff like this all of the time on etcd and have detection code logging the “full identity” of the node we are connecting to.
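A minimal sketch of that kind of introspection using net/http/httptrace (the URL is hypothetical):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"net/http/httptrace"
)

func main() {
	req, _ := http.NewRequest("GET", "https://api-server.example:6443/healthz", nil)
	// Log which resolved backend the request actually connected to, e.g.
	// "connecting to api-server.example:6443 [172.18.5.1:6443]".
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("connecting to %s [%s]", req.URL.Host, info.Conn.RemoteAddr())
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(context.Background(), trace))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Printf("request failed: %v", err)
		return
	}
	resp.Body.Close()
}
```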
@justinsb is right, with multiple DNS A records it works just fine. However afaik (tested it a few months ago on version 1.1.2) you’re gonna get lots of error messages and a few problems as soon as one goes down. To work around this, I wrote a small script on the DNS server which checks the availability of each api-server. If one goes down, it removes the corresponding DNS entry. If it comes up again, it creates the entry again. Works really well on 3 API-servers on baremetal 😃
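For reference, a sketch of the health-check half of such a script in Go (the addresses and port are hypothetical; actually adding and removing the DNS records depends on the DNS server in use and is omitted here):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	apiservers := []string{"192.168.0.1", "192.168.0.2", "192.168.0.3"} // hypothetical
	client := &http.Client{
		Timeout: 3 * time.Second,
		// Probe only; do not skip verification for real traffic.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	for _, ip := range apiservers {
		resp, err := client.Get(fmt.Sprintf("https://%s:6443/healthz", ip))
		healthy := err == nil && resp.StatusCode == http.StatusOK
		if resp != nil {
			resp.Body.Close()
		}
		action := "remove"
		if healthy {
			action = "keep"
		}
		fmt.Printf("%s healthy=%v -> %s its A record\n", ip, healthy, action)
	}
}
```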