istio: Multi-Cluster/Multi-Network - Cannot use a hostname-based gateway for east-west traffic
Bug description
Following the guide Install Multi-Primary on different networks, everything installs as expected without errors and is running in both clusters. For `secret/cacerts`
I am using the example certificate material from `samples/certs/*.pem` in both cluster1 and cluster2.
When I attempt to verify the installation using the guide Verify the installation, requests are not routed to the remote cluster as expected. I am only getting responses from the service on the local cluster:
# From Cluster 1 [where helloworld v1 is deployed]
$ while true; do kubectl exec --context="${CTX_CLUSTER1}" -n sample -c sleep "$(kubectl get pod --context="${CTX_CLUSTER1}" -n sample -l \
app=sleep -o jsonpath='{.items[0].metadata.name}')" -- curl -s helloworld.sample:5000/hello; done
Hello version: v1, instance: helloworld-v1-578dd69f69-r9lkz
Hello version: v1, instance: helloworld-v1-578dd69f69-r9lkz
Hello version: v1, instance: helloworld-v1-578dd69f69-r9lkz
Hello version: v1, instance: helloworld-v1-578dd69f69-r9lkz
Hello version: v1, instance: helloworld-v1-578dd69f69-r9lkz
...
# From Cluster 2 [where helloworld v2 is deployed]
$ while true; do kubectl exec --context="${CTX_CLUSTER2}" -n sample -c sleep "$(kubectl get pod --context="${CTX_CLUSTER2}" -n sample -l \
app=sleep -o jsonpath='{.items[0].metadata.name}')" -- curl -s helloworld.sample:5000/hello; done
Hello version: v2, instance: helloworld-v2-776f74c475-h5j2q
Hello version: v2, instance: helloworld-v2-776f74c475-h5j2q
Hello version: v2, instance: helloworld-v2-776f74c475-h5j2q
Hello version: v2, instance: helloworld-v2-776f74c475-h5j2q
Hello version: v2, instance: helloworld-v2-776f74c475-h5j2q
...
`istioctl proxy-config endpoint` output for the sleep pod in cluster1 and cluster2, filtered to the helloworld destination service:
# Cluster 1
$ istioctl -n sample --context=${CTX_CLUSTER1} proxy-config endpoint "$(kubectl get pod --context="${CTX_CLUSTER1}" -n sample -l app=sleep -o jsonpath='{.items[0].metadata.name}')" | grep helloworld
10.100.1.12:5000 HEALTHY OK outbound|5000||helloworld.sample.svc.cluster.local
# Cluster 2
$ istioctl -n sample --context=${CTX_CLUSTER2} proxy-config endpoint "$(kubectl get pod --context="${CTX_CLUSTER2}" -n sample -l app=sleep -o jsonpath='{.items[0].metadata.name}')" | grep helloworld
10.100.2.188:5000 HEALTHY OK outbound|5000||helloworld.sample.svc.cluster.local
It seems like something may be missing from the docs or example configs. I understand there are tests for the docs/examples, which is why I’ve been troubleshooting my own cluster, but it still seems like something small is missing.
[X] Docs [X] Installation [X] Networking [ ] Performance and Scalability [ ] Extensions and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure [ ] Upgrade
Expected behavior: I expected to be able to follow the guide and get the behavior that the guide describes.
Steps to reproduce the bug: Follow the guides for the multi-primary, multi-network install and verify the install.
Version (include the output of `istioctl version --remote`, `kubectl version --short`, and `helm version --short` if you used Helm)
$ istioctl version
client version: 1.8.0
control plane version: 1.8.0
data plane version: 1.8.0 (4 proxies)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
Istio Operator installed with `istioctl operator init`, and the rest of the Istio installation in `istio-system` was done following the steps in the guide [using example manifests & scripts to compile example manifests].
Environment where the bug was observed (cloud vendor, OS, etc) AWS EKS with Kubernetes v1.17 Istio 1.8.0 on Mac
$ uname -a
Darwin OAK-MAC-HLJHD2 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64
Additionally, please consider running `istioctl bug-report` and attaching the generated cluster-state tarball to this issue. Refer to the cluster state archive for more details.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 20
- Comments: 83 (45 by maintainers)
I fixed this problem by adding the temporary ELB IPs to this part of the ConfigMap:
`k edit ConfigMap -n istio-system`
Afterwards, you need to kill the istiod pod to reload the config.
This works for me, but I am waiting for a better solution and for the feature request to support CNAMEs in gateway addresses, because load balancers in AWS use Elastic IPs/dynamic IPs!
Feature Request: +1
@nmittler @sonnysideup @stevenctl … Got it working by manually defining `meshNetworks`, so hopefully that helps confirm what is needed. I am surprised nobody else using AWS EKS and Istio 1.8 has run into this issue. Let me know if there are any other details I can provide that would help. Thanks for your assistance in the meantime! Here are my notes from the test [all of these resources have been destroyed already but can be replicated easily]:
- Get Cluster 1 eastwestgateway Host/IP
- Get Cluster 2 eastwestgateway Host/IP
- Desired change to `data.meshNetworks`
- Define Cluster 1’s `configmap/istio` `data.meshNetworks` manually
- Define Cluster 2’s `configmap/istio` `data.meshNetworks` manually
- istiod logs confirming ConfigMap changes are picked up
- It works – we see responses from v1 and v2!
This may be the issue. If you run
`kubectl --context="${CTX_CLUSTER1}" get svc istio-eastwestgateway -n istio-system -oyaml`
is there an IP address at either `status.loadBalancer.ingress` or under `spec.externalIPs`? Those are the only two address types we allow for auto-gateway discovery (via that `topology.istio.io/network` label on the Service).
If you know the IP, you may be able to use a legacy type of configuration, `meshNetworks`, to manually specify the addresses to use for the gateways. This would be included in the install operator for all clusters and would need to be identical in every cluster in the mesh.
More info: https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#MeshNetworks
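For example, a manually specified `meshNetworks` (placed in `data.meshNetworks` of the `istio` ConfigMap, or the equivalent operator field) might look like the following sketch. The network/cluster names follow the guide’s samples, the addresses are placeholders for the east-west gateway IPs, and 15443 is the east-west gateway’s mTLS port:

```yaml
# Placeholder meshNetworks sketch: substitute your real gateway IPs.
networks:
  network1:
    endpoints:
    - fromRegistry: cluster1
    gateways:
    - address: 192.0.2.10   # cluster1 east-west gateway
      port: 15443
  network2:
    endpoints:
    - fromRegistry: cluster2
    gateways:
    - address: 192.0.2.20   # cluster2 east-west gateway
      port: 15443
```

As noted above, this config must be identical in every cluster in the mesh.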
Internal, and yes, they won’t change. We manage creation of those NLBs via the ALB ingress controller. We run this job in cron so that, if we ever need to delete/recreate this infra, we aren’t in the business of manually updating these configmaps and service entries. The Job only updates if there are any changes to the IPs.
It can be a simple k8s job as opposed to cron as well.
@bryankaraffa I’m glad you got this working. I think we definitely need to do:
https://istio.io/latest/docs/reference/config/istio.mesh.v1alpha1/#Network-IstioNetworkGateway
The `address` field should support an externally resolvable hostname. I think we should be able to support auto-discovering the LoadBalancer hostname field as well; I’ll open a separate issue for adding that, and hopefully it will be available in a future release. For the time being, we should add a section to the doc explaining this alternative config when there isn’t an IP.
@tr-srij can you share your script? I wrote a script in Go, but I need more time for testing.
I don’t know the capability in AWS, hence this question. On Azure AKS, a LoadBalancer-type Kubernetes Service supports statically assigning an IP (both private and public) using the `loadBalancerIP` property. Does AWS support a static IP on the LoadBalancer-type Service used for the east-west gateway?
@nmittler – confirming there are no IPs under `status.loadBalancer.ingress` or `spec.externalIPs`.
This seems similar/related to what I ran into with a multi-cluster setup on Istio 1.7 on AWS EKS, on the step where you get the remote cluster ingress hostname… The docs suggested using `-o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}'` but I had to use `-o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}'` to get a valid value for the `ServiceEntry` in the next step. Same on my current cluster as well:
@rinormaloku FYI the reason the Service must exist in both clusters is so that the Service’s hostname resolves to some IP just to get the request out of the client workload and to its sidecar proxy.
@sonnysideup @bryankaraffa
The cluster name in the `context` field of the remote secret is not what istiod will use for the cluster name. Rather, the data key under the secret’s string data should be verified. These are likely the same, but worth checking. Do the cluster IDs in your logs match what you have in your IstioOperator config? Or are those generated names that might be assumed from your local kubeconfig file? Also curious what happens if you restart istiod.
Here is my workaround for the problem as a shell script running in a `CronJob`: https://github.com/markszabo/istio-crosscluster-workaround-for-eks. Any feedback is welcome!

I’m experiencing this same issue
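For illustration, a minimal sketch of the resolution step such a job performs (this is not the linked script; the function names are hypothetical, `localhost` stands in for the real `*.elb.amazonaws.com` name, and `network1`/`cluster1` are the sample names from the guide):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: resolve the east-west gateway's NLB hostname to
# its current IPs and render a data.meshNetworks fragment that a
# CronJob could patch into the `istio` ConfigMap.
set -euo pipefail

resolve_gateway_ips() {
  # getent returns the A records currently behind the hostname
  getent ahostsv4 "$1" | awk '{print $1}' | sort -u
}

mesh_networks_fragment() {
  local network="$1" cluster="$2"
  shift 2
  printf 'networks:\n  %s:\n    endpoints:\n    - fromRegistry: %s\n    gateways:\n' \
    "$network" "$cluster"
  local ip
  for ip in "$@"; do
    # 15443 is the east-west gateway's mTLS port
    printf '    - address: %s\n      port: 15443\n' "$ip"
  done
}

# localhost stands in for the NLB hostname here
mapfile -t ips < <(resolve_gateway_ips localhost)
mesh_networks_fragment network1 cluster1 "${ips[@]}"
```

A real job would run this per cluster, compare the rendered fragment against the current ConfigMap, and only patch (and restart istiod) when the IPs change.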
@markszabo not sure what’s up, but your cert still seems borked and Chrome is rejecting it. I downloaded the cert with openssl (visible here) and it looks pretty borked.

Back to the matter at hand though – I’m not sure if anyone else has hit this when applying the workaround to a multi-primary deployment, but I allocated EIPs to two NLBs here and updated the associated config, and now `proxy status endpoints` is borked for the sample workloads; I see the envoy error `LbEndpointValidationError.LoadBalancingWeight: value must be greater than or equal to 1` in the istiod logs (here from west). Fwiw, `kubectl get cm -n istio-system istio -ojsonpath='{.data.meshNetworks}'`:
I ran into this issue too, and ended up writing down the workaround described above: https://szabo.jp/2021/09/22/multicluster-istio-on-eks/
@carnei-ro that’s a great suggestion – I’ve put it in this doc and hopefully we can prioritize the impl soon
https://docs.google.com/document/d/1Sbg6hyO9NAOagtHxsg6H-OlQKoTpy8zaXVEphlj5R-M/edit?usp=sharing&resourcekey=0-Aqzm_-tOzxlN46Qijpf7cw
By using the TTL it would be possible to use ELB Classic (the CNAME points to “ephemeral” IPs).
I’m running into the same issue. I followed the installation guide, but unfortunately it did not work. I’m using kubernetes-kind. Is it possible to achieve communication between clusters with kind and the Istio eastwest-gateway?
Although if it’s static, users could also just resolve the IP themselves and set `externalIPs` on their gateway Service, which doesn’t hide the underlying issue. After doing a bit of research in envoy issues/docs, I confirmed that there isn’t a way to get DNS resolution with EDS clusters in envoy… the two options I’ve been able to find are:
It might make sense to have a feature-flag to re-enable the eager DNS resolution in the short term, for users who are confident that the IP is static.
@bryankaraffa I got this working somewhat in AWS, but let me re-describe my full setup for context:

1. Allocate NLBs for Gateways

I configured the `serviceAnnotations` section of my eastwest-gateway manifest to create an NLB and associate my pre-allocated EIPs. After a brief period of time, the NLBs became active and the target groups are registered as “healthy”.

2. Update `istio.istio-system` ConfigMap

Manually updating the `meshNetworks` section (https://github.com/istio/istio/issues/29359#issuecomment-738970767) of these configmaps is still required. What’s different here is that the EIPs are stable and will NOT change over time. This appears to be a reasonable setup for a production deployment and should work until support for ELB hostnames becomes available.

@bryankaraffa Looks like that validation added in https://github.com/istio/istio/pull/23311 conflicts with other logic we have in pilot. It does still seem risky to use a hostname here the way hostname support is currently implemented (eagerly resolving DNS rather than resolving it at the proxy).
I followed the procedure exactly as described in the 1.8 multi-primary multi-network guide and it worked without any issues. I was able to successfully complete the multicluster verification as described here --> https://istio.io/latest/docs/setup/install/multicluster/verify/ As a matter of fact, the whole procedure worked perfectly the very first time, and within an hour I was up and running with multi-cluster.
A few things I found out during the procedure (they are documented well in the instructions):
You have to create a stub namespace and stub service in cluster1 (same namespace and same service name) to access the service in cluster2. This is well documented in the procedure, but just pointing out that it is an important step. So is the remote secret: cluster1 should be able to access the kube API on cluster2 and vice versa. If you are running hosted Kubernetes, make sure your kube API is accessible from both clusters; if you have IP ACLs on the kube API, make sure you allow access from the other cluster. Also, cluster1 should be able to access the east-west gateway on cluster2 and vice versa. Once again, IP ACLs, network security groups, etc. need to allow this.
Yeah, I created the service in both the primary and remote clusters, and I deployed the actual workload to the remote cluster. I’m not able to address the desired service without creating it inside the primary; I guess service mirroring is not supported. Whenever you do that, do you see endpoints created inside your primary cluster?
After I create the helloworld app inside the remote cluster, I see the following logs from istiod inside the primary cluster:
This looks suspicious and makes me think something is awry; otherwise I’d end up with endpoints being created inside the primary cluster, no?
Unfortunately, that step has already been performed. 😢
Nothing else readily comes to mind regarding setup steps that I may have missed. I can see that the remote helloworld deployment has been synced inside the primary cluster, and the sleep pod inside the primary cluster where I’m performing verification is also up-to-date.
The only option I can think of right now is to enable DEBUG logging for istiod / istio-proxy and see if anything pops up there.