istio: [multi-cluster] Intermittent "HTTP/1.1 503 Service Unavailable" when sending requests across clusters
Bug description: [multi-cluster] "HTTP/1.1 503 Service Unavailable" is sometimes raised when sending requests across clusters.
- If I do not call a service for more than 1 minute and then call an API of that service again (with servicename.namespacename.global), the failure rate is high: about 20 failures out of 1,200 requests.
- If I call the API of this service continually, every 2 seconds (with servicename.namespacename.global), the failure rate is very low: only 6 out of 100,000.
Environment: (Istio 1.4.6, replicated control planes, AWS, using a public NLB as the ingressgateway service)
Affected features (please put an X in all that apply)
[x] Multi Cluster [ ] Virtual Machine [ ] Multi Control Plane
Version (include the output of istioctl version --remote, kubectl version, and helm version if you used Helm)
istio 1.4.6
How was Istio installed? istio multi-cluster
Environment where bug was observed (cloud vendor, OS, etc): AWS EKS 1.16
Pre-condition: Create two clusters and install Istio in the multi-cluster configuration.
Step 1: Create nginxclient in cluster1.
kubectl run --image=johnzheng/nginx nginxclient --generator=run-pod/v1
Step 2: Create the nginx2 service in cluster2.
kubectl run nginx2 --image=johnzheng/nginx --expose=true --port=80
Step 3: Create a ServiceEntry for nginx2.default.global.
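For reference, a .global ServiceEntry for this setup looks roughly like the sketch below (following the 1.4-era replicated-control-planes docs). The name, the 240.0.0.2 address, and the gateway address placeholder are assumptions and must be adapted to the actual clusters:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: nginx2-default-global
spec:
  hosts:
  - nginx2.default.global
  location: MESH_INTERNAL
  ports:
  - name: http
    number: 80
    protocol: http
  resolution: DNS
  addresses:
  - 240.0.0.2          # arbitrary non-routable address used only for DNS interception
  endpoints:
  - address: <cluster2-ingressgateway-address>   # placeholder: NLB hostname of cluster2's istio-ingressgateway
    ports:
      http: 15443      # multi-cluster mTLS port on the remote gateway
```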
Step 4: Verify the basic path works: kubectl exec -ti nginxclient -- curl -i nginx2.default.global
Step 5: Write a script like the one below (sleep 600 keeps the service idle between calls, reproducing the "idle for more than a minute" case):
#!/bin/bash
for i in {1..200}; do
  date >> testresult.txt
  kubectl exec -ti nginxclient -- curl -i nginx2.default.global/ >> testresult.txt
  sleep 600
done
Run it in the background: nohup ./hi.sh > /dev/null 2>&1 &
Step 6: Summarize the test result:
cat testresult.txt |grep "HTTP/1.1 200 OK"|wc -l #Pass Count
cat testresult.txt |grep "503"|wc -l #Fail Count
You will find the failure rate may exceed 20%.
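The two grep counts above can be wrapped into a small helper that also prints the failure rate; summarize is a hypothetical name, not part of the original report:

```shell
# summarize: print pass/fail counts and the failure rate for a result file
# produced by the curl loop above.
summarize() {
  local file=$1
  local pass fail total
  pass=$(grep -c "HTTP/1.1 200 OK" "$file")   # successful responses
  fail=$(grep -c "503" "$file")               # 503 Service Unavailable responses
  total=$((pass + fail))
  echo "pass=$pass fail=$fail"
  [ "$total" -gt 0 ] && echo "fail rate: $((fail * 100 / total))%"
}
```

Run it as `summarize testresult.txt` after the loop finishes.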
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 34 (19 by maintainers)
Any updates?
Hey, I think I found a solution: we can use httpRetry via a VirtualService. I have the same error but with a different use case: automatically communicating over mTLS with a custom certificate (self-signed CA / public CA). In the first cycle, I use this configuration to force mTLS and add the certificate via an annotation.
I analyzed it and kept getting 503 connection termination/reset. I found that we can use HTTPRetry to force the proxy to retry sending the request. So the VirtualService will look like this:

Is there anyone else who has hit this issue? Can you suggest a workaround? Thanks.
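The httpRetry workaround described above can be sketched roughly as follows; the resource name and the retry values are assumptions, not the commenter's exact config:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: nginx2-retry        # hypothetical name
spec:
  hosts:
  - nginx2.default.global
  http:
  - retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: connect-failure,refused-stream,gateway-error
    route:
    - destination:
        host: nginx2.default.global
```

This does not remove the underlying connection resets, but it lets the sidecar retry transparently instead of surfacing a 503 to the caller.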
Tried the scripts above; still the same problem on EKS with Istio 1.7. Can you tell me what version of coredns you are using?

I am also facing the same issue on Istio 1.7. The same setup was working fine on Istio 1.6. Curl command failure
Kube-dns configmap
Coredns configmap
Istiodns address
From the same cluster
From other clusters
SE entry
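For comparison, in the .global setup the coredns configmap usually carries a stub-domain block along these lines (the istiocoredns service IP is a placeholder; the exact block depends on the CoreDNS version):

```
global:53 {
    errors
    cache 30
    forward . <istiocoredns-cluster-ip>
}
```

A misconfigured or missing stub domain makes .global resolution fail intermittently, which can present as the 503s discussed here.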
@stevenctl in my use case, we have multiple clusters. I modified .global and changed it to .<cluster-name>. Here is where things get complicated:
cluster A -> cluster B
cluster A -> cluster C
cluster A -> cluster D
Clusters B, C, and D all have the same service name and namespace. So from cluster A, I call each service like this:
and from cluster A, using this feature, we won’t load balance across multiple clusters. You can check the PR https://github.com/istio/istio/pull/25349 for how we change .global to a specific cluster name.

It turns out I had an issue with my initial coredns configuration, which I’ve resolved. With that fix, this all seems to be working for me on GKE with both Istio 1.6 and 1.7.
I’ve reworked my test environment so that I only deploy HelloWorld to one cluster, making it more similar to the issue here. Here are the scripts I’m using:
Example:
And the output:
Take a look and maybe give it a run to see if this works for you.
I think I may have replicated the issue with Istio 1.7 on GKE. It’s been a bit slow-going since I had to remember how all of this works 😃. Still digging into it.
We’re moving away from this method (.global) of configuring multi-primary. There are a lot of reasons for this (general complexity, odd load-balancing behavior, loss of locality LB, etc.). The new installation instructions for 1.8 use a simpler approach that lets cluster.local work across the mesh in the same way it does for primary-remote. Load balancing works properly in this model as well, since each control plane has all of the actual endpoints from every cluster and can balance evenly across them.

I haven’t tried 1.6 yet, but I’ll do that next week (off for the long weekend).

@stevenctl do you recall any changes post-1.6 that may have broken the old .global configuration for multi-primary?