istio: [multi-cluster] Intermittent "HTTP/1.1 503 Service Unavailable" when sending requests across clusters.

Bug description: [multi-cluster] Intermittent "HTTP/1.1 503 Service Unavailable" when sending requests across clusters.

  • If I do not call a service for more than 1 minute and then call an API of that service again (via servicename.namespacename.global), the failure rate is high: around 20 failures out of 1200 requests.
  • If I call the API of that service continually, once every 2 seconds (via servicename.namespacename.global), the failure rate is very low: only 6 out of 100000.

Environment: Istio 1.4.6, replicated control planes, AWS, using a public NLB as the ingressgateway service.

Affected features (please put an X in all that apply)

[x] Multi Cluster [ ] Virtual Machine [ ] Multi Control Plane

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm) istio 1.4.6

How was Istio installed? istio multi-cluster

Environment where bug was observed (cloud vendor, OS, etc) AWS EKS1.16


Pre-condition: Create two clusters and install Istio in the multi-cluster configuration.

Step 1: Create nginxclient in cluster1.

kubectl run --image=johnzheng/nginx nginxclient --generator=run-pod/v1

Step 2: Create nginx2 service in cluster2

kubectl run nginx2  --image=johnzheng/nginx --expose=true --port=80

Step 3: Create a service entry for nginx2.default.global.

Step 4: Verify the basic path works: kubectl exec -ti nginxclient -- curl -i nginx2.default.global

Step 5: Write a script as below:

#!/bin/bash
for i in {1..200}; 
do
  date>> testresult.txt;
  kubectl exec -ti nginxclient -- curl -i nginx2.default.global/ >> testresult.txt;
  sleep 600;
done;

Test with nohup ./hi.sh > /dev/null 2>&1 &

Step 6: Summarize the test result:

cat testresult.txt |grep "HTTP/1.1 200 OK"|wc -l   #Pass Count
cat testresult.txt |grep "503"|wc -l   #Fail Count
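The two grep commands above can be combined into a small sketch that also computes the failure rate. The sample data below is made up purely for illustration; in a real run you would point it at testresult.txt instead:

```shell
# Sketch: count passes/fails and compute the failure rate.
# sample_result.txt stands in for testresult.txt here.
cat > sample_result.txt <<'EOF'
HTTP/1.1 200 OK
HTTP/1.1 503 Service Unavailable
HTTP/1.1 200 OK
HTTP/1.1 200 OK
EOF
pass=$(grep -c "HTTP/1.1 200 OK" sample_result.txt)   # pass count
fail=$(grep -c "503" sample_result.txt)               # fail count
awk -v p="$pass" -v f="$fail" \
  'BEGIN { printf "pass=%d fail=%d rate=%.1f%%\n", p, f, 100*f/(p+f) }'
# -> pass=3 fail=1 rate=25.0%
```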

You will find the failure rate may exceed 20%.
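The report does not include the Step 3 service entry itself. Modeled on the httpbin ServiceEntry a commenter posts later in this thread, it would look roughly like the sketch below; the virtual IP and the cluster2 ingress gateway address are placeholders, not values from the original report:

```yaml
# Sketch only: ServiceEntry for nginx2.default.global in cluster1,
# following the replicated-control-planes .global pattern.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: nginx2-default
spec:
  hosts:
  - nginx2.default.global
  location: MESH_INTERNAL
  ports:
  - name: http1
    number: 80
    protocol: http
  resolution: DNS
  addresses:
  # virtual IP in the 240.0.0.0/4 range, unique per remote service
  - 240.0.0.3
  endpoints:
  # placeholder: cluster2's ingress gateway (NLB) hostname, port 15443
  - address: <cluster2-ingressgateway-nlb-hostname>
    ports:
      http1: 15443
```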

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 34 (19 by maintainers)

Most upvoted comments

Any updates?

Hey, I think I found a solution: we can use httpRetry via a VirtualService. I had the same error but with a different use case: communicating automatically over mTLS with a custom certificate (self-signed CA / public CA). In the first cycle, I use this configuration to force mTLS:

---
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: <domain>-external-mtls-se
  namespace: <namespace>
spec:
  hosts:
  - <domain>
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
  location: MESH_EXTERNAL
  exportTo:
  - "."
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: <domain>-external-mtls-dr
  namespace: <namespace>
spec:
  host: <domain>
  subsets:
  - name: tls-origination
    trafficPolicy:
      portLevelSettings:
      - port:
          number: 443
        tls:
          mode: MUTUAL
          clientCertificate: <path>/tls.crt
          privateKey: /<path>/tls.key
          caCertificates: <path>/ca.crt
          sni: <domain>
  exportTo:
  - "."
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: <domain>-external-mtls-vs
  namespace: <namespace>
spec:
  hosts:
  - <domain>
  http:
  - match:
    - port: 80
    route:
    - destination:
        host: <domain>
        subset: tls-origination
        port:
          number: 443
  exportTo:
  - "."

and add the certificate via annotations:

    sidecar.istio.io/userVolumeMount: '[{"name":"<domain>-mtls-cert", "mountPath":"<path>", "readonly":true}]'
    sidecar.istio.io/userVolume: '[{"name":"<domain>-mtls-cert", "secret":{"secretName":"<domain>-mtls-cert"}}]'

My analysis showed it kept getting 503 connection termination/reset errors. I found out that we can use HTTPRetry to force retrying the request, so the VirtualService will look like this:

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: <domain>-external-mtls-vs
  namespace: <namespace>
spec:
  hosts:
  - <domain>
  http:
  - match:
    - port: 80
    route:
    - destination:
        host: <domain>
        subset: tls-origination
        port:
          number: 443
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx
  exportTo:
  - "."

Has anyone else met this issue? Can you suggest a workaround? Thanks.


Tried the scripts above; still the same problem on EKS with Istio 1.7. Can you tell me what version of coredns you are using?

 curl 172.25.70.123:8000/ip  -vv
*   Trying 172.25.70.123:8000...
* Connected to 172.25.70.123 (172.25.70.123) port 8000 (#0)
> GET /ip HTTP/1.1
> Host: 172.25.70.123:8000
> User-Agent: curl/7.69.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Fri, 11 Sep 2020 11:53:28 GMT
< server: envoy
< x-envoy-upstream-service-time: 0
<
* Connection #0 to host 172.25.70.123 left intact
upstream connect error or disconnect/reset before headers. reset reason: connection failure/ #

I am also facing the same issue on Istio 1.7. The same setup worked fine on Istio 1.6. Curl command failure:

curl -H "Host: httpbin.bar.global"  httpbin.bar.global:8000 -I
HTTP/1.1 503 Service Unavailable
content-length: 91
content-type: text/plain
date: Wed, 02 Sep 2020 15:25:48 GMT
server: envoy

Kube-dns configmap

apiVersion: v1
data:
  stubDomains: |
    {"global": ["10.100.141.243"]}
kind: ConfigMap
metadata:
  creationTimestamp: 2020-08-24T09:00:06Z
  name: kube-dns
  namespace: kube-system

Coredns configmap


apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    global:53 {
        errors
        cache 30
        forward . 10.100.141.243:53
    }
kind: ConfigMap
metadata:
  labels:
    eks.amazonaws.com/component: coredns
    k8s-app: kube-dns
  name: coredns
  namespace: kube-system

Istiodns address

$ kubectl get svc -n istio-system istiocoredns -o jsonpath={.spec.clusterIP}
10.100.141.243

From the same cluster

root@nginx-pod:/# nslookup httpbin.bar.global
Server:		169.254.20.10
Address:	169.254.20.10#53

** server can't find httpbin.bar.global: NXDOMAIN

From other clusters

nslookup httpbin.bar.global
Server:		169.254.20.10
Address:	169.254.20.10#53

Name:	httpbin.bar.global
Address: 240.0.0.2
** server can't find httpbin.bar.global: NXDOMAIN

SE entry

apiVersion: v1
items:
- apiVersion: networking.istio.io/v1beta1
  kind: ServiceEntry
  metadata:
    generation: 2
    name: httpbin-bar
    namespace: foo
    resourceVersion: "948088"
    selfLink: /apis/networking.istio.io/v1beta1/namespaces/foo/serviceentries/httpbin-bar
    uid: 867a214e-2561-4a48-936d-0c41640dfafb
  spec:
    addresses:
    - 240.0.0.2
    endpoints:
    - address: a256e6f45146d41d5ac47dc62943fe15-c24bfa177ac58a99.elb.eu-west-1.amazonaws.com
      ports:
        http1: 15443
    hosts:
    - httpbin.bar.global
    location: MESH_INTERNAL
    ports:
    - name: http1
      number: 8000
      protocol: http
    resolution: DNS
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@stevenctl in my use case, we have multiple clusters. I modified .global and changed it to .<cluster-name>. Here is where things get complicated.

cluster A -> cluster B
cluster A -> cluster C
cluster A -> cluster D

Clusters B, C, and D all have the same service name and namespace, so from cluster A I call each service like this:

service.namespace.cluster-b
service.namespace.cluster-c
service.namespace.cluster-d

With this naming, calls from cluster A won't load balance across multiple clusters. You can check PR https://github.com/istio/istio/pull/25349 for how we change .global to a specific cluster name.
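As a minimal sketch of this naming scheme (the names are placeholders, not taken from the PR), each remote cluster gets its own host in its own ServiceEntry instead of a shared .global suffix:

```yaml
# Hypothetical fragment: one ServiceEntry per remote cluster, each with a
# distinct virtual IP and that cluster's own ingress gateway endpoint.
spec:
  hosts:
  - service.namespace.cluster-b   # instead of service.namespace.global
```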

It turns out I had an issue with my initial coredns configuration, which I've resolved. With that fix, this all seems to be working for me on GKE with both Istio 1.6 and 1.7.

I’ve re-worked my test environment so that I only deploy HelloWorld to one cluster, so that it’s more similar to the issue here. Here are the scripts I’m using:

  • Install Control Plane.
  • Install Services. This installs HelloWorld in cluster1 and the sleep service in both clusters. It also adds a ServiceEntry to each cluster. For the cluster that hosts HelloWorld, the SE refers to the service IP (rather than ingress). For the other cluster, it references the ingress.
  • Test calls the HelloWorld service from both clusters.

Example:

./install_control_plane.sh $CTX1
./install_control_plane.sh $CTX2
./install_services.sh $CTX1 $CTX2
./test.sh $CTX1 $CTX2

And the output:

Sending traffic from cluster gke_nathanmittler-istio_us-east1-b_cluster1
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Sending traffic from cluster gke_nathanmittler-istio_us-central1-b_cluster2
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq
Hello version: v1, instance: helloworld-v1-578dd69f69-t5kcq

Take a look and maybe give it a run to see if this works for you.

I think I may have replicated the issue with Istio 1.7 on GKE. It’s been a bit slow-going since I had to remember how all of this works 😃. Still digging into it.

We’re moving away from this method (.global) of configuring multi-primary. There are a lot of reasons for this (general complexity, odd load balancing behavior, loss of locality LB, etc.). The new installation instructions for 1.8 use a simpler approach that lets cluster.local work across the mesh in the same way that it does for primary-remote. Load balancing works properly in this model as well, since each control plane has all of the actual endpoints from every cluster and can load balance evenly across them.

I haven’t tried 1.6 yet, but I’ll do that next week (off for the long weekend).

@stevenctl do you recall any changes post 1.6 that may have broken the old .global configuration for multi-primary?