istio: 503 NR's during load testing canary deployment

Bug description I see 503 NRs while load testing a canary deployment managed by Argo Rollouts. I’ll describe a minimal setup with steps to reproduce the issue, reflecting the same sequence of actions Argo Rollouts performs during a canary deployment.

[ ] Docs [ ] Installation [x] Networking [ ] Performance and Scalability [ ] Extensions and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Expected behavior No downtime during canary deployments using Argo Rollouts.

Steps to reproduce the bug

  1. Create a cluster in GKE and install Istio using istioctl (exact command is below).

  2. kubectl apply the following resources:

  • Gateway:
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-ingress
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*.example'
    port:
      name: http
      number: 80
      protocol: HTTP
  • Stable deployment (reviews app v1):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
  labels:
    app: reviews
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v1:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
  • Canary deployment (reviews app v2):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v2
  labels:
    app: reviews
    version: v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v2
  template:
    metadata:
      labels:
        app: reviews
        version: v2
    spec:
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v2:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
  • Stable deployment Service (pointing at the v1 Deployment):
apiVersion: v1
kind: Service
metadata:
  name: reviews-stable
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
  • Canary deployment Service (also pointing at v1 at the very beginning):
apiVersion: v1
kind: Service
metadata:
  name: reviews-canary
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
  • VirtualService definition (100% of traffic goes to the stable deployment Service):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  gateways:
  - istio-system/example-ingress
  hosts:
  - reviews.example
  http:
  - name: primary
    route:
    - destination:
        host: reviews-stable
      weight: 100
    - destination:
        host: reviews-canary
      weight: 0
  3. Execute a JMeter load test against reviews.example:
...
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="HTTP Request" enabled="true">
          <elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
            <collectionProp name="Arguments.arguments"/>
          </elementProp>
          <stringProp name="HTTPSampler.domain">reviews.example</stringProp>
          <stringProp name="HTTPSampler.port"></stringProp>
          <stringProp name="HTTPSampler.protocol">http</stringProp>
          <stringProp name="HTTPSampler.contentEncoding"></stringProp>
          <stringProp name="HTTPSampler.path">/health</stringProp>
          <stringProp name="HTTPSampler.method">GET</stringProp>
          <boolProp name="HTTPSampler.follow_redirects">true</boolProp>
          <boolProp name="HTTPSampler.auto_redirects">false</boolProp>
          <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
          <boolProp name="HTTPSampler.DO_MULTIPART_POST">false</boolProp>
          <stringProp name="HTTPSampler.embedded_url_re"></stringProp>
          <stringProp name="HTTPSampler.proxyHost">10.20.30.40</stringProp>
          <stringProp name="HTTPSampler.proxyPort">80</stringProp>
          <stringProp name="HTTPSampler.connect_timeout"></stringProp>
          <stringProp name="HTTPSampler.response_timeout"></stringProp>
        </HTTPSamplerProxy>
...
  4. Point the canary deployment Service at Deployment v2 (a kubectl equivalent is shown right after this list):
curl -X PATCH -H "Content-Type: application/merge-patch+json" --data '{"spec": {"selector": {"version": "v2"}}}' http://localhost:8080/api/v1/namespaces/default/services/reviews-canary
  5. Observe some downtime:
$ tail results -f | grep -Ev '(200,OK)'
1598539426258,165,HTTP Request,503,Service Unavailable,Thread Group 1-3,,false,,148,177,5,5,http://reviews.example/health,165,0,28
1598539426256,184,HTTP Request,503,Service Unavailable,Thread Group 1-4,,false,,149,177,5,5,http://reviews.example/health,184,0,29
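
For reference, the selector change in step 4 can also be applied with kubectl rather than by calling the API server through the local proxy; this is the same command used in the reproduction loop later in this thread:
kubectl patch svc reviews-canary --patch '{"spec": {"selector": {"version": "v2"}}}'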

All traffic goes through the stable deployment Service, which was not touched at all. Please help me understand the downtime.

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

$ istioctl version --remote
client version: 1.7.0
control plane version: 1.7.0
data plane version: 1.7.0 (5 proxies)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.1", GitCommit:"688c6543aa4b285355723f100302d80431e411cc", GitTreeState:"clean", BuildDate:"2020-07-21T02:37:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}

How was Istio installed?

istioctl install --set profile=demo
kubectl label namespace default istio-injection=enabled

Environment where bug was observed (cloud vendor, OS, etc) GKE, 1.16.13-gke.1

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 29 (20 by maintainers)

Most upvoted comments

@pliutak-nih not at this time. I have identified the root cause and am trying to get confirmation from the Envoy team whether this is an Envoy bug or if there is something we can do on the Istio side.

This is highly unlikely to land in 1.7, as it required substantial changes to the telemetry code in order to work (which is why it has not been merged yet - it should be soon though).

@ngms06 the issue isn’t so much a delay; it’s that we are swapping something non-atomically.

Before: listener points to cluster v1, we have clusters [v1] - all good

Intermediate state (should be a couple ms): listener points to cluster v1, we have clusters [v2] - broken

After: listener points to cluster v2, we have clusters [v2] - all good
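
To see this from the outside (a sketch of mine, not part of the original comment), the sidecar's clusters and listeners can be dumped while the selector is being flipped and compared during the brief window where they disagree; <reviews-v1-pod> is a placeholder for the actual pod name:
istioctl proxy-config clusters <reviews-v1-pod> | grep reviews
istioctl proxy-config listeners <reviews-v1-pod> --port 9080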

@DmitryKiselev in the short term you can use a VirtualService to do this (see https://medium.com/infinite-lambda/canary-and-blue-green-deployments-with-helm-and-istio-4139886447b6 for example). You can also make sure the Services are created in the “right” order (see the comment in https://github.com/istio/istio/issues/26861#issuecomment-686840121) - this is a HUGE hack though, just thinking of short-term mitigations.

My PR is not yet merged but I am hoping to discuss this issue in the networking WG meeting in 2 days to get a wider audience for some ideas.
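
For illustration, a rough sketch of the VirtualService-based mitigation (my reading of the suggestion above, not a confirmed recipe): keep reviews-stable selecting v1 and reviews-canary selecting v2 permanently, and shift traffic only through the VirtualService weights, so the Service selectors never need to be patched. The weights below are just example values:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  gateways:
  - istio-system/example-ingress
  hosts:
  - reviews.example
  http:
  - name: primary
    route:
    - destination:
        host: reviews-stable   # Service fixed to select version: v1
      weight: 90
    - destination:
        host: reviews-canary   # Service fixed to select version: v2
      weight: 10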

If the server pod shows 503 NR, it’s very likely the same. If it’s something else, it may be different.
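
If it helps to check, the response flag is visible in the receiving pod's sidecar access log (enabled by default with the demo profile used here); <reviews-pod> is a placeholder:
kubectl logs <reviews-pod> -c istio-proxy | grep '503 NR'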

Here is what is happening. When both Services point to the same pod, we get into a conflict, because Istio needs to set up the inbound configuration and we have two different Services. When this happens, conflict resolution picks the oldest Service first. In our case, either the canary was created first or it was created at the exact same time, at which point we resolve alphabetically - canary comes before stable. This means every time we switch the Service selector, we are switching which Service “wins”. This causes some churn in Envoy, which is somehow causing the issue.
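
A quick way to check which Service is currently “winning” (a sketch based on the ordering rule above, not from the original comment) is to compare the creation timestamps; per the short-term hack mentioned earlier, recreating reviews-canary after reviews-stable should make stable the older Service, so the same Service keeps winning on both sides of the selector flip:
kubectl get svc reviews-stable reviews-canary -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp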

Full reproducing config:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-ingress
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
  labels:
    app: reviews
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v1:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v2
  labels:
    app: reviews
    version: v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v2
  template:
    metadata:
      labels:
        app: reviews
        version: v2
    spec:
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v2:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: reviews-stable
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
---
apiVersion: v1
kind: Service
metadata:
  name: reviews-canary
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  gateways:
  - istio-system/example-ingress
  hosts:
  - reviews
  http:
  - name: primary
    retries:
      attempts: 0
    route:
    - destination:
        host: reviews-stable
      weight: 100
    - destination:
        host: reviews-canary
      weight: 0

Then just fortio load -qps 40 -t 0s -H "Host: reviews.default.svc.cluster.local" IP/health

Access logs on inbound sidecar:

[2020-09-04T01:03:17.418Z] "GET /health HTTP/1.1" 503 NR "-" "-" 0 0 0 - "10.128.15.218" "fortio.org/fortio-1.3.1" "73799c60-de7b-41f7-b51f-0260c576123d" "reviews.default.svc.cluster.local" "-" - - 10.28.3.75:9080 10.128.15.218:0 outbound_.9080_._.reviews-stable.default.svc.cluster.local default

The selector flip that triggers the 503s was driven in a loop:
for i in {0..1000000}; do kubectl patch svc reviews-canary --patch '{"spec": {"selector": {"version": "v1"}}}'; sleep 2; kubectl patch svc reviews-canary --patch '{"spec": {"selector": {"version": "v2"}}}'; sleep 2; done

Hi John, thanks for looking into it.

I see that you’re trying to patch reviews-stable. Please note, I’m patching reviews-canary in step 4 (it’s exactly what argo-rollouts does prior to switching traffic to the canary).