istio: 503 NR's during load testing canary deployment
Bug description
I see 503 NRs during load testing of a canary deployment using Argo Rollouts. I'll describe a minimal setup with steps to reproduce the issue, reflecting the same sequence of actions Argo Rollouts performs during a canary deployment.
Affected product area:
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior
No downtime during a canary deployment using Argo Rollouts.
Steps to reproduce the bug
- Create a cluster in GKE and install Istio using istioctl (exact command is below).
- kubectl apply the following resources:
- Gateway:
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-ingress
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*.example'
    port:
      name: http
      number: 80
      protocol: HTTP
- Stable deployment (reviews app v1):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v1
  labels:
    app: reviews
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v1
  template:
    metadata:
      labels:
        app: reviews
        version: v1
    spec:
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v1:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
- Canary deployment (reviews app v2):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reviews-v2
  labels:
    app: reviews
    version: v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: reviews
      version: v2
  template:
    metadata:
      labels:
        app: reviews
        version: v2
    spec:
      containers:
      - name: reviews
        image: docker.io/istio/examples-bookinfo-reviews-v2:1.16.2
        imagePullPolicy: IfNotPresent
        env:
        - name: LOG_DIR
          value: "/tmp/logs"
        ports:
        - containerPort: 9080
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: wlp-output
          mountPath: /opt/ibm/wlp/output
      volumes:
      - name: wlp-output
        emptyDir: {}
      - name: tmp
        emptyDir: {}
- Stable deployment Service (pointing at the v1 Deployment):
apiVersion: v1
kind: Service
metadata:
  name: reviews-stable
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
- Canary deployment Service (also pointing at v1 at the very beginning):
apiVersion: v1
kind: Service
metadata:
  name: reviews-canary
spec:
  ports:
  - port: 9080
    name: http
  selector:
    app: reviews
    version: v1
- VirtualService definition (100% of traffic goes to the stable deployment Service):
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  gateways:
  - istio-system/example-ingress
  hosts:
  - reviews.example
  http:
  - name: primary
    route:
    - destination:
        host: reviews-stable
      weight: 100
    - destination:
        host: reviews-canary
      weight: 0
- Execute a JMeter load test against reviews.example:
...
<HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="HTTP Request" enabled="true">
<elementProp name="HTTPsampler.Arguments" elementType="Arguments" guiclass="HTTPArgumentsPanel" testclass="Arguments" testname="User Defined Variables" enabled="true">
<collectionProp name="Arguments.arguments"/>
</elementProp>
<stringProp name="HTTPSampler.domain">reviews.example</stringProp>
<stringProp name="HTTPSampler.port"></stringProp>
<stringProp name="HTTPSampler.protocol">http</stringProp>
<stringProp name="HTTPSampler.contentEncoding"></stringProp>
<stringProp name="HTTPSampler.path">/health</stringProp>
<stringProp name="HTTPSampler.method">GET</stringProp>
<boolProp name="HTTPSampler.follow_redirects">true</boolProp>
<boolProp name="HTTPSampler.auto_redirects">false</boolProp>
<boolProp name="HTTPSampler.use_keepalive">true</boolProp>
<boolProp name="HTTPSampler.DO_MULTIPART_POST">false</boolProp>
<stringProp name="HTTPSampler.embedded_url_re"></stringProp>
<stringProp name="HTTPSampler.proxyHost">10.20.30.40</stringProp>
<stringProp name="HTTPSampler.proxyPort">80</stringProp>
<stringProp name="HTTPSampler.connect_timeout"></stringProp>
<stringProp name="HTTPSampler.response_timeout"></stringProp>
</HTTPSamplerProxy>
...
- Point the canary deployment Service at Deployment v2 (an equivalent kubectl patch is sketched after these steps):
curl -X PATCH -H "Content-Type: application/merge-patch+json" --data '{"spec": {"selector": {"version": "v2"}}}' http://localhost:8080/api/v1/namespaces/default/services/reviews-canary
- Observe some downtime:
$ tail results -f | grep -Ev '(200,OK)'
1598539426258,165,HTTP Request,503,Service Unavailable,Thread Group 1-3,,false,,148,177,5,5,http://reviews.example/health,165,0,28
1598539426256,184,HTTP Request,503,Service Unavailable,Thread Group 1-4,,false,,149,177,5,5,http://reviews.example/health,184,0,29
All traffic goes through the stable deployment Service, which was not touched at all. Please help explain the downtime.
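For reference, the selector flip from step 4 can also be done with kubectl patch instead of calling the API server directly. This is just a sketch with the same effect, assuming kubectl is pointed at the same cluster and the default namespace:

# flip the canary Service selector from version v1 to v2
kubectl patch service reviews-canary --type merge \
  -p '{"spec":{"selector":{"app":"reviews","version":"v2"}}}'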
Version (include the output of istioctl version --remote and kubectl version, and helm version if you used Helm)
$ istioctl version --remote
client version: 1.7.0
control plane version: 1.7.0
data plane version: 1.7.0 (5 proxies)
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:54Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.13-gke.1", GitCommit:"688c6543aa4b285355723f100302d80431e411cc", GitTreeState:"clean", BuildDate:"2020-07-21T02:37:26Z", GoVersion:"go1.13.9b4", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
istioctl install --set profile=demo
kubectl label namespace default istio-injection=enabled
Environment where the bug was observed (cloud vendor, OS, etc.)
GKE, 1.16.13-gke.1
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 29 (20 by maintainers)
@pliutak-nih not at this time. I have identified the root cause and am trying to get confirmation from the Envoy team whether this is an Envoy bug or whether there is something we can do on the Istio side.
This is highly unlikely to land in 1.7, as it required substantial changes to the telemetry code in order to work (which is why it has not been merged; it should be soon though).
@ngms06 the issue isn't so much a delay; it's that we are swapping something non-atomically.
Before: listener points to cluster v1, we have clusters [v1] - all good
Intermediate state (should be a couple ms): listener points to cluster v1, we have clusters [v2] - broken
After: listener points to cluster v2, we have clusters [v2] - all good
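One way to catch that intermediate state is to dump the sidecar's Envoy config on the reviews pod while the selector flip happens. A sketch (the pod name is a placeholder, and the exact inbound cluster/listener names can differ between Istio versions):

# inbound clusters known to the reviews pod's sidecar
istioctl proxy-config cluster <reviews-pod> | grep inbound
# inbound listener for the application port
istioctl proxy-config listener <reviews-pod> --port 9080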
@DmitryKiselev in the short term you can use a VirtualService to do this (see https://medium.com/infinite-lambda/canary-and-blue-green-deployments-with-helm-and-istio-4139886447b6 for example). You can also make sure the Service is created in the “right” order (see the comment in https://github.com/istio/istio/issues/26861#issuecomment-686840121) - this is a HUGE hack though; I'm just thinking of short-term mitigations.
My PR is not yet merged but I am hoping to discuss this issue in the networking WG meeting in 2 days to get a wider audience for some ideas.
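For illustration, a subset-based variant of the VirtualService approach mentioned above could look like the sketch below. This is not taken from the linked article; the single reviews Service (selecting only app: reviews, no version label), the DestinationRule name, and the subset names are all assumptions:

# Hypothetical sketch: shift traffic by subset weights instead of flipping Service selectors.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-dr
spec:
  host: reviews
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-vsvc
spec:
  gateways:
  - istio-system/example-ingress
  hosts:
  - reviews.example
  http:
  - name: primary
    route:
    - destination:
        host: reviews
        subset: stable
      weight: 90
    - destination:
        host: reviews
        subset: canary
      weight: 10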
If the server pod shows 503 NR it's very likely the same. If it's something else, it may be different. Here is what is happening: when both Services point to the same pod, we get into a conflict, because Istio needs to set up the inbound configuration and we have two different Services. When this happens, conflict resolution picks the oldest Service first. In our case, either the canary was created first or it was created at the exact same time, at which point we resolve alphabetically, and canary comes before stable. This means that every time we switch the Service, we are switching which Service “wins”. This causes some churn in Envoy, which is somehow causing the issue.
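To check which Service Istio's conflict resolution treats as the oldest, comparing creation timestamps is enough. A quick check, assuming both Services are in the default namespace:

kubectl get svc reviews-stable reviews-canary \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp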
Full reproducing config:
Then just:
fortio load -qps 40 -t 0s -H "Host: reviews.default.svc.cluster.local" IP/health
Access logs on inbound sidecar:
Hi John, thanks for looking into it.
I see that you're trying to patch reviews-stable. Please note that I'm patching reviews-canary in step 4 (this is exactly what Argo Rollouts does before switching traffic to the canary).