istio: 503 during VirtualService update in 1.8
Bug description We’re trying to upgrade Knative to Istio 1.8, but are running into issues (knative/net-istio#426). We have a test that creates a VirtualService and then updates it 10 times, each time changing the header it returns. We usually run 10 of these tests in parallel.
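For context, on each iteration the test swaps in a new response header on a VirtualService roughly like the following. This is a minimal sketch; the name, host, header key, and backend are assumptions, not the actual test fixtures:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ingress-conformance-update   # assumed name
  namespace: default                 # assumed namespace
spec:
  gateways:
  - knative-serving/knative-ingress-gateway
  hosts:
  - ingress-conformance-update.example.com   # assumed host
  http:
  - headers:
      response:
        set:
          # each update replaces this value; the test then polls until
          # responses carry the new value
          K-Update-Marker: ingress-conformance-0-update-bflpksfb
    route:
    - destination:
        host: test-backend.default.svc.cluster.local   # assumed backend
        port:
          number: 80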
After updating to 1.8, I consistently see 503s between updates. The bracketed test names (e.g. [ingress-conformance-0-update-bflpksfb] and [ingress-conformance-0-update-jwnqvzhi]) are the header values we expect to see reflected in the response, as in the log excerpt below. The outage happens when the new header gets applied.
update.go:191: [ingress-conformance-10-update-bxlqwtjl] Got OK status!
util.go:975: Error meeting response expectations for "http://ingress-conformance-10-update-uvnvrjow.example.com": got unexpected status: 503, expected map[200:{}]
util.go:975: HTTP/1.1 503 Service Unavailable
Date: Tue, 01 Dec 2020 21:04:57 GMT
Server: istio-envoy
Content-Length: 0
update.go:191: [ingress-conformance-10-update-xuzhueaw] Got OK status!
Envoy’s access logs show the request failing with the NR (no route configured) response flag during this window, which is odd:
[2020-12-01T21:04:57.972Z] "GET / HTTP/1.1" 503 NR "-" 0 0 0 - "10.128.0.39" "knative.dev/TestIngressConformance/10/update/ingress-conformance-10-update-gouiithw" "e27c8599-d973-4d80-876d-4838ac4e0523" "ingress-conformance-10-update-uvnvrjow.example.com" "-" - - 10.8.0.7:8080 10.128.0.39:23011 - -
Affected product area (please put an X in all that apply)
[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
[ ] Upgrade
Expected behavior The VirtualService should stay available between updates.
Version (include the output of istioctl version --remote, kubectl version --short, and helm version --short if you used Helm)
client version: 1.7.2
control plane version: 1.8.0
data plane version: 1.8.0 (1 proxies)
How was Istio installed? Latest 1.8 using istioctl with the following config:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: disabled
      useMCP: false
      # The third-party-jwt is not enabled on all k8s.
      # See: https://istio.io/docs/ops/best-practices/security/#configure-third-party-service-account-tokens
      jwtPolicy: first-party-jwt
    pilot:
      autoscaleMin: 3
      autoscaleMax: 10
      cpu:
        targetAverageUtilization: 60
    gateways:
      istio-ingressgateway:
        autoscaleMin: 2
        autoscaleMax: 5
  addonComponents:
    pilot:
      enabled: true
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        resources:
          limits:
            cpu: 3000m
            memory: 2048Mi
          requests:
            cpu: 1000m
            memory: 1024Mi
Environment where the bug was observed (cloud vendor, OS, etc.): GKE
This is fixed in https://github.com/istio/istio/pull/29060. Basically, due to a merge conflict, we accidentally enabled a feature in 1.8 before it was ready. It’s fixed in 1.8.1 (as verified by 100 iterations of the above reproducer). https://github.com/istio/istio/issues/29131 tracks re-enabling this feature and marks fixing this issue as a blocker.
The root cause: the feature filters out clusters the gateway doesn’t need. When we get the new VirtualService, we send the new cluster and the route referencing that cluster to Envoy at the same time, so there is a race if the route is applied before the cluster.
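For anyone who needs to stay on 1.8.0 until 1.8.1 lands, one possible mitigation is to disable the filtering feature on istiod explicitly. This is a sketch under an assumption: that the feature in question is the one controlled by the PILOT_FILTER_GATEWAY_CLUSTER_CONFIG environment variable, which is not confirmed in this thread:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        # Assumed flag name; verify against your istiod before relying on it.
        - name: PILOT_FILTER_GATEWAY_CLUSTER_CONFIG
          value: "false"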
OK, I have a repro!
It doesn’t always happen, but I’ve seen this produce a 503 a few times. Sometimes, instead, the requests start hanging for a long time.
YAMLs below (see the sketch after this list for what they roughly contain):
- pod.yaml
- service.yaml
- vs.yaml
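The attachments themselves aren’t inlined above, so here is a rough sketch of what pod.yaml and service.yaml plausibly contain (the app name, image, and port are assumptions); vs.yaml would be along the lines of the VirtualService sketched at the top of the issue:
# pod.yaml (hypothetical): any plain HTTP backend works for the repro
apiVersion: v1
kind: Pod
metadata:
  name: repro-app
  labels:
    app: repro-app
spec:
  containers:
  - name: app
    image: nginx          # assumed image; serves HTTP on port 80
    ports:
    - containerPort: 80
---
# service.yaml (hypothetical): fronts the pod for the VirtualService route
apiVersion: v1
kind: Service
metadata:
  name: repro-app
spec:
  selector:
    app: repro-app
  ports:
  - name: http
    port: 80
    targetPort: 80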
knative-ingress-gateway just looks like this (sketched below from the stock Knative install):
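# Sketch of the stock knative-ingress-gateway shipped with Knative's Istio
# integration; verify against the actual object in your cluster.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: knative-ingress-gateway
  namespace: knative-serving
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - "*"
    port:
      name: http
      number: 80
      protocol: HTTP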