istio: 503 during VirtualService update in 1.8

Bug description We’re trying to upgrade Knative to Istio 1.8 but are running into issues (knative/net-istio#426). We have a test that creates a VirtualService and then updates it 10 times to return different headers. We usually run 10 of these tests in parallel.

After updating to 1.8, I consistently see 503s between updates. [ingress-conformance-0-update-bflpksfb] and [ingress-conformance-0-update-jwnqvzhi] reflect the headers seen in the response, as shown below. The outage happens while the new header is being applied.

    update.go:191: [ingress-conformance-10-update-bxlqwtjl] Got OK status!
    util.go:975: Error meeting response expectations for "http://ingress-conformance-10-update-uvnvrjow.example.com": got unexpected status: 503, expected map[200:{}]
    util.go:975: HTTP/1.1 503 Service Unavailable
        Date: Tue, 01 Dec 2020 21:04:57 GMT
        Server: istio-envoy
        Content-Length: 0
        
    update.go:191: [ingress-conformance-10-update-xuzhueaw] Got OK status! 

Envoy’s access logs show the NR (no route) flag during this window, which is odd:

[2020-12-01T21:04:57.972Z] "GET / HTTP/1.1" 503 NR "-" 0 0 0 - "10.128.0.39" "knative.dev/TestIngressConformance/10/update/ingress-conformance-10-update-gouiithw" "e27c8599-d973-4d80-876d-4838ac4e0523" "ingress-conformance-10-update-uvnvrjow.example.com" "-" - - 10.8.0.7:8080 10.128.0.39:23011 - -
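
A rough way to spot-check this from outside the test (a sketch; the gateway pod name is a placeholder) is to dump the gateway’s route table while the updates are running and look for the test host:

# Hypothetical spot-check: list the route config the ingress gateway currently
# holds and search for the host under test; substitute your own gateway pod name.
istioctl proxy-config routes <istio-ingressgateway-pod>.istio-system -o json \
  | grep ingress-conformance-10-update-uvnvrjow.example.com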


Affected product area:

[ ] Docs
[ ] Installation
[X] Networking
[X] Performance and Scalability
[ ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
[ ] Upgrade

Expected behavior VirtualService should stay available between updates.

Version (include the output of istioctl version --remote and kubectl version --short and helm version --short if you used Helm)

client version: 1.7.2
control plane version: 1.8.0
data plane version: 1.8.0 (1 proxies)

How was Istio installed? Latest 1.8 using istioctl with the following config:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: disabled
      useMCP: false
      # The third-party-jwt is not enabled on all k8s.
      # See: https://istio.io/docs/ops/best-practices/security/#configure-third-party-service-account-tokens
      jwtPolicy: first-party-jwt
    pilot:
      autoscaleMin: 3
      autoscaleMax: 10
      cpu:
        targetAverageUtilization: 60
    gateways:
      istio-ingressgateway:
        autoscaleMin: 2
        autoscaleMax: 5

  addonComponents:
    pilot:
      enabled: true

  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
        k8s:
          resources:
            limits:
              cpu: 3000m
              memory: 2048Mi
            requests:
              cpu: 1000m
              memory: 1024Mi
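
For what it’s worth, a config like the one above is typically applied with istioctl; a minimal sketch, assuming the config is saved as istio-operator.yaml (the filename is a placeholder):

# Hypothetical install command for the IstioOperator config shown above
istioctl install -f istio-operator.yaml -y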

Environment where the bug was observed (cloud vendor, OS, etc) GKE

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 19 (11 by maintainers)

Most upvoted comments

This is fixed in https://github.com/istio/istio/pull/29060. Basically, we accidentally enabled a feature in 1.8 before it was ready, due to a merge conflict. It’s fixed in 1.8.1 (verified by 100 iterations of the reproducer above). https://github.com/istio/istio/issues/29131 tracks re-enabling the feature and marks fixing this issue as a blocker.

The root cause is that the feature filters out clusters the gateway doesn’t need. However, when we get the new VirtualService, we send the new cluster and the route referencing that cluster to Envoy at the same time. There is a race here if the route is applied before the cluster.
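
A rough way to see whether that race is in play (a sketch for observation only, not part of the fix; the pod name and grep pattern are placeholders) is to compare what the gateway proxy holds on the route side versus the cluster side right after an update:

# Hypothetical check: does the gateway already have the cluster that the
# freshly pushed route points at? Pod name and grep pattern are placeholders.
GW=<istio-ingressgateway-pod>.istio-system
istioctl proxy-config routes   "$GW" -o json | grep ingress-conformance
istioctl proxy-config clusters "$GW" | grep ingress-conformance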

OK, I have a repro!

  1. Create 13 pods and services (one per port in 20000–20012), each listening on its own port:
for i in {20000..20012}; do
  export i
  envsubst < pod.yaml | kubectl apply -f -
  envsubst < service.yaml | kubectl apply -f -
done
  2. Repeatedly update the VirtualService to point at different services:
for i in {20000..20012}; do
  export i
  envsubst < vs.yaml | kubectl apply -f -
done
  3. Meanwhile, make a bunch of curl requests:
while :; do
  curl -sI http://<istio-ingressgateway-ip>:80/ -H "Host: ingress-conformance.example.com" | tee -a reqlogs.log
done

It doesn’t always happen, but I’ve seen this create a 503 a few times. Sometimes instead I see the requests start to hang for a long time.
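
To get a rough count of how often the race hits, one can grep the output captured by the curl loop above (assuming reqlogs.log from that loop):

# Count the 503 status lines recorded by the curl loop above
grep -c "HTTP/1.1 503" reqlogs.log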

YAMLs below: pod.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    test-pod: ingress-conformance-$i
  name: ingress-conformance-$i
spec:
  containers:
  - env:
    - name: PORT
      value: "$i"
    image: gcr.io/knative-samples/helloworld-go
    name: foo
    ports:
    - containerPort: $i
      name: http
      protocol: TCP

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: ingress-conformance-$i
spec:
  ports:
  - name: http
    port: $i
    protocol: TCP
    targetPort: $i
  selector:
    test-pod: ingress-conformance-$i
  sessionAffinity: None
  type: ClusterIP

vs.yaml

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ingress-conformance
spec:
  gateways:
  - knative-serving/knative-ingress-gateway
  hosts:
  - ingress-conformance.example.com
  http:
  - route:
    - destination:
        host: ingress-conformance-$i.default.svc.cluster.local
        port:
          number: $i
      headers:
        response:
          set:
            Who-Are-You: ingress-conformance-$i

knative-ingress-gateway just looks like this:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  labels:
    networking.knative.dev/ingress-provider: istio
    serving.knative.dev/release: devel
  name: knative-ingress-gateway
  namespace: knative-serving
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP