istio: Istio is picking up new virtual services slowly

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure [ ] Docs [ ] Installation [X] Networking [X] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster [ ] Virtual Machine [ ] Multi Control Plane

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)

client version: 1.5.4
cluster-local-gateway version:
cluster-local-gateway version:
cluster-local-gateway version:
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
ingressgateway version: 1.5.4
pilot version: 1.5.4
pilot version: 1.5.4
pilot version: 1.5.4
data plane version: 1.5.4 (6 proxies)

How was Istio installed?

cat << EOF > ./istio-minimal-operator.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      proxy:
        autoInject: disabled
      useMCP: false
      # The third-party-jwt is not enabled on all k8s.
      # See: https://istio.io/docs/ops/best-practices/security/#configure-third-party-service-account-tokens
      jwtPolicy: first-party-jwt

  addonComponents:
    pilot:
      enabled: true
    prometheus:
      enabled: false

  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
      - name: cluster-local-gateway
        enabled: true
        label:
          istio: cluster-local-gateway
          app: cluster-local-gateway
        k8s:
          service:
            type: ClusterIP
            ports:
            - port: 15020
              name: status-port
            - port: 80
              name: http2
            - port: 443
              name: https
EOF

./istioctl manifest generate -f istio-minimal-operator.yaml \
--set values.gateways.istio-egressgateway.enabled=false \
--set values.gateways.istio-ingressgateway.sds.enabled=true \
--set values.gateways.istio-ingressgateway.autoscaleMin=3 \
--set values.gateways.istio-ingressgateway.autoscaleMax=6 \
--set values.pilot.autoscaleMin=3 \
--set values.pilot.autoscaleMax=6 \
--set hub=icr.io/ext/istio  > istio.yaml

kubectl apply -f istio.yaml    # more visibility than istioctl manifest apply

Environment where bug was observed (cloud vendor, OS, etc)

IKS

When we create ~1k virtual services in a single cluster, the ingress gateway picks up new virtual services slowly.

image

The blue line in the chart shows the total time until a probe to the gateway pod succeeds (a 200 response code and the expected K-Network-Hash header). The stepped increases are caused by the exponential retry backoff of the probing. The overall trend, however, looks linear: with 800 virtual services present, it takes ~50s for a new virtual service to be picked up.
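To illustrate why exponential backoff makes the measured time stepped, here is a minimal sketch. The function name, the initial delay (1s) and the factor (2) are hypothetical values picked for illustration; they are not Knative's actual probe settings.

```shell
# observed_ready_time ACTUAL [INITIAL_DELAY] [FACTOR]
# Prints the time (in whole seconds) of the first probe at or after ACTUAL,
# assuming a probe at t=0 and then waits that grow exponentially.
# INITIAL_DELAY and FACTOR are hypothetical, not Knative's real settings.
# The body runs in a subshell so the variables do not leak to the caller.
observed_ready_time() (
  actual="$1"
  delay="${2:-1}"
  factor="${3:-2}"
  t=0
  while [ "$t" -lt "$actual" ]; do
    t=$((t + delay))            # time of the next probe
    delay=$((delay * factor))   # back off before the following probe
  done
  echo "$t"
)
```

With these assumed settings the probes happen at 0s, 1s, 3s, 7s, 15s, and so on, so any configuration that actually becomes ready between 8s and 15s is reported as 15s. That quantization is what shows up as steps in the chart.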

I also tried to dump and grep the config in the istio-ingressgateway pod after the virtual service was created. Initially the output was empty, and it took about 1 minute for the following result to show up.

curl localhost:15000/config_dump |grep testabc
      "name": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local",
       "service_name": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local"
      "name": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|8022||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|8022||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.8022_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|80||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|80||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.80_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
       "service_name": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local"
          "sni": "outbound_.80_._.testabc-hhmjj-1.default.svc.cluster.local"
      "name": "outbound|9090||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|9090||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.9090_._.testabc-hhmjj-1-private.default.svc.cluster.local"
      "name": "outbound|9091||testabc-hhmjj-1-private.default.svc.cluster.local",
       "service_name": "outbound|9091||testabc-hhmjj-1-private.default.svc.cluster.local"
          "sni": "outbound_.9091_._.testabc-hhmjj-1-private.default.svc.cluster.local"
        "name": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud:80",
         "testabc.default.dev-serving.codeengine.dev.appdomain.cloud",
         "testabc.default.dev-serving.codeengine.dev.appdomain.cloud:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
        "name": "testabc.default.svc.cluster.local:80",
         "testabc.default.svc.cluster.local",
         "testabc.default.svc.cluster.local:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
        "name": "testabc.default.svc:80",
         "testabc.default.svc",
         "testabc.default.svc:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
        "name": "testabc.default:80",
         "testabc.default",
         "testabc.default:80"
             "prefix_match": "testabc.default.dev-serving.codeengine.dev.appdomain.cloud"
           "cluster": "outbound|80||testabc-hhmjj-1.default.svc.cluster.local",
             "config": "/apis/networking.istio.io/v1alpha3/namespaces/default/virtual-service/testabc-ingress"
           "operation": "testabc-hhmjj-1.default.svc.cluster.local:80/*"
             "value": "testabc-hhmjj-1"
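The manual check above can be turned into a rough timing loop. This is only a sketch: the function names, the default timeout and the poll interval are assumptions, and it has to run somewhere the Envoy admin port (localhost:15000) is reachable, as in the curl command above.

```shell
# fetch_config_dump: prints the Envoy config dump.
# Redefine this function if the admin endpoint is reached differently.
fetch_config_dump() {
  curl -s localhost:15000/config_dump
}

# wait_for_config NEEDLE [TIMEOUT_SECONDS] [POLL_INTERVAL_SECONDS]
# Polls until NEEDLE appears in the config dump, then prints the elapsed
# seconds and exits 0. Exits 1 if the timeout is reached first.
# The body is a subshell, so "exit" only leaves the function.
wait_for_config() (
  needle="$1"
  timeout="${2:-120}"
  interval="${3:-1}"
  start=$(date +%s)
  while :; do
    if fetch_config_dump | grep -qF -- "$needle"; then
      echo "$(( $(date +%s) - start ))"
      exit 0
    fi
    if [ "$(( $(date +%s) - start ))" -ge "$timeout" ]; then
      exit 1
    fi
    sleep "$interval"
  done
)
```

For example, running `wait_for_config testabc 300` right after creating the virtual service would print roughly how many seconds passed before the gateway's config mentioned it.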

There is no memory or CPU pressure on the Istio components.

kubectl -n istio-system top pods
NAME                                     CPU(cores)   MEMORY(bytes)
cluster-local-gateway-644fd5f945-f4d6d   29m          953Mi
cluster-local-gateway-644fd5f945-mlhkc   34m          952Mi
cluster-local-gateway-644fd5f945-nt4qk   30m          958Mi
istio-ingressgateway-7759f4649d-b5whx    37m          1254Mi
istio-ingressgateway-7759f4649d-g2ppv    43m          1262Mi
istio-ingressgateway-7759f4649d-pv6qs    48m          1431Mi
istiod-6fb9877647-7h7wk                  8m           875Mi
istiod-6fb9877647-k6n9m                  9m           914Mi
istiod-6fb9877647-mncpn                  26m          925Mi

Below is a typical virtual service created by Knative.

---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  annotations:
    networking.knative.dev/ingress.class: istio.ingress.networking.knative.dev
  creationTimestamp: "2020-07-20T08:23:54Z"
  generation: 1
  labels:
    networking.internal.knative.dev/ingress: hello29
    serving.knative.dev/route: hello29
    serving.knative.dev/routeNamespace: default
  name: hello29-ingress
  namespace: default
  ownerReferences:
  - apiVersion: networking.internal.knative.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: Ingress
    name: hello29
    uid: 433c415d-901e-4154-bfd9-43178d0db192
  resourceVersion: "47989694"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/default/virtualservices/hello29-ingress
  uid: 20162863-b721-4d67-aa19-b84adc7dffe0
spec:
  gateways:
  - knative-serving/cluster-local-gateway
  - knative-serving/knative-ingress-gateway
  hosts:
  - hello29.default
  - hello29.default.dev-serving.codeengine.dev.appdomain.cloud
  - hello29.default.svc
  - hello29.default.svc.cluster.local
  http:
  - headers:
      request:
        set:
          K-Network-Hash: 12a72f65db15ba3a00ad16b328c40b5398a86cc84ba3239ad37f4d5ef811b0fa
    match:
    - authority:
        prefix: hello29.default
      gateways:
      - knative-serving/cluster-local-gateway
    retries: {}
    route:
    - destination:
        host: hello29-cpwpf-1.default.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: hello29-cpwpf-1
      weight: 100
    timeout: 600s
  - headers:
      request:
        set:
          K-Network-Hash: 12a72f65db15ba3a00ad16b328c40b5398a86cc84ba3239ad37f4d5ef811b0fa
    match:
    - authority:
        prefix: hello29.default.dev-serving.codeengine.dev.appdomain.cloud
      gateways:
      - knative-serving/knative-ingress-gateway
    retries: {}
    route:
    - destination:
        host: hello29-cpwpf-1.default.svc.cluster.local
        port:
          number: 80
      headers:
        request:
          set:
            Knative-Serving-Namespace: default
            Knative-Serving-Revision: hello29-cpwpf-1
      weight: 100
    timeout: 600s
---

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 82 (59 by maintainers)

Most upvoted comments

Here are some test results from tuning the PILOT_DEBOUNCE_AFTER parameter. All tests used PILOT_DEBOUNCE_MAX = 10s.

  1. PILOT_DEBOUNCE_AFTER = 1s. In the first round I tested with a total of 800 ksvcs; the result is shown below. It is a lot better than with the default value of 100ms: there are no sudden jumps around index ~200. But I noticed that the ready time increases to ~50s by the end of the test, so I ran another test with the total ksvc count increased to 1400. image

We can see that similar sudden jumps happen after index 800. So I assume that a longer PILOT_DEBOUNCE_AFTER alleviates the stress of configuring Envoy, but it does not solve the issue completely. image

  2. PILOT_DEBOUNCE_AFTER = 10s. In the first round of tests with 800 ksvcs, most ready times are around 13s, which is the first probe after 10s in Knative. image Below is the result for a total of 1400 ksvcs. With the debounce time set to 10s, the result after index 800 became stable again. I suspect that as the total ksvc count grows, we will hit the sudden increase again. image
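For anyone who wants to repeat these runs, one possible way to set the two variables is an extra IstioOperator overlay, written in the same heredoc style as the install file above. The comment does not say how the values were actually applied, so the overlay file name and the use of `components.pilot.k8s.env` are assumptions.

```shell
# Writes a hypothetical overlay file that sets the debounce variables on istiod.
# The file name is made up; the values are the ones from the first test above.
cat << 'EOF' > ./istio-debounce-overlay.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    pilot:
      k8s:
        env:
        - name: PILOT_DEBOUNCE_AFTER
          value: "1s"
        - name: PILOT_DEBOUNCE_MAX
          value: "10s"
EOF

# Not run here: the overlay would be passed as an additional -f file, e.g.
#   ./istioctl manifest generate -f istio-minimal-operator.yaml \
#     -f istio-debounce-overlay.yaml > istio.yaml
```

Changing `value: "1s"` to `"10s"` would correspond to the second test.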

@linsun We did not set a CPU limit for istiod. We also set the replicas of istiod and of both the public and private gateways to a fixed value of 3.

I have tried 1.6.8 and 1.7.0-rc.1 (released today) with no luck. This time I recorded some dashboards during the tests. image image

It seems that there were some errors for eds_sender, and some small pushes at ~15:05. I believe the gateway got configured correctly after the additional push. I will try to collect the detailed error logs.

With the fixes in https://github.com/istio/istio/issues/23029 I am able to reproduce a result similar to the one @sdake posted. The individual spikes are gone and I think it is a valid fix. Thank you 👍.

image I think we can now focus on the sudden increase after 200 ksvcs. It should be another unknown issue and I will try to collect more logs.

image image

Here is another test, which is our typical test as I described at the beginning of this issue. We create 800 ksvcs sequentially with a 5s interval, so it takes roughly 1hr for all 1600 virtual services to be created. We found a similar result, but there are NO errors in the Pilot Errors dashboard. From checking the endpoints in Envoy, we can tell that the endpoint configs are missing in pods istio-ingressgateway-6d899bb698-5fvg6 and istio-ingressgateway-6d899bb698-qckqf for ksvc coligotest-9, which was stuck for >20 minutes. Notice that the Pilot Pushes dashboard shows small peaks at 10:15 and 11:15, which should get the missing endpoints configured correctly.

cluster-local-gateway-6d697dfd57-2l6wp:
172.30.137.186:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.137.186:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound_.8022_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound|8022||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound_.9090_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound|9090||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound_.9091_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound|9091||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
cluster-local-gateway-6d697dfd57-8hznw:
172.30.137.186:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.137.186:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound_.8022_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound|8022||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound_.9090_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound|9090||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound_.9091_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound|9091||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
cluster-local-gateway-6d697dfd57-kfx6g:
172.30.137.186:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.137.186:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound_.8022_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound|8022||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound_.9090_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound|9090||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound_.9091_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound|9091||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
istio-ingressgateway-6d899bb698-5fvg6:
172.30.252.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound_.8022_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound|8022||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound_.9090_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound|9090||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound_.9091_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound|9091||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
istio-ingressgateway-6d899bb698-c48jm:
172.30.137.186:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.137.186:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.192.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.246.51:8012               HEALTHY     OK                outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound_.8022_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound|8022||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound_.9090_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound|9090||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound_.9091_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound|9091||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
istio-ingressgateway-6d899bb698-qckqf:
172.30.252.179:8012              HEALTHY     OK                outbound_.80_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8012              HEALTHY     OK                outbound|80||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound_.8022_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:8022              HEALTHY     OK                outbound|8022||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound_.9090_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9090              HEALTHY     OK                outbound|9090||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound_.9091_._.coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
172.30.252.179:9091              HEALTHY     OK                outbound|9091||coligotest1-239-tx42v-private.coligotest-9.svc.cluster.local
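Spotting the pods with missing endpoints by eye is error-prone in a listing this long. The sketch below assumes the combined format shown above (a `pod-name:` line followed by endpoint rows whose last column is the cluster name) and prints every pod that has no row for a given cluster. The function name is made up for this example.

```shell
# pods_missing_cluster CLUSTER
# Reads a combined endpoint listing on stdin and prints each pod name
# that has no endpoint row for CLUSTER.
pods_missing_cluster() {
  awk -v cluster="$1" '
    /^[^ ]+:$/ {                       # a "pod-name:" header line
      if (pod != "" && !found) print pod
      pod = substr($0, 1, length($0) - 1)
      found = 0
      next
    }
    $NF == cluster { found = 1 }       # last column is the cluster name
    END { if (pod != "" && !found) print pod }
  '
}
```

Fed the listing above with `outbound|80||coligotest1-239-tx42v.coligotest-9.svc.cluster.local` as the argument, it should print the two istio-ingressgateway pods named in the comment, since only their sections lack rows for that cluster.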

I think the missing endpoint configs contribute to the unexpected peak values in the following chart. We have not seen this in releases prior to 1.6.
image

@sdake FYI, here is the benchmark tool we're using. It generates Knative Services at different intervals, as @lanceliuu mentioned above, and reports the ingress_lb_ready duration and a dashboard. https://github.com/zhanggbj/kperf

These results are better, although ADS continues to disconnect and Envoy OOMs: https://github.com/istio/istio/issues/28192. I do feel like this is the first attempt at a PR that manages the numerous constraints of the protocol implementation. (Screen Shot 2020-10-22 at 8 38 11 AM)

When the connection is dropped because of the 5 second send timeout, the system falls over. Under heavy service churn, the proxy enters a super-overloaded state where it can take 200-800 seconds to recover - see: https://github.com/istio/istio/issues/25685#issuecomment-668488995.

Yes, that is what I am saying.

What is an acceptable delay for a new VS being created? Obviously we want it to be as fast as possible, but we need to be realistic so we can make the appropriate compromises.

About the ratio of Knative services : virtual services : K8s services: for 800 Knative services, we only have 1.6k virtual services, FYI.

1 Knative service creates 2 virtual services, as shown below (we only use bluex-ingress, as we only use knative-serving/cluster-local-gateway and knative-serving/knative-ingress-gateway), and 3 K8s services.

k get vs -n e8e8f528-4b9a
NAME                 GATEWAYS                                                                          HOSTS
bluex-ingress        [knative-serving/cluster-local-gateway knative-serving/knative-ingress-gateway]   [bluex.e8e8f528-4b9a bluex.e8e8f528-4b9a.dev-serving.codeengine.dev.appdomain.cloud bluex.e8e8f528-4b9a.svc bluex.e8e8f528-4b9a.svc.cluster.local]   17h
bluex-mesh           [mesh]                                                                            [bluex.e8e8f528-4b9a bluex.e8e8f528-4b9a.svc bluex.e8e8f528-4b9a.svc.cluster.local]
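The ratios stated above (2 virtual services and 3 K8s services per Knative service) give the object counts for any test size. A tiny sketch, with a made-up function name:

```shell
# objects_for_ksvcs COUNT
# Prints the expected number of VirtualServices and K8s Services for COUNT
# Knative services, using the 1:2:3 ratio described in the comment above.
objects_for_ksvcs() (
  ksvcs="$1"
  echo "virtualservices=$((ksvcs * 2)) k8s_services=$((ksvcs * 3))"
)
```

For the 800-ksvc test, `objects_for_ksvcs 800` prints `virtualservices=1600 k8s_services=2400`, which matches the 1.6k virtual services mentioned above.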

Relevant PR: https://github.com/istio/istio/pull/24230 that we closed for now due to lack of time to test it

On Thu, Aug 13, 2020 at 2:43 PM Steven Dake notifications@github.com wrote:

@howardjohn https://github.com/howardjohn right, that's what I was thinking of - debounce. I'll work with the team to tune the debounce, and if we can find some improvement, perhaps I can assist with an exponential debounce - unless someone is already working on this. I remember this whole section of push code, and its unsuitability early on, right when you joined the project 👍

Cheers, -steve

Look at pilot_proxy_queue_time, pilot_push_triggers, and pilot_proxy_convergence_time. All are on the Grafana dashboard. 1.6.7 fixed a critical bug around endpoints getting stuck, so you may want to try that.
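If the dashboard is not at hand, the same metrics can be queried from Prometheus directly. The sketch below rests on several assumptions: Prometheus is reachable at PROM_URL (the default here is a placeholder), the two timing metrics are exported as histograms with `_bucket` series, and pilot_push_triggers is a counter with a `type` label. The helper names are made up.

```shell
# PROM_URL is an assumption; point it at wherever Prometheus is reachable.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# p99_query METRIC [WINDOW]
# Prints a PromQL expression for the 99th percentile of a histogram metric.
p99_query() {
  printf 'histogram_quantile(0.99, sum(rate(%s_bucket[%s])) by (le))\n' "$1" "${2:-1m}"
}

# prom_query QUERY
# Runs an instant query against the Prometheus HTTP API and prints the JSON reply.
prom_query() {
  curl -s -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Examples (not run here):
#   prom_query "$(p99_query pilot_proxy_convergence_time)"
#   prom_query "$(p99_query pilot_proxy_queue_time)"
#   prom_query 'sum(rate(pilot_push_triggers[1m])) by (type)'
```

A rising p99 convergence time while queue time stays flat would suggest the delay is in generating or sending config rather than in waiting for a push slot, though that reading is only a rule of thumb.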