istio: Restrictive network policies result in slow istio-proxy startups due to locality lookup

Bug description On deployments where a NetworkPolicy blocks egress traffic, istio-proxy startup is significantly delayed (roughly 10s) while it attempts to read the GCP instance metadata:

2019-09-24T18:21:41.246128Z     info    Reconciling retry (budget 10)
2019-09-24T18:21:41.246138Z     info    Epoch 0 starting
2019-09-24T18:21:41.246879Z     info    watching /etc/certs for changes
2019-09-24T18:21:50.654026Z     info    Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused
2019-09-24T18:21:51.264064Z     warn    Error fetching GCP zone: Get http://169.254.169.254/computeMetadata/v1/instance/zone: dial tcp 169.254.169.254:80: i/o timeout
2019-09-24T18:21:51.264459Z     info    Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster ssh --service-node sidecar~10.206.3.25~istio-6499cc9554-nrg5g.istio-tcp-issue~istio-tcp-issue.svc.cluster.local --max-obj-name-len 189 --local-address-ip-version v4 --allow-unknown-fields -l error --component-log-level misc:error --concurrency 1]

We run with global.localityLbSetting.enabled = false, so why does istio-proxy need zone information at all? Perhaps it could be looked up lazily, only when locality load balancing is actually enabled.
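For reference, this is roughly the Helm values override we install with (a minimal sketch; all other values omitted):

global:
  localityLbSetting:
    enabled: false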

In our particular use case, we inject a liveness probe against istio-pilot that is aggressive enough to treat this delay as a failed proxy, so the pod ends up in a crash loop (a sketch of such a probe follows the output below):

❯ k get pods
NAME                   READY   STATUS             RESTARTS   AGE
istio-6499cc9554-kxsvf   1/2     CrashLoopBackOff   6          11m
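A probe along these lines is enough to trigger the crash loop. This is only an illustrative sketch: the pilot-agent health endpoint (path /healthz/ready on port 15020) reflects the 1.3 defaults as far as we know, and the timings are made up for the example rather than our exact values.

livenessProbe:
  httpGet:
    path: /healthz/ready
    port: 15020
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3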

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Expected behavior Locality lookups should fail faster than 10s

Steps to reproduce the bug Set a restrictive NetworkPolicy that blocks egress traffic.
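For example, a default-deny egress policy along these lines reproduces the delayed startup (a sketch; the name and namespace are illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: istio-tcp-issue
spec:
  podSelector: {}
  policyTypes:
    - Egress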

Version (include the output of istioctl version --remote and kubectl version) 1.3.0

How was Istio installed? Helm

Environment where bug was observed (cloud vendor, OS, etc) GKE

Additionally, please consider attaching a cluster state archive (dump file) to this issue.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (14 by maintainers)

Most upvoted comments

@Minh-Ng yes, there are two issues here: the Locality() lookup causing slow starts, which your logs show is still happening to you (and for which I’ve started that draft PR), and the tracing configuration failing.

For the tracing configuration failure: to use Stackdriver as the tracer, you need a GCP project to report against. If that is not configured, tracing to Stackdriver will not work. Whether or not that should bork the bootstrap is a reasonable question, but the proxy will be in a “bad” state regardless (as tracing will fail). Let’s pursue the tracing bit in another issue.