istio: Restrictive network policies result in slow istio-proxy start ups due to locality lookup
Bug description On deployments whose network policy does not permit egress traffic, the startup of istio-proxy is significantly delayed (by roughly 10 seconds) while it attempts to read the GCP metadata:
2019-09-24T18:21:41.246128Z info Reconciling retry (budget 10)
2019-09-24T18:21:41.246138Z info Epoch 0 starting
2019-09-24T18:21:41.246879Z info watching /etc/certs for changes
2019-09-24T18:21:50.654026Z info Envoy proxy is NOT ready: failed retrieving Envoy stats: Get http://127.0.0.1:15000/stats?usedonly: dial tcp 127.0.0.1:15000: connect: connection refused
2019-09-24T18:21:51.264064Z warn Error fetching GCP zone: Get http://169.254.169.254/computeMetadata/v1/instance/zone: dial tcp 169.254.169.254:80: i/o timeout
2019-09-24T18:21:51.264459Z info Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster ssh --service-node sidecar~10.206.3.25~istio-6499cc9554-nrg5g.istio-tcp-issue~istio-tcp-issue.svc.cluster.local --max-obj-name-len 189 --local-address-ip-version v4 --allow-unknown-fields -l error --component-log-level misc:error --concurrency 1]
We run with global.localityLbSetting.enabled = false, so why does istio-proxy even need zonal information? Perhaps it could look this up as/when it is actually required/enabled?
In our particular use case, we inject a liveness probe to istio-pilot which is aggressive enough to flag this delay as a failed proxy, leaving it stuck in a crash loop:
❯ k get pods
NAME READY STATUS RESTARTS AGE
istio-6499cc9554-kxsvf 1/2 CrashLoopBackOff 6 11m
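For context, such an "aggressive" probe might look roughly like the sketch below. This is purely illustrative, not our exact probe; the port, path, and timings are assumptions, and the point is only that a sub-10s failure budget trips before the metadata lookup times out:

```yaml
# Hypothetical probe against the sidecar agent's health endpoint.
livenessProbe:
  httpGet:
    path: /healthz/ready   # pilot-agent health endpoint (assumed)
    port: 15020
  initialDelaySeconds: 1
  periodSeconds: 2
  failureThreshold: 3      # ~7s budget, less than the ~10s metadata stall
```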
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[x] Networking
[ ] Performance and Scalability
[x] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior Locality lookups should fail fast (well under 10s) when the metadata endpoint is unreachable.
Steps to reproduce the bug
Set a restrictive NetworkPolicy which blocks egress traffic.
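For reference, a minimal policy of this shape would look something like the following (the policy name is illustrative; the namespace is taken from the logs above):

```yaml
# Illustrative default-deny egress policy for all pods in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress    # hypothetical name
  namespace: istio-tcp-issue
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress: []               # no egress rules => all egress denied
```

With this applied, the sidecar's call to 169.254.169.254 cannot leave the pod, producing the i/o timeout shown in the logs.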
Version (include the output of istioctl version --remote and kubectl version)
1.3.0
How was Istio installed? Helm
Environment where bug was observed (cloud vendor, OS, etc) GKE
Additionally, please consider attaching a cluster state archive (dump file) to this issue.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 20 (14 by maintainers)
@Minh-Ng yes, there are two issues here: the Locality() one causing slow starts – which your logs show is still happening to you (and for which I’ve started that draft PR) – and the tracing configuration failing.
For the tracing configuration failing: to use Stackdriver as the tracer, you need to have a project to report against. If that is not configured, tracing to SD will not work. Whether or not that should bork the bootstrap is a reasonable question, but the proxy will be in a “bad” state (as tracing will fail). Let’s pursue the tracing bit in another issue.