istio: Slower sidecar proxy startup time on larger Kubernetes clusters
Istio Details
Istio Version: 0.4.0
Cluster Detail: GKE with Container-Optimized OS (cos)
Description
Hi,
I’ve just installed Istio 0.4.0 on a cluster with the default settings from istio.yaml and noticed that there is a window after pod startup during which the application cannot make any connections to other services (HTTP, database, etc.). The problem is that this window grows with cluster size and becomes unacceptable on larger clusters.
I used a small connection-checker program that tries to connect to a MySQL and an HTTP service every second. When running it on a cluster of 4 nodes and 11 pods (basically a fresh cluster with only Istio and the Bookinfo app installed), I saw a period of about 3 seconds during which the app could not make any outbound connections (indicated by ECONNREFUSED).
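For reference, a minimal sketch of such a checker might look like the following (the original program is not included here, so the addresses, service names, and file name are placeholders, not the ones actually used):

```go
// checker.go: a minimal sketch of a connection checker along the lines
// described above. Every second it attempts a plain TCP connect to a
// MySQL endpoint and an HTTP GET to another service, and logs the error
// if either fails (e.g. ECONNREFUSED while the sidecar is still starting).
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	mysqlAddr := "mysql.default.svc.cluster.local:3306"               // placeholder
	httpURL := "http://details.default.svc.cluster.local:9080/health" // placeholder

	client := &http.Client{Timeout: 2 * time.Second}

	for range time.Tick(1 * time.Second) {
		// Plain TCP connect to the MySQL endpoint.
		if conn, err := net.DialTimeout("tcp", mysqlAddr, 2*time.Second); err != nil {
			log.Printf("mysql connect failed: %v", err)
		} else {
			conn.Close()
			log.Printf("mysql connect ok")
		}

		// Simple HTTP request to another in-cluster service.
		if resp, err := client.Get(httpURL); err != nil {
			log.Printf("http request failed: %v", err)
		} else {
			resp.Body.Close()
			log.Printf("http request ok: %s", resp.Status)
		}
	}
}
```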
Three seconds would still be acceptable for production use cases, but when running the exact same application on a larger cluster consisting of 111 nodes and ~2100 pods, the period during which apps could not make any outbound connections increased to around 5 minutes or more.
I noticed that on the larger cluster, the Istio sidecar took much longer to get from "starting main dispatch loop" to "all clusters initialized" than on the smaller cluster:
larger cluster, start 07:16:35, end 07:19:52:
ERROR: logging before flag.Parse: I0111 07:16:34.393299 1 watcher.go:270] Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster api-istio-coba --service-node sidecar~10.32.16.138~api-istio-coba-1836146160-7rq5j.default~default.svc.cluster.local --max-obj-name-len 189 --service-zone AvailabilityZone couldn’t find the given cluster node]
[2018-01-11 07:16:35.537][18][warning][main] external/envoy/source/server/server.cc:157] initializing epoch 0 (hot restart version=9.200.16384.256)
[2018-01-11 07:16:35.753][18][warning][main] external/envoy/source/server/server.cc:344] starting main dispatch loop
[2018-01-11 07:19:52.149][18][warning][main] external/envoy/source/server/server.cc:328] all clusters initialized. initializing init manager
[2018-01-11 07:19:53.518][18][warning][config] external/envoy/source/server/listener_manager_impl.cc:448] all dependencies initialized. starting workers
[2018-01-11 07:20:56.177][18][warning][main] external/envoy/source/server/drain_manager_impl.cc:62] shutting down parent after drain
[libprotobuf INFO external/mixerclient_git/src/check_cache.cc:155] Add a new Referenced for check cache: Absence-keys: source.uid, Exact-keys: context.protocol, destination.ip, destination.service, destination.uid, source.ip,
smaller cluster, start 06:50:10, end 06:50:12:
ERROR: logging before flag.Parse: I0111 06:50:10.595332 1 watcher.go:270] Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster api-istio-coba --service-node sidecar~10.0.0.18~api-istio-coba-2549837242-j3b08.default~default.svc.cluster.local --max-obj-name-len 189]
[2018-01-11 06:50:10.620][10][warning][main] external/envoy/source/server/server.cc:157] initializing epoch 0 (hot restart version=9.200.16384.256)
[2018-01-11 06:50:10.846][10][warning][main] external/envoy/source/server/server.cc:344] starting main dispatch loop
[2018-01-11 06:50:12.393][10][warning][main] external/envoy/source/server/server.cc:328] all clusters initialized. initializing init manager
[2018-01-11 06:50:12.932][10][warning][config] external/envoy/source/server/listener_manager_impl.cc:448] all dependencies initialized. starting workers
[libprotobuf INFO external/mixerclient_git/src/check_cache.cc:155] Add a new Referenced for check cache: Absence-keys: source.uid, Exact-keys: context.protocol, destination.ip, destination.service, destination.uid, source.ip,
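For anyone wanting to measure the same window, here is a small sketch (not part of the original report) that scans an istio-proxy/Envoy log and prints the time between the two markers above. It assumes the default Envoy timestamp format shown in the excerpts; the file name and the kubectl pipe are just illustrative:

```go
// gap.go: compute the time between "starting main dispatch loop" and
// "all clusters initialized" in an Envoy sidecar log read from stdin,
// e.g.: kubectl logs <pod> -c istio-proxy | go run gap.go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// timestamp extracts the leading "[2018-01-11 07:16:35.753]" timestamp, if any.
func timestamp(line string) (time.Time, bool) {
	end := strings.Index(line, "]")
	if !strings.HasPrefix(line, "[") || end < 0 {
		return time.Time{}, false
	}
	t, err := time.Parse("2006-01-02 15:04:05.000", line[1:end])
	return t, err == nil
}

func main() {
	var start, end time.Time
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if t, ok := timestamp(line); ok {
			switch {
			case strings.Contains(line, "starting main dispatch loop"):
				start = t
			case strings.Contains(line, "all clusters initialized"):
				end = t
			}
		}
	}
	if start.IsZero() || end.IsZero() {
		fmt.Println("could not find both markers in the log")
		return
	}
	fmt.Printf("cluster warm-up window: %v\n", end.Sub(start))
}
```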
Is this expected behaviour? Is there anything we could do to speed up the startup?
Thanks, Andre
About this issue
- State: closed
- Created 6 years ago
- Reactions: 1
- Comments: 24 (9 by maintainers)
@jmarantz I definitely saw 10x improvements in startup times when upgrading to the latest proxy. Maybe that was the result of a different change (our previous version is based on Envoy from the end of October), but the 10x improvement was definitely there.
FWIW, the loading-time improvements only apply when compiling for hot restart. There’s still a big regex overhead independent of the previously O(numStats^2) startup overhead for hot restart. I have not noticed any stats-related slowdown when hot restart is compiled in, but the regex problems remain, and I’m exploring ways of speeding that up. Is the JSON fragment in the comment above enough to repro the slowdown you are seeing in Istio?
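To illustrate the shape of the O(numStats^2) startup cost mentioned above, here is a toy sketch (not Envoy code, just the general pattern): if every stat allocation does a linear scan over the stats already allocated, total startup work grows quadratically with the number of stats, whereas an indexed lookup keeps it roughly linear.

```go
// A toy comparison of linear-scan-per-insert (quadratic total work)
// versus indexed insert (roughly linear) when registering many stat names.
// The stat-name pattern and the count of 20,000 are made up for illustration.
package main

import (
	"fmt"
	"time"
)

// insertWithScan checks each new name against every existing entry,
// so total work is O(n^2) in the number of names.
func insertWithScan(names []string) {
	var table []string
	for _, n := range names {
		for _, existing := range table { // O(len(table)) per insert
			if existing == n {
				panic("duplicate")
			}
		}
		table = append(table, n)
	}
}

// insertWithIndex does the same de-duplication with a map,
// so each insert is amortized constant time.
func insertWithIndex(names []string) {
	table := make(map[string]struct{}, len(names))
	for _, n := range names {
		if _, dup := table[n]; dup {
			panic("duplicate")
		}
		table[n] = struct{}{}
	}
}

func main() {
	names := make([]string, 0, 20000)
	for i := 0; i < 20000; i++ {
		names = append(names, fmt.Sprintf("cluster.outbound_%d.upstream_rq_total", i))
	}

	start := time.Now()
	insertWithScan(names)
	fmt.Println("linear scan per insert:", time.Since(start))

	start = time.Now()
	insertWithIndex(names)
	fmt.Println("indexed insert:        ", time.Since(start))
}
```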