istio: Istio sidecar consuming high CPU (Istio 1.3.0-1.3.3)
Bug description: We are running Istio v1.3.3 (this also happened on v1.3.0). Most of our applications are running fine; however, in one of our deployments the CPU on the sidecar is pegged at 1.0 for the life of the pod. This happens for the majority of the pods in the same deployment, but not all. The pod contains nginx, istio-proxy, and an application container, all of which have normal CPU use except istio-proxy. I turned on debug logging on the sidecar, but the logs are not particularly interesting:
istio-proxy [2019-09-25 09:45:06.322][27][debug][client] [external/envoy/source/common/http/codec_client.cc:95] [C1145026] response complete
istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:214] Called Mixer::Filter : onDestroy state: 2
istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:199] [C1145026] response complete
istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:237] [C1145026] moving to ready
istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:228] Called Mixer::Filter : log
istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [./src/envoy/http/mixer/report_data.h:144] No dynamic_metadata found for filter envoy.filters.http.rbac
istio-proxy [2019-09-25 09:45:06.323][27][debug][filter] [src/istio/control/client_context_base.cc:138] Report attributes: attributes {
istio-proxy key: "connection.mtls"
istio-proxy value {
istio-proxy bool_value: false
istio-proxy }
istio-proxy }
istio-proxy attributes {
istio-proxy key: "context.protocol"
istio-proxy value {
istio-proxy string_value: "http"
istio-proxy }
istio-proxy }
istio-proxy attributes {
istio-proxy key: "context.prox
istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:520] [C1145208] remote close
istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:190] [C1145208] closing socket: 0
istio-proxy [2019-09-25 09:45:06.324][27][debug][main] [external/envoy/source/server/connection_handler_impl.cc:80] [C1145208] adding to cleanup list
istio-proxy [2019-09-25 09:45:06.324][27][debug][filter] [src/envoy/tcp/mixer/filter.cc:100] [C1138233] Called tcp filter onRead bytes: 17
istio-proxy [2019-09-25 09:45:06.324][28][debug][router] [external/envoy/source/common/router/router.cc:1002] [C637330][S2924433164096281138] upstream headers complete: end_stream=false
istio-proxy [2019-09-25 09:45:06.324][28][debug][filter] [src/envoy/http/mixer/filter.cc:141] Called Mixer::Filter : encodeHeaders 2
istio-proxy [2019-09-25 09:45:06.324][28][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1435] [C637330][S2924433164096281138] encoding headers via codec (end_stream=false):
I was able to collect an `strace -c` summary from the envoy PID of the problematic pod:
sudo strace -c -f -p 5296
strace: Process 5296 attached with 15 threads
^Cstrace: Process 5296 detached
strace: Process 5329 detached
strace: Process 5330 detached
strace: Process 5331 detached
strace: Process 5332 detached
strace: Process 5338 detached
strace: Process 5340 detached
strace: Process 5341 detached
strace: Process 5342 detached
strace: Process 5343 detached
strace: Process 5344 detached
strace: Process 5345 detached
strace: Process 5346 detached
strace: Process 5347 detached
strace: Process 6356 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
73.39 0.176384 256 687 44 futex
9.04 0.021728 49 436 nanosleep
8.82 0.021201 122 173 epoll_pwait
6.45 0.015504 3100 5 2 restart_syscall
0.64 0.001548 30 50 write
0.50 0.001208 11 109 50 read
0.31 0.000753 75 10 sched_yield
0.30 0.000715 14 50 25 accept4
0.24 0.000567 22 25 close
0.16 0.000385 7 50 epoll_ctl
0.08 0.000181 7 25 getsockname
0.07 0.000178 7 25 setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00 0.240352 1645 121 total
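The summary above is dominated by `futex` wake-ups. As a quick sketch (the `top_syscall` helper is hypothetical, not an strace feature), the hottest syscall can be pulled out of such an `strace -c` summary mechanically; the sample line below is copied from the table:

```shell
# Print the syscall name from the first data row of an `strace -c` summary.
# Data rows start with a numeric "% time" value (headers and separators do
# not), and the syscall name is the last field on the row.
top_syscall() {
  awk 'NF >= 5 && $1 + 0 > 0 { print $NF; exit }'
}

printf '%s\n' " 73.39    0.176384         256       687        44 futex" | top_syscall
```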
Here is the output of `kubectl -n $namespace exec -it $pod -c istio-proxy -- top` on a pod with a busy envoy:
top - 15:33:18 up 18:25, 0 users, load average: 1.58, 1.56, 1.53
Tasks: 3 total, 1 running, 2 sleeping, 0 stopped, 0 zombie
%Cpu(s): 22.5 us, 15.0 sy, 0.0 ni, 62.0 id, 0.0 wa, 0.0 hi, 0.4 si, 0.0 st
KiB Mem : 7807704 total, 1493152 free, 3275896 used, 3038656 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 4147820 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21 istio-p+ 20 0 228232 112156 19180 S 104.7 1.4 51:18.67 envoy
1 istio-p+ 20 0 138296 25400 20892 S 0.0 0.3 0:07.84 pilot-agent
33 istio-p+ 20 0 36836 3016 2584 R 0.0 0.0 0:00.00 top
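Since `top` inside the sidecar shows the envoy process itself at ~100% CPU, a next step is to see which of its threads is spinning. A minimal sketch reading `/proc` directly (Linux only; demonstrated here on the current shell's own PID `$$`, whereas on the node you would pass the envoy PID from the `top` output):

```shell
# Print per-thread CPU tick counters for a PID from /proc.
# In /proc/<pid>/task/<tid>/stat, field 1 is the TID, field 14 is utime and
# field 15 is stime (in clock ticks); a spinning envoy worker thread shows
# these growing rapidly between samples. Field positions assume the thread's
# comm name contains no spaces, which holds for envoy worker threads.
per_thread_ticks() {
  for stat in /proc/"$1"/task/*/stat; do
    awk '{ print "tid=" $1, "utime=" $14, "stime=" $15 }' "$stat"
  done
}

per_thread_ticks $$
```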
There is a forum thread about this, but it remains open without any resolution. So far the only workaround has been to downgrade the sidecar to 1.2.6:
https://discuss.istio.io/t/istio-sidecar-consuming-high-cpu/3894
Any ideas on how I can troubleshoot further?
Affected product area (please put an X in all that apply)
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[ ] Networking
[x] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure
Expected behavior: CPU usage not spiking.
Steps to reproduce the bug: unable to reproduce.
Version (include the output of istioctl version --remote and kubectl version)
citadel version: 1.3.0-1.3.3
galley version: 1.3.0-1.3.3
ingressgateway version: 1.3.0-1.3.3
pilot version: 1.3.0-1.3.3
policy version: 1.3.0-1.3.3
sidecar-injector version: 1.3.0-1.3.3
telemetry version: 1.3.0-1.3.3
How was Istio installed? Helm chart
Environment where bug was observed (cloud vendor, OS, etc): Amazon EKS
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 7
- Comments: 30 (13 by maintainers)
Try `--set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s --set global.proxy.protocolDetectionTimeout=0s`. Somehow we managed to make this extremely complicated to configure… To verify it applied correctly, run `curl localhost:15000/config_dump | grep listener_filters_timeout`. Everything should be `"listener_filters_timeout": 0`.
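To check a saved `config_dump` without eyeballing the grep output, here is a small sketch. The `check_timeouts` helper is hypothetical, and it assumes the zeroed timeout is serialized as a quoted duration string such as `"0s"` (in practice you would pipe in `curl -s localhost:15000/config_dump`; the sample line below is fabricated):

```shell
# Succeed (exit 0) only when every listener_filters_timeout in the input is
# "0s"; otherwise print the offending values and fail.
check_timeouts() {
  ! grep -o '"listener_filters_timeout": *"[^"]*"' | grep -v '"0s"'
}

printf '%s\n' '"listener_filters_timeout": "0s"' | check_timeouts \
  && echo "all listener_filters_timeout values are 0s"
```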
@lambdai found the bug tonight. A fix will be out shortly.
In the meantime, we think setting global.proxy.protocolDetectionTimeout = 0 will address the issue for unpatched 1.3 versions. See https://istio.io/docs/reference/config/installation-options/#global-options
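In Helm values form, the same mitigation looks roughly like this (a sketch; the key paths follow the `--set` flags quoted above for the 1.3 chart):

```yaml
# values override for the 1.3 Helm chart (sketch)
pilot:
  env:
    PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT: "0s"
global:
  proxy:
    protocolDetectionTimeout: "0s"
```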
This is fixed in 1.3.5 (released today) and 1.4.0 (releasing soon), and mitigations are provided for other releases, so I think it's safe to close this. Thanks everyone for helping us track this down.
Experienced the same issue with Istio 1.3.4. The fix of disabling the protocol detection timeout worked for me.
I have redeployed Istio 1.3.1 into one of our environments with these settings and can confirm that the high CPU usage disappears on the sidecars.
I did the test on my production env where this issue happens, using `docker.io/francoispesce/proxyv2:1.3.3-gdb` on a pod with a high-CPU envoy; for comparison, I also ran it on a pod whose envoy showed no high CPU usage.
Strangely, after quitting the gdb process, the envoy process was killed (it no longer showed up in `ps auxw`). Moreover, in previous debugging we were able to see (using `strace` on the node) that envoy was in a loop of `epoll_wait()` on a socket from an established connection from another pod in the mesh. A weird thing was that only 2 segments were sent in/out on that connection.
Running `strace -c -f -p $pid` and `strace -f -r -p $pid` on the node where the busy envoy runs, we see lots of
We have recently upgraded from 1.2.0 to 1.3.1 and are experiencing the same issue.
Did some investigation, and it seems the issue for us only appears when Istio acts as an HTTP proxy. The sidecar uses anywhere between 1 and 2 CPU cores if the service definition has `name: http` in the port section (as the documentation recommends). If the sidecar acts only as a TCP proxy, there is no issue.
At the moment we only have liveness and readiness checks running, no real traffic.
In the proxy logs I'm seeing the switch when I add `name: http` to the service definition port section.
Log section:
service2 and service3 have `name: http` in the port definition; the others do not.
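For reference, the difference is just the port name in the Service spec. A sketch (service names are hypothetical), where the first variant makes the sidecar do HTTP protocol handling and the second leaves it as plain TCP:

```yaml
# Port named "http": sidecar proxies this as HTTP (triggers the high CPU
# on affected 1.3.x sidecars)
apiVersion: v1
kind: Service
metadata:
  name: service2          # hypothetical name
spec:
  selector:
    app: service2
  ports:
  - name: http
    port: 80
---
# Unnamed port: sidecar treats the traffic as plain TCP, no high CPU
apiVersion: v1
kind: Service
metadata:
  name: service1          # hypothetical name
spec:
  selector:
    app: service1
  ports:
  - port: 80
```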
CPU usage with Istio 1.3.1:
Downgraded back to Istio 1.2.0 and the issue disappeared. CPU usage under 1.2.0:
Istio is deployed with the following helm template generation: