istio: Istio sidecar consuming high CPU, Istio 1.3.0-1.3.3

Bug description We are running Istio v1.3.3 (this also happened on v1.3.0), and most of our applications are running fine. However, in one of our deployments the CPU of the sidecar is pegged at 1.0 for the life of the pod. This happens for the majority of the pods in that deployment, but not all. The pod contains nginx, istio-proxy and an application container, all of which show normal CPU use except istio-proxy. I turned on debug logging on the sidecar, but the logs are not particularly interesting.

 istio-proxy [2019-09-25 09:45:06.322][27][debug][client] [external/envoy/source/common/http/codec_client.cc:95] [C1145026] response complete
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:214] Called Mixer::Filter : onDestroy state: 2
 istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:199] [C1145026] response complete
 istio-proxy [2019-09-25 09:45:06.322][27][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:237] [C1145026] moving to ready
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [src/envoy/http/mixer/filter.cc:228] Called Mixer::Filter : log
 istio-proxy [2019-09-25 09:45:06.322][27][debug][filter] [./src/envoy/http/mixer/report_data.h:144] No dynamic_metadata found for filter envoy.filters.http.rbac
 istio-proxy [2019-09-25 09:45:06.323][27][debug][filter] [src/istio/control/client_context_base.cc:138] Report attributes: attributes {
 istio-proxy   key: "connection.mtls"
 istio-proxy   value {
 istio-proxy     bool_value: false
 istio-proxy   }
 istio-proxy }
 istio-proxy attributes {
 istio-proxy   key: "context.protocol"
 istio-proxy   value {
 istio-proxy     string_value: "http"
 istio-proxy   }
 istio-proxy }
 istio-proxy attributes {
 istio-proxy   key: "context.prox
 istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:520] [C1145208] remote close
 istio-proxy [2019-09-25 09:45:06.324][27][debug][connection] [external/envoy/source/common/network/connection_impl.cc:190] [C1145208] closing socket: 0
 istio-proxy [2019-09-25 09:45:06.324][27][debug][main] [external/envoy/source/server/connection_handler_impl.cc:80] [C1145208] adding to cleanup list
 istio-proxy [2019-09-25 09:45:06.324][27][debug][filter] [src/envoy/tcp/mixer/filter.cc:100] [C1138233] Called tcp filter onRead bytes: 17
 istio-proxy [2019-09-25 09:45:06.324][28][debug][router] [external/envoy/source/common/router/router.cc:1002] [C637330][S2924433164096281138] upstream headers complete: end_stream=false
 istio-proxy [2019-09-25 09:45:06.324][28][debug][filter] [src/envoy/http/mixer/filter.cc:141] Called Mixer::Filter : encodeHeaders 2
 istio-proxy [2019-09-25 09:45:06.324][28][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1435] [C637330][S2924433164096281138] encoding headers via codec (end_stream=false):

I was able to collect an strace from the PID of the problematic pod

sudo strace -c -f -p 5296
strace: Process 5296 attached with 15 threads
^Cstrace: Process 5296 detached
strace: Process 5329 detached
strace: Process 5330 detached
strace: Process 5331 detached
strace: Process 5332 detached
strace: Process 5338 detached
strace: Process 5340 detached
strace: Process 5341 detached
strace: Process 5342 detached
strace: Process 5343 detached
strace: Process 5344 detached
strace: Process 5345 detached
strace: Process 5346 detached
strace: Process 5347 detached
strace: Process 6356 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.39    0.176384         256       687        44 futex
  9.04    0.021728          49       436           nanosleep
  8.82    0.021201         122       173           epoll_pwait
  6.45    0.015504        3100         5         2 restart_syscall
  0.64    0.001548          30        50           write
  0.50    0.001208          11       109        50 read
  0.31    0.000753          75        10           sched_yield
  0.30    0.000715          14        50        25 accept4
  0.24    0.000567          22        25           close
  0.16    0.000385           7        50           epoll_ctl
  0.08    0.000181           7        25           getsockname
  0.07    0.000178           7        25           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    0.240352                  1645       121 total
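
For reference, this is roughly how the envoy PID was located on the node before attaching strace. This is a sketch: it assumes a Docker container runtime on the node (adjust for containerd/CRI-O), and `$NAMESPACE`/`$POD` are placeholders for the affected pod.

```shell
# Sketch, assuming a Docker runtime on the node; NAMESPACE/POD are placeholders.
CID=$(kubectl -n "$NAMESPACE" get pod "$POD" \
  -o jsonpath='{.status.containerStatuses[?(@.name=="istio-proxy")].containerID}' \
  | sed 's|docker://||')
PID=$(docker inspect --format '{{.State.Pid}}' "$CID")
sudo strace -c -f -p "$PID"    # Ctrl-C after a while to print the syscall summary
```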

Here is the output of kubectl -n $namespace exec -it $pod -c istio-proxy -- top on a pod with a busy envoy

top - 15:33:18 up 18:25,  0 users,  load average: 1.58, 1.56, 1.53
Tasks:   3 total,   1 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 22.5 us, 15.0 sy,  0.0 ni, 62.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  7807704 total,  1493152 free,  3275896 used,  3038656 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4147820 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   21 istio-p+  20   0  228232 112156  19180 S 104.7  1.4  51:18.67 envoy
    1 istio-p+  20   0  138296  25400  20892 S   0.0  0.3   0:07.84 pilot-agent
   33 istio-p+  20   0   36836   3016   2584 R   0.0  0.0   0:00.00 top

There is a forum thread about this, but it has been ongoing without any resolution. So far the only workaround has been to downgrade the sidecar to 1.2.6:

https://discuss.istio.io/t/istio-sidecar-consuming-high-cpu/3894

Any ideas on how I can troubleshoot further?

Affected product area (please put an X in all that apply)

[ ] Configuration Infrastructure [ ] Docs [ ] Installation [ ] Networking [x] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure

Expected behavior Sidecar CPU usage stays at normal levels

Steps to reproduce the bug Unable to reproduce

Version (include the output of istioctl version --remote and kubectl version)

citadel version: 1.3.0-1.3.3 galley version: 1.3.0-1.3.3 ingressgateway version: 1.3.0-1.3.3 pilot version: 1.3.0-1.3.3 policy version: 1.3.0-1.3.3 sidecar-injector version: 1.3.0-1.3.3 telemetry version: 1.3.0-1.3.3

How was Istio installed? helm chart

Environment where bug was observed (cloud vendor, OS, etc) amazon eks

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 7
  • Comments: 30 (13 by maintainers)

Most upvoted comments

Try --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s --set global.proxy.protocolDetectionTimeout=0s. Somehow we managed to make this extremely complicated to configure…

To verify it applied correctly, run curl localhost:15000/config_dump | grep listener_filters_timeout. Everything should be "listener_filters_timeout": 0
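
A sketch of running that verification from outside the pod; `$NAMESPACE`/`$POD` are placeholders, and it assumes curl is available in the istio-proxy container (it normally is):

```shell
# Sketch: query the Envoy admin endpoint of a sidecar from outside the pod.
kubectl -n "$NAMESPACE" exec "$POD" -c istio-proxy -- \
  sh -c 'curl -s localhost:15000/config_dump | grep listener_filters_timeout'
# Every match should show a zero timeout, as described above.
```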

@lambdai found the bug tonight. fix will be out shortly.

In the meantime, we think setting global.proxy.protocolDetectionTimeout = 0 will address the issue for unpatched 1.3 versions. See https://istio.io/docs/reference/config/installation-options/#global-options
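
For a Helm-based install like the one described in this issue, the mitigation can be applied by re-rendering the chart with the two overrides. A sketch; the chart path and release name are examples:

```shell
# Sketch: add the two overrides to an existing helm template invocation
# (chart path and release name are examples; adjust to your install).
helm template istio-1.3.3/install/kubernetes/helm/istio \
  --name istio --namespace istio-system \
  --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s \
  --set global.proxy.protocolDetectionTimeout=0s \
  | kubectl apply -f -
```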

This is fixed in 1.3.5 (released today) and 1.4.0 (releasing soon), and mitigations are provided for other releases, so I think it's safe to close this. Thanks everyone for helping us track this down.

Experienced the same issue with Istio 1.3.4. Disabling the protocol detection timeout fixed it for me.

Try --set pilot.env.PILOT_INBOUND_PROTOCOL_DETECTION_TIMEOUT=0s --set global.proxy.protocolDetectionTimeout=0s

I have redeployed Istio 1.3.1 into one of our environments with these settings and can confirm that the high CPU usage disappears on the sidecars.

I did the test on my production env where this issue happens, using docker.io/francoispesce/proxyv2:1.3.3-gdb

On a pod with a high-CPU envoy:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f1a5998ed00 (LWP 24) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7f1a58a75700 (LWP 25) "default-executo" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f1a58274700 (LWP 26) "resolver-execut" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f1a57a73700 (LWP 27) "grpc_global_tim" 0x00007f1a58e4d709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f1a57272700 (LWP 28) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7f1a56a71700 (LWP 29) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7f1a56270700 (LWP 30) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7f1a55a6f700 (LWP 31) "envoy" 0x00007f1a58b7da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7f1a5526e700 (LWP 32) "envoy" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7f1a54a6d700 (LWP 33) "envoy" 0x00007f1a58e4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7f1a5426c700 (LWP 34) "envoy" 0x00007f1a58b774d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6

For comparison, on a pod with a normal-CPU envoy I've got this:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f940423ed00 (LWP 22) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  2    Thread 0x7f9403325700 (LWP 23) "default-executo" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7f9402b24700 (LWP 24) "resolver-execut" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  4    Thread 0x7f9402323700 (LWP 25) "grpc_global_tim" 0x00007f94036fd709 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  5    Thread 0x7f9401b22700 (LWP 26) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  6    Thread 0x7f9401321700 (LWP 27) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  7    Thread 0x7f9400b20700 (LWP 28) "envoy" 0x00007f94034274d9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
  8    Thread 0x7f940031f700 (LWP 29) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  9    Thread 0x7f93ffb1e700 (LWP 30) "envoy" 0x00007f940342da13 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
  10   Thread 0x7f93ff31d700 (LWP 31) "envoy" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
  11   Thread 0x7f93feb1c700 (LWP 32) "envoy" 0x00007f94036fd360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

Strangely, after quitting gdb, the envoy process was killed (it no longer showed up in ps auxw).

Moreover, in earlier debugging we were able to see (using strace on the node) that envoy was in a loop of epoll_wait() on a socket from an established connection from another pod in the mesh. A weird thing was that only 2 segments had been sent in/out on that connection.

strace -c -f -p $pid

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 70.69   17.048433          38    448634           epoll_wait
 29.14    7.028070       42594       165        27 futex
  0.17    0.040000       40000         1         1 restart_syscall
  0.00    0.000000           0        58           read
  0.00    0.000000           0        59           write
  0.00    0.000000           0         1           close
  0.00    0.000000           0        41        20 readv
  0.00    0.000000           0        20           writev
  0.00    0.000000           0         9           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00   24.116503                448988        48 total

Running strace -f -r -p $pid on the node where the busy envoy runs, we see lots of:

[pid 26651]      0.000036 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000036 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000037 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000038 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000042 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000045 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000562 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 5) = 1
[pid 26651]      0.000051 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 1) = 1
[pid 26651]      0.000038 epoll_wait(15, [{EPOLLOUT, {u32=603, u64=603}}], 128, 1) = 1
tcp    ESTAB      0      0      x.x.x.202:15006              x.x.x.203:36810               users:(("envoy",pid=29,fd=603))
     skmem:(r0,rb12582912,t0,tb12582912,f0,w0,o0,bl0) ts sack cubic wscale:9,9 rto:204 rtt:0.013/0.006 mss:8899 cwnd:10 segs_in:2 send 54763.1Mbps lastsnd:38913588 lastrcv:38913588 lastack:38913588 rcv_space:26697
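
The socket above was correlated with the fd that epoll_wait keeps returning roughly like this (a sketch; the pid/fd values come from the trace above and will differ on your node):

```shell
# Sketch: find the TCP socket behind the fd epoll_wait keeps returning.
# fd=603 is the value from the strace output above; yours will differ.
sudo ss -tenip | grep 'fd=603'
```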

We recently upgraded from 1.2.0 to 1.3.1 and are experiencing the same issue.

Did some investigation, and it seems the issue only appears for us when Istio acts as an HTTP proxy. The sidecar uses anywhere between 1 and 2 CPU cores if the service definition has name: http in the port section (as the documentation recommends). If the sidecar acts only as a TCP proxy, there is no issue.

For example, we only have liveness and readiness checks running at the moment, no real traffic.

In the proxy logs I see the switch when I add 'name: http' to the port section of the service definition:

   - name: http
     nodePort: 30920
     port: 80
     protocol: TCP
     targetPort: 6450
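
For context, here is a complete (hypothetical) Service manifest showing where that port name lives; the service name, selector, and ports are examples matching the fragment above. Renaming "http" to "tcp" (or removing the name) keeps the sidecar on the TCP-proxy code path:

```shell
# Sketch: a hypothetical Service manifest; names and ports are examples.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: service2        # hypothetical name
  namespace: service2
spec:
  type: NodePort
  selector:
    app: service2
  ports:
  - name: http          # this name triggers the HTTP proxy code path
    nodePort: 30920
    port: 80
    protocol: TCP
    targetPort: 6450
EOF
```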

Log section:

[2019-11-04T10:44:57.965Z] "- - -" 0 - "-" "-" 0 0 4002 - "-" "-" "-" "-" "127.0.0.1:6450" inbound|80||service2.service2.svc.cluster.local 127.0.0.1:44098 172.XX.XX.56:6450 172.XX.XX.4:61737 - -
[2019-11-04T10:45:03.751Z] "GET /liveness HTTP/1.1" 200 - "-" "-" 0 32 6 6 "-" "kube-probe/1.14" "3254795b-0327-469e-9fec-866184147f52" "172.XX.XX.56:6450" "127.0.0.1:6450" inbound|80|http|service2.service2.svc.cluster.local - 172.XX.XX.56:6450 172.XX.XX..35:53446 - default
[2019-11-04T10:45:03.870Z] "GET /readiness HTTP/1.1" 200 - "-" "-" 0 215 5 4 "-" "kube-probe/1.14" "85ef2b38-03c3-4519-8601-ff0fc28745fd" "172.XX.XX.56:6450" "127.0.0.1:6450" inbound|80|http|service2.service2.svc.cluster.local - 172.XX.XX.56:6450 172.XX.XX.35:53452 - default

Service2 and service3 have 'name: http' in their port definitions; the others do not.

CPU usage with Istio 1.3.1:

NAMESPACE    NAME                        CPU(cores)   MEMORY(bytes)
service1     service1-65f7d9d8c4-26cd4   8m           111Mi
service2     service2-7f75b699d4-29q85   1115m        166Mi
service3     service3-6fc5b956cf-sbmbt   1502m        108Mi
service4     service4-6f594f98c-zdq5p    10m          169Mi

Downgraded back to Istio 1.2.0 and the issue disappeared. CPU usage under 1.2.0:

NAMESPACE    NAME                        CPU(cores)   MEMORY(bytes)
service1     service1-65f7d9d8c4-zdq5p   8m           109Mi
service2     service2-7f75b699d4-hp5dq   7m           169Mi
service3     service3-6fc5b956cf-xt9bc   2m           100Mi
service4     service4-6f594f98c-s9np6    7m           170Mi

Istio is deployed with the following helm template invocation:

helm template /istio-${ISTIO_VERSION}/install/kubernetes/helm/istio \
                --name istio \
                --namespace istio-system \
                --set gateways.istio-ingressgateway.serviceAnnotations."service\.beta\.kubernetes\.io/azure-dns-label-name"="$DNS_LABEL_NAME" \
                --set gateways.istio-egressgateway.enabled='true' \
                --set global.outboundTrafficPolicy.mode="REGISTRY_ONLY" \
                --set global.proxy.accessLogFile="/dev/stdout" \
                --set global.controlPlaneSecurityEnabled=true \
                --set mixer.adapters.useAdapterCRDs=false \
                --set grafana.enabled=true \
                --set tracing.enabled=true \
                --set kiali.enabled=true \
                --set "kiali.dashboard.jaegerURL=http://jaeger-query:16686" \
                --set "kiali.dashboard.grafanaURL=http://grafana:3000" \
                > /home/admin/istio-$ISTIO_VERSION.yaml

edit: fixed bad formatting