istio: Potential memory leak from WASM NullPlugin onNetworkNewConnection
Bug description Istio-proxy is consuming an increasing amount of memory for an application with moderate HTTP traffic (10-20 qps).
[ ] Configuration Infrastructure [ ] Docs [ ] Installation [ ] Networking [x] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure
Expected behavior Istio-proxy memory should stay relatively constant over time.
Steps to reproduce the bug Here is a heap profile flame graph after the service has been up for 12+ hours:
The flame graph seems to indicate that the heap has been accumulating blocks allocated from operator() calls through the NullPlugin, although I am at a bit of a loss to figure out a workaround from reading the relevant source code.
In case it’s helpful, here is the text version of the heap profile:
File: envoy
Type: inuse_space
Showing nodes accounting for 33124.62kB, 98.34% of 33682.43kB total
Dropped 482 nodes (cum <= 168.41kB)
      flat  flat%   sum%        cum   cum%
27546.82kB 81.78% 81.78% 33411.07kB 99.19%  std::__1::__function::__func::operator()
 1816.88kB  5.39% 87.18%  2421.37kB  7.19%  std::__1::__hash_table::__emplace_unique_key_args
 1449.59kB  4.30% 91.48%  2052.53kB  6.09%  std::__1::__hash_table::__emplace_unique_impl
 1206.61kB  3.58% 95.06%  1206.61kB  3.58%  std::__1::__hash_table::__rehash
  819.49kB  2.43% 97.50%   819.49kB  2.43%  OPENSSL_malloc
  246.34kB  0.73% 98.23%   246.34kB  0.73%  std::__1::basic_string::basic_string
   28.98kB 0.086% 98.31%  1025.07kB  3.04%  Envoy::Extensions::TransportSockets::Tls::ContextManagerImpl::createSslClientContext
    7.83kB 0.023% 98.34% 32049.05kB 95.15%  Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
    2.06kB 0.0061% 98.34% 32053.88kB 95.16%  Envoy::Network::ListenerImpl::listenCallback
         0     0% 98.34%   613.34kB  1.82%  ASN1_item_d2i
         0     0% 98.34%  1439.37kB  4.27%  Envoy::Config::GrpcMuxImpl::onDiscoveryResponse
         0     0% 98.34%  1359.13kB  4.04%  Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate
         0     0% 98.34%  1144.74kB  3.40%  Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_0::__invoke
         0     0% 98.34% 29594.81kB 87.86%  Envoy::Extensions::Common::Wasm::Context::onCreate
         0     0% 98.34% 32009.40kB 95.03%  Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
         0     0% 98.34% 29594.81kB 87.86%  Envoy::Extensions::Common::Wasm::Null::NullPlugin::ensureContext
         0     0% 98.34%  2414.76kB  7.17%  Envoy::Extensions::Common::Wasm::Null::Plugin::Stats::PluginContext::onNewConnection
         0     0% 98.34%   992.35kB  2.95%  Envoy::Extensions::TransportSockets::Tls::ClientContextImpl::ClientContextImpl
         0     0% 98.34%  1025.07kB  3.04%  Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::onAddOrUpdateSecret
         0     0% 98.34%  1081.84kB  3.21%  Envoy::Extensions::TransportSockets::Tls::ContextImpl::ContextImpl
         0     0% 98.34%  1433.78kB  4.26%  Envoy::Grpc::AsyncStreamCallbacks::onReceiveMessageRaw
         0     0% 98.34%  1430.18kB  4.25%  Envoy::Grpc::AsyncStreamImpl::onData
         0     0% 98.34%  1427.35kB  4.24%  Envoy::Http::AsyncStreamImpl::encodeData
         0     0% 98.34%    1223kB   3.63%  Envoy::Http::CodecClient::CodecReadFilter::onData
         0     0% 98.34%  1277.10kB  3.79%  Envoy::Http::CodecClient::onData
         0     0% 98.34%  1293.62kB  3.84%  Envoy::Http::Http2::ConnectionImpl::dispatch
         0     0% 98.34%  1402.62kB  4.16%  Envoy::Http::Http2::ConnectionImpl::onFrameReceived
         0     0% 98.34%  1409.21kB  4.18%  Envoy::Http::ResponseDecoderWrapper::decodeData
         0     0% 98.34%   753.30kB  2.24%  Envoy::MainCommonBase::run
         0     0% 98.34%  1178.06kB  3.50%  Envoy::Network::ConnectionImpl::onFileEvent
         0     0% 98.34%  1219.07kB  3.62%  Envoy::Network::ConnectionImpl::onReadReady
         0     0% 98.34% 32014.61kB 95.05%  Envoy::Network::FilterManagerImpl::initializeReadFilters
         0     0% 98.34% 33233.24kB 98.67%  Envoy::Network::FilterManagerImpl::onContinueReading
         0     0% 98.34%  1420.46kB  4.22%  Envoy::Router::UpstreamRequest::decodeData
         0     0% 98.34%  1359.39kB  4.04%  Envoy::Secret::SdsApi::onConfigUpdate
         0     0% 98.34% 32050.53kB 95.16%  Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::onAcceptWorker
         0     0% 98.34% 32050.53kB 95.16%  Envoy::Server::ConnectionHandlerImpl::ActiveTcpSocket::continueFilterChain
         0     0% 98.34%   886.31kB  2.63%  Envoy::Server::InstanceImpl::run
         0     0% 98.34% 32102.04kB 95.31%  Envoy::Server::WorkerImpl::threadRoutine
         0     0% 98.34%   220.22kB  0.65%  Envoy::Ssl::TlsCertificateConfigImpl::TlsCertificateConfigImpl
         0     0% 98.34% 32101.35kB 95.31%  Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function)::$_0::__invoke
         0     0% 98.34%   403.05kB  1.20%  PEM_ASN1_read_bio
         0     0% 98.34%   195.30kB  0.58%  PEM_X509_INFO_read_bio
         0     0% 98.34%   571.81kB  1.70%  [libc-2.27.so]
         0     0% 98.34%   223.61kB  0.66%  asn1_d2i_ex_primitive
         0     0% 98.34%   223.61kB  0.66%  asn1_ex_c2i
         0     0% 98.34%   175.63kB  0.52%  asn1_item_ex_combine_new
         0     0% 98.34%   613.34kB  1.82%  asn1_item_ex_d2i
         0     0% 98.34%   454.71kB  1.35%  asn1_template_ex_d2i
         0     0% 98.34%   454.71kB  1.35%  asn1_template_noexp_d2i
         0     0% 98.34%   207.76kB  0.62%  d2i_X509_AUX
         0     0% 98.34% 33099.96kB 98.27%  event_base_loop
         0     0% 98.34% 33183.10kB 98.52%  event_process_active_single_queue
         0     0% 98.34% 32053.88kB 95.16%  listener_read_cb
         0     0% 98.34%   675.96kB  2.01%  main
         0     0% 98.34%  1326.59kB  3.94%  nghttp2_session_mem_recv
         0     0% 98.34%  1401.21kB  4.16%  nghttp2_session_on_data_received
         0     0% 98.34%   195.30kB  0.58%  parse_x509
         0     0% 98.34% 32100.56kB 95.30%  start_thread
         0     0% 98.34%   602.94kB  1.79%  std::__1::__hash_table::__node_insert_unique_prepare
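To make the profile above easier to read, one can rank the entries by cumulative size; a small, assumption-free sketch (not part of the original report) that parses a few sample lines from a `pprof -text` in-use heap profile and sorts them, which makes the dominance of the newConnection/onNetworkNewConnection path obvious:

```python
# Rank pprof text-report entries by cumulative allocation size.
# The sample lines are copied from the profile above.
PROFILE = """\
7.83kB 0.023% 98.34% 32049.05kB 95.15% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
0 0% 98.34% 32009.40kB 95.03% Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Context::onCreate
0 0% 98.34% 819.49kB 2.43% OPENSSL_malloc
"""

def kb(field: str) -> float:
    """Parse a pprof size field such as '32049.05kB' or a bare '0'."""
    return float(field.rstrip("kB") or 0)

def top_cumulative(text: str):
    """Return (cum_kB, symbol) pairs sorted largest-first."""
    rows = []
    for line in text.splitlines():
        flat, flat_pct, sum_pct, cum, cum_pct, name = line.split(maxsplit=5)
        rows.append((kb(cum), name))
    return sorted(rows, reverse=True)

for cum_kb, name in top_cumulative(PROFILE):
    print(f"{cum_kb:10.2f}kB  {name}")
```

Sorting by the `cum` column rather than `flat` is what surfaces the call path: the bytes are allocated inside `operator()` frames, but they are retained via the connection-creation path.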
Version (include the output of istioctl version --remote, kubectl version, and helm version if you used Helm)
$ istioctl version --remote
client version: 1.6.2
control plane version: 1.6.2
data plane version: 1.6.2 (20 proxies)
While I was only able to dig into this after the upgrade to 1.6.2, I have observed the increased memory consumption since 1.5.2.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
istioctl manifest apply -f/istioctl upgrade -f
Environment where bug was observed (cloud vendor, OS, etc)
AWS EKS 1.6 control plane and worker nodes
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 41 (19 by maintainers)
I believe I have run into the same leak. On our most severely affected pod, the heap pprof flamegraph looks more or less identical to the image in the initial report above. Deleting the stats filter as suggested releases the "leaked" memory, as shown in this screenshot (it's only over a 5-minute range, so you can't see the growth of the leak very well).
It is interesting to note that we only have outbound stats enabled, and sure enough the services running in our cluster that frequently open many outbound connections (e.g. non-pooled connections to the k8s API servers and various database flavors) leak much, much faster than those utilizing HTTP keep-alives and DB connection pools. This aligns with the heap flamegraph showing high allocations in the newConnection path. Services with high inbound connection open rates do not appear to leak; I think we've lucked out here a bit since our inbound open rate is much higher than our outbound one : ).

The memory leak is slow, about 2-5 MB/hour depending on the cluster (dev, test, or production). A quick, rough look via tcpdump shows a connection open rate of ~2 connections/sec.
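Those two measurements can be cross-checked with a back-of-the-envelope calculation (my own sketch, not from the issue): if the leak comes from per-connection state that is created on each new connection and never freed, the leak rate divided by the connection open rate estimates the bytes retained per connection.

```python
# Estimate bytes leaked per connection from the observed numbers above.
# Assumption: the leak scales linearly with connections opened.
leak_rates_mb_per_hour = (2.0, 5.0)  # observed range across clusters
opens_per_second = 2.0               # rough tcpdump measurement

for mb_per_hour in leak_rates_mb_per_hour:
    bytes_per_conn = mb_per_hour * 1024 * 1024 / (opens_per_second * 3600)
    print(f"{mb_per_hour} MB/h at {opens_per_second} conn/s -> "
          f"~{bytes_per_conn:.0f} bytes per connection")
```

A few hundred bytes per connection is plausible for a small retained context object plus hash-table bookkeeping, which is consistent with the `__hash_table` and WASM context frames in the heap profile.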
Happy to provide further traces, tests, etc. to get to the root cause.
Yes, this has been fixed in 1.8.1+. It will also be fixed in 1.7.next.
Yes, it will be in the next 1.7 release.
FWIW, I deployed 1.8.1 to my environments and I no longer observe the original memory leak (with TCP metrics envoy filter enabled). But I guess @kyessenov, @PiotrSikora, and/or @bianpengyuan can give the more official word on this.
@Kyslik Sorry, I meant we did not catch the last 1.6 release. At best we can backport this to Istio 1.7.
@Kyslik 1.6 has just shipped its last release. @kyessenov do you think it is possible to backport to 1.7?
@kyessenov for the service we're seeing most affected by this issue, the opened_total is slowly but continuously growing (60,000), whereas the closed_total is nearly constant and close to zero (< 10).
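For anyone wanting to reproduce this comparison, the counters can be read from the proxy's Prometheus endpoint (typically `localhost:15090/stats/prometheus` on an istio-proxy sidecar) and diffed. A minimal sketch, where the sample payload and label set are illustrative rather than copied from a real proxy:

```python
# Compare Istio's TCP connection counters from a Prometheus text exposition.
# SAMPLE stands in for the output of the proxy's /stats/prometheus endpoint.
SAMPLE = """\
istio_tcp_connections_opened_total{reporter="source"} 60000
istio_tcp_connections_closed_total{reporter="source"} 8
"""

def counter(text: str, name: str) -> float:
    """Return the value of the first sample whose name matches `name`."""
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(None, 1)[-1])
    raise KeyError(name)

opened = counter(SAMPLE, "istio_tcp_connections_opened_total")
closed = counter(SAMPLE, "istio_tcp_connections_closed_total")
print(f"opened={opened:.0f} closed={closed:.0f} "
      f"never-closed={opened - closed:.0f}")
```

A large and growing opened-minus-closed gap is exactly the signature described above: connection contexts are being created on open but their close events never fire.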
I managed to disable HTTP keepalive on the server side, but the leak is still occurring: the tcp connections opened metric grows much larger than the tcp connections closed metric over time. At this point, I don't think it's productive for me to flail around guessing at the underlying problem, although I would be happy to run an instrumented build (perhaps with additional metrics for upstreamClosed vs downstreamClosed events detected by the wasm context?).
Thanks to @bianpengyuan's suggestion, I do have a decent workaround of disabling the tcp stats plugin, but I hope to see this issue resolved, as it seems to point to a fundamental mismatch in how connection lifecycle events are handled in the wasm plugin.
Hi @bianpengyuan, let me try that later on, as I am currently running an experiment disabling HTTP keepalive; I will report back once I've obtained data from disabling the tcp-stats-filter.