istio: Potential memory leak from WASM NullPlugin onNetworkNewConnection
Bug description Istio-proxy is consuming an increasing amount of memory for an application with moderate HTTP traffic (10-20 qps).
[ ] Configuration Infrastructure [ ] Docs [ ] Installation [ ] Networking [x] Performance and Scalability [ ] Policies and Telemetry [ ] Security [ ] Test and Release [ ] User Experience [ ] Developer Infrastructure
Expected behavior Istio-proxy memory should stay relatively constant over time.
Steps to reproduce the bug Here is a heap profile flame graph after the service has been up for 12+ hours:
The flame graph seems to indicate that the heap has been accumulating blocks allocated from operator() calls through the NullPlugin, although I am at a bit of a loss to figure out a workaround from reading the relevant source code.
In case it’s helpful, here is the text version of the heap profile:
File: envoy
Type: inuse_space
Showing nodes accounting for 33124.62kB, 98.34% of 33682.43kB total
Dropped 482 nodes (cum <= 168.41kB)
      flat  flat%   sum%        cum   cum%
27546.82kB 81.78% 81.78% 33411.07kB 99.19%  std::__1::__function::__func::operator()
 1816.88kB  5.39% 87.18%  2421.37kB  7.19%  std::__1::__hash_table::__emplace_unique_key_args
 1449.59kB  4.30% 91.48%  2052.53kB  6.09%  std::__1::__hash_table::__emplace_unique_impl
 1206.61kB  3.58% 95.06%  1206.61kB  3.58%  std::__1::__hash_table::__rehash
  819.49kB  2.43% 97.50%   819.49kB  2.43%  OPENSSL_malloc
  246.34kB  0.73% 98.23%   246.34kB  0.73%  std::__1::basic_string::basic_string
   28.98kB 0.086% 98.31%  1025.07kB  3.04%  Envoy::Extensions::TransportSockets::Tls::ContextManagerImpl::createSslClientContext
    7.83kB 0.023% 98.34% 32049.05kB 95.15%  Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
    2.06kB 0.0061% 98.34% 32053.88kB 95.16%  Envoy::Network::ListenerImpl::listenCallback
         0     0% 98.34%   613.34kB  1.82%  ASN1_item_d2i
         0     0% 98.34%  1439.37kB  4.27%  Envoy::Config::GrpcMuxImpl::onDiscoveryResponse
         0     0% 98.34%  1359.13kB  4.04%  Envoy::Config::GrpcSubscriptionImpl::onConfigUpdate
         0     0% 98.34%  1144.74kB  3.40%  Envoy::Event::FileEventImpl::assignEvents(unsigned int, event_base*)::$_0::__invoke
         0     0% 98.34% 29594.81kB 87.86%  Envoy::Extensions::Common::Wasm::Context::onCreate
         0     0% 98.34% 32009.40kB 95.03%  Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
         0     0% 98.34% 29594.81kB 87.86%  Envoy::Extensions::Common::Wasm::Null::NullPlugin::ensureContext
         0     0% 98.34%  2414.76kB  7.17%  Envoy::Extensions::Common::Wasm::Null::Plugin::Stats::PluginContext::onNewConnection
         0     0% 98.34%   992.35kB  2.95%  Envoy::Extensions::TransportSockets::Tls::ClientContextImpl::ClientContextImpl
         0     0% 98.34%  1025.07kB  3.04%  Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::onAddOrUpdateSecret
         0     0% 98.34%  1081.84kB  3.21%  Envoy::Extensions::TransportSockets::Tls::ContextImpl::ContextImpl
         0     0% 98.34%  1433.78kB  4.26%  Envoy::Grpc::AsyncStreamCallbacks::onReceiveMessageRaw
         0     0% 98.34%  1430.18kB  4.25%  Envoy::Grpc::AsyncStreamImpl::onData
         0     0% 98.34%  1427.35kB  4.24%  Envoy::Http::AsyncStreamImpl::encodeData
         0     0% 98.34%    1223kB   3.63%  Envoy::Http::CodecClient::CodecReadFilter::onData
         0     0% 98.34%  1277.10kB  3.79%  Envoy::Http::CodecClient::onData
         0     0% 98.34%  1293.62kB  3.84%  Envoy::Http::Http2::ConnectionImpl::dispatch
         0     0% 98.34%  1402.62kB  4.16%  Envoy::Http::Http2::ConnectionImpl::onFrameReceived
         0     0% 98.34%  1409.21kB  4.18%  Envoy::Http::ResponseDecoderWrapper::decodeData
         0     0% 98.34%   753.30kB  2.24%  Envoy::MainCommonBase::run
         0     0% 98.34%  1178.06kB  3.50%  Envoy::Network::ConnectionImpl::onFileEvent
         0     0% 98.34%  1219.07kB  3.62%  Envoy::Network::ConnectionImpl::onReadReady
         0     0% 98.34% 32014.61kB 95.05%  Envoy::Network::FilterManagerImpl::initializeReadFilters
         0     0% 98.34% 33233.24kB 98.67%  Envoy::Network::FilterManagerImpl::onContinueReading
         0     0% 98.34%  1420.46kB  4.22%  Envoy::Router::UpstreamRequest::decodeData
         0     0% 98.34%  1359.39kB  4.04%  Envoy::Secret::SdsApi::onConfigUpdate
         0     0% 98.34% 32050.53kB 95.16%  Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::onAcceptWorker
         0     0% 98.34% 32050.53kB 95.16%  Envoy::Server::ConnectionHandlerImpl::ActiveTcpSocket::continueFilterChain
         0     0% 98.34%   886.31kB  2.63%  Envoy::Server::InstanceImpl::run
         0     0% 98.34% 32102.04kB 95.31%  Envoy::Server::WorkerImpl::threadRoutine
         0     0% 98.34%   220.22kB  0.65%  Envoy::Ssl::TlsCertificateConfigImpl::TlsCertificateConfigImpl
         0     0% 98.34% 32101.35kB 95.31%  Envoy::Thread::ThreadImplPosix::ThreadImplPosix(std::__1::function)::$_0::__invoke
         0     0% 98.34%   403.05kB  1.20%  PEM_ASN1_read_bio
         0     0% 98.34%   195.30kB  0.58%  PEM_X509_INFO_read_bio
         0     0% 98.34%   571.81kB  1.70%  [libc-2.27.so]
         0     0% 98.34%   223.61kB  0.66%  asn1_d2i_ex_primitive
         0     0% 98.34%   223.61kB  0.66%  asn1_ex_c2i
         0     0% 98.34%   175.63kB  0.52%  asn1_item_ex_combine_new
         0     0% 98.34%   613.34kB  1.82%  asn1_item_ex_d2i
         0     0% 98.34%   454.71kB  1.35%  asn1_template_ex_d2i
         0     0% 98.34%   454.71kB  1.35%  asn1_template_noexp_d2i
         0     0% 98.34%   207.76kB  0.62%  d2i_X509_AUX
         0     0% 98.34% 33099.96kB 98.27%  event_base_loop
         0     0% 98.34% 33183.10kB 98.52%  event_process_active_single_queue
         0     0% 98.34% 32053.88kB 95.16%  listener_read_cb
         0     0% 98.34%   675.96kB  2.01%  main
         0     0% 98.34%  1326.59kB  3.94%  nghttp2_session_mem_recv
         0     0% 98.34%  1401.21kB  4.16%  nghttp2_session_on_data_received
         0     0% 98.34%   195.30kB  0.58%  parse_x509
         0     0% 98.34% 32100.56kB 95.30%  start_thread
         0     0% 98.34%   602.94kB  1.79%  std::__1::__hash_table::__node_insert_unique_prepare
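To make the profile above easier to read, one can rank the entries by cumulative size; a small, assumption-free sketch (not part of the original report) that parses a few sample lines from a `pprof -text` in-use heap profile and sorts them, which makes the dominance of the newConnection/onNetworkNewConnection path obvious:

```python
# Rank pprof text-report entries by cumulative allocation size.
# The sample lines are copied from the profile above.
PROFILE = """\
7.83kB 0.023% 98.34% 32049.05kB 95.15% Envoy::Server::ConnectionHandlerImpl::ActiveTcpListener::newConnection
0 0% 98.34% 32009.40kB 95.03% Envoy::Extensions::Common::Wasm::Context::onNetworkNewConnection
0 0% 98.34% 29594.81kB 87.86% Envoy::Extensions::Common::Wasm::Context::onCreate
0 0% 98.34% 819.49kB 2.43% OPENSSL_malloc
"""

def kb(field: str) -> float:
    """Parse a pprof size field such as '32049.05kB' or a bare '0'."""
    return float(field.rstrip("kB") or 0)

def top_cumulative(text: str):
    """Return (cum_kB, symbol) pairs sorted largest-first."""
    rows = []
    for line in text.splitlines():
        flat, flat_pct, sum_pct, cum, cum_pct, name = line.split(maxsplit=5)
        rows.append((kb(cum), name))
    return sorted(rows, reverse=True)

for cum_kb, name in top_cumulative(PROFILE):
    print(f"{cum_kb:10.2f}kB  {name}")
```

Sorting by the `cum` column rather than `flat` is what surfaces the call path: the bytes are allocated inside `operator()` frames, but they are retained via the connection-creation path.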
Version (include the output of istioctl version --remote, kubectl version, and helm version if you used Helm)
$ istioctl version --remote
client version: 1.6.2
control plane version: 1.6.2
data plane version: 1.6.2 (20 proxies)
While I was only able to dig into this after the upgrade to 1.6.2, I have observed the increased memory consumption since 1.5.2.
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:52:00Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-eks-e16311", GitCommit:"e163110a04dcb2f39c3325af96d019b4925419eb", GitTreeState:"clean", BuildDate:"2020-03-27T22:37:12Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
How was Istio installed?
istioctl manifest apply -f/istioctl upgrade -f
Environment where bug was observed (cloud vendor, OS, etc)
AWS EKS 1.6 control plane and worker nodes
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 41 (19 by maintainers)
I believe I have run into the same leak. On our most severely affected pod, the heap pprof flamegraph looks more or less identical to the image in the initial report above. Deleting the stats filter as suggested releases the "leaked" memory, as shown in this screenshot (it's only over a 5-minute range, so you can't see the growth of the leak very well).
It is interesting to note that we only have outbound stats enabled, and sure enough the services running in our cluster that frequently open many outbound connections (e.g. non-pooled connections to the k8s API servers and various database flavors) leak much, much faster than those utilizing HTTP keep-alives and DB connection pools. This aligns with the heap flamegraph showing high allocations in the newConnection path. Services with high inbound connection open rates do not appear to leak; I think we've lucked out here a bit since our inbound open rate is much higher than our outbound one : ).

The memory leak is slow, about 2-5 MB/hour depending on the cluster (dev, test, or production). A quick, rough look via tcpdump shows a connection open rate of ~2 connections/sec.
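Those two measurements can be cross-checked with a back-of-the-envelope calculation (my own sketch, not from the issue): if the leak comes from per-connection state that is created on each new connection and never freed, the leak rate divided by the connection open rate estimates the bytes retained per connection.

```python
# Estimate bytes leaked per connection from the observed numbers above.
# Assumption: the leak scales linearly with connections opened.
leak_rates_mb_per_hour = (2.0, 5.0)  # observed range across clusters
opens_per_second = 2.0               # rough tcpdump measurement

for mb_per_hour in leak_rates_mb_per_hour:
    bytes_per_conn = mb_per_hour * 1024 * 1024 / (opens_per_second * 3600)
    print(f"{mb_per_hour} MB/h at {opens_per_second} conn/s -> "
          f"~{bytes_per_conn:.0f} bytes per connection")
```

A few hundred bytes per connection is plausible for a small retained context object plus hash-table bookkeeping, which is consistent with the `__hash_table` and WASM context frames in the heap profile.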
Happy to provide further traces, tests, etc. to get to the root cause.
Yes, this has been fixed in 1.8.1+. It will also be fixed in 1.7.next.
Yes, it will be in the next 1.7 release.
FWIW, I deployed 1.8.1 to my environments and I no longer observe the original memory leak (with TCP metrics envoy filter enabled). But I guess @kyessenov, @PiotrSikora, and/or @bianpengyuan can give the more official word on this.
@Kyslik Sorry, I meant we did not catch the last 1.6 release. At best we can backport this to Istio 1.7.
@Kyslik 1.6 has just shipped its last release. @kyessenov do you think it is possible to backport to 1.7?
@kyessenov for the service we're seeing most affected by this issue, the opened_total is slowly but continuously growing (60,000), whereas the closed_total is nearly constant and close to zero (< 10).
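For anyone wanting to reproduce this comparison, the counters can be read from the proxy's Prometheus endpoint (typically `localhost:15090/stats/prometheus` on an istio-proxy sidecar) and diffed. A minimal sketch, where the sample payload and label set are illustrative rather than copied from a real proxy:

```python
# Compare Istio's TCP connection counters from a Prometheus text exposition.
# SAMPLE stands in for the output of the proxy's /stats/prometheus endpoint.
SAMPLE = """\
istio_tcp_connections_opened_total{reporter="source"} 60000
istio_tcp_connections_closed_total{reporter="source"} 8
"""

def counter(text: str, name: str) -> float:
    """Return the value of the first sample whose name matches `name`."""
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(None, 1)[-1])
    raise KeyError(name)

opened = counter(SAMPLE, "istio_tcp_connections_opened_total")
closed = counter(SAMPLE, "istio_tcp_connections_closed_total")
print(f"opened={opened:.0f} closed={closed:.0f} "
      f"never-closed={opened - closed:.0f}")
```

A large and growing opened-minus-closed gap is exactly the signature described above: connection contexts are being created on open but their close events never fire.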
I managed to disable HTTP keepalive on the server side, but the leak is still occurring: the tcp connections opened metric grows much larger than the tcp connections closed metric over time. At this point, I don't think it's productive for me to flail around guessing at the underlying problem, although I would be happy to run an instrumented build (perhaps with additional metrics for upstreamClosed vs downstreamClosed events detected by the wasm context?).
Thanks to @bianpengyuan's suggestion, I do have a decent workaround of disabling the tcp stats plugin, but I hope to see this issue resolved, as it seems to point to a fundamental mismatch in how connection lifecycle events are handled in the wasm plugin.
Hi @bianpengyuan, let me try that later on, as I am currently running an experiment disabling HTTP keepalive; I will report back once I've obtained data from disabling the tcp-stats-filter.