spin: High memory consumption when running a Spin application from Bindle
For Spin applications with a large number of components and a few hundred static assets, we have seen significantly higher memory allocation when running from Bindle compared to running locally. This is a meta-issue that attempts to track down the root causes of that.
TL;DR: there are three main causes of increased memory consumption when running a Spin application directly from Bindle (in no particular order):
- The Bindle loader attempting to pull all static assets in parallel.
- Expensive clones of core components (and applications).
- Not caching compiled Wasm modules (applies to both modes of running Spin).
- The Bindle loader attempting to pull all static assets in parallel
Each new Tokio thread allocates 2 MiB of stack memory by default.
When preparing a component from Bindle (i.e. preparing its assets), all components are handled in parallel, and within each component all parcels are again handled in parallel.
This results in N Tokio threads (where N is the total number of static assets in the application), so a spike of roughly N × 2 MiB of memory when pulling the static assets. This memory is allocated by default but, as far as I can see, not necessarily used.
(This also results in errors because of too many open files / connections on multiple OSes.)
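A possible mitigation (only a sketch, not how the loader is currently written) is to bound how many parcels are fetched at once, e.g. with `buffer_unordered` from the `futures` crate; `fetch_parcel` and the limit passed as `max_in_flight` below are hypothetical placeholders:

```rust
use anyhow::Result;
use futures::{stream, StreamExt, TryStreamExt};

/// Hypothetical stand-in for whatever downloads one parcel and writes it
/// to its destination on disk.
async fn fetch_parcel(parcel_id: String) -> Result<()> {
    let _ = parcel_id; // placeholder
    Ok(())
}

/// Pull all parcels with at most `max_in_flight` requests at a time,
/// instead of one task (and one 2 MiB thread for blocking DNS) per asset.
async fn fetch_all(parcel_ids: Vec<String>, max_in_flight: usize) -> Result<()> {
    stream::iter(parcel_ids)
        .map(fetch_parcel)
        .buffer_unordered(max_in_flight)
        .try_collect::<Vec<()>>()
        .await?;
    Ok(())
}
```

A modest limit (for example 16 or 32) would keep the memory spike and the number of open connections bounded regardless of how many assets the application has.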
Attached is the stack trace for preparing the application (there are ~200 Tokio threads that I can see spawned to handle static assets for Fermyon.com, which seems to roughly match the number of assets we have):
std::sys::unix::thread::Thread::new::h56395654af139df9
std::thread::Builder::spawn::hcf8ef13595f288d8
tokio::runtime::blocking::pool::Spawner::spawn::he31007656ba14ed0
tokio::runtime::handle::Handle::spawn_blocking::hb7a1d001cd691f95
_$LT$hyper..client..connect..dns..GaiResolver$u20$as$u20$tower_service..Service$LT$hyper..client..connect..dns..Name$GT$$GT$::call::h85c1e5c145e062be
bindle::client::Client$LT$T$GT$::get_parcel_request::_$u7b$$u7b$closure$u7d$$u7d$::hd2dbfcf845ca9e75
bindle::client::Client$LT$T$GT$::get_parcel::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hba859fa9b43a059e
bindle::client::Client$LT$T$GT$::get_parcel::_$u7b$$u7b$closure$u7d$$u7d$::h511ae0d5b4c503a9
spin_loader::bindle::utils::BindleReader::get_parcel::_$u7b$$u7b$closure$u7d$$u7d$::hb5541ecd08ef080e
spin_loader::bindle::assets::Copier::copy::_$u7b$$u7b$closure$u7d$$u7d$::hb02becfe90313a13
futures_util::stream::stream::StreamExt::poll_next_unpin::h6a1590c214a1659c
spin_loader::bindle::assets::Copier::copy_all::_$u7b$$u7b$closure$u7d$$u7d$::hde87af7685d826bd
spin_loader::bindle::assets::Copier::prepare::_$u7b$$u7b$closure$u7d$$u7d$::hdbb711c679042dea
spin_loader::bindle::assets::prepare_component::_$u7b$$u7b$closure$u7d$$u7d$::h07cd2dcbded2b56c
spin_loader::bindle::core::_$u7b$$u7b$closure$u7d$$u7d$::hb3927d3d00f58690
spin_loader::bindle::prepare::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h88ad1894adcc20dd
spin_loader::bindle::prepare::_$u7b$$u7b$closure$u7d$$u7d$::h0c451f7f4ce8ba1b
spin_loader::bindle::from_bindle::_$u7b$$u7b$closure$u7d$$u7d$::h1602d377e64702b6
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
main
start
The impact of the total stack allocations alone when running the same application from Bindle vs. running locally: ~430 MiB vs. ~190 MiB (these are total allocations, not the actual memory in use).
- Expensive clones of core components (and applications)
Another area with a significant memory difference when running from Bindle vs. running locally is the total heap allocation: ~32 MiB vs. ~1 MiB. That seems to come entirely from cloning either `CoreComponent` or `Application`. The difference when running from Bindle is that the `ModuleSource` in each component actually contains the bytes of the module, so cloning it is expensive.
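One possible mitigation (a sketch only; the variants below are a simplified stand-in, not the actual `spin_manifest::ModuleSource` definition) is to keep the module bytes behind an `Arc`, so cloning a component copies a pointer rather than the whole Wasm binary:

```rust
use std::path::PathBuf;
use std::sync::Arc;

/// Hypothetical simplified stand-in for spin_manifest::ModuleSource.
/// Keeping the Bindle-sourced bytes behind an Arc makes Clone a
/// reference-count bump instead of a copy of the whole Wasm binary
/// every time a CoreComponent or Application is cloned.
#[derive(Clone)]
enum ModuleSource {
    /// Module referenced by a local file path; already cheap to clone.
    FileReference(PathBuf),
    /// Module bytes pulled from Bindle, shared rather than duplicated.
    Buffer(Arc<Vec<u8>>, String),
}
```

Alternatively, the builder and trigger could take the components by reference (or the whole `Application` behind an `Arc`) instead of cloning them.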
Stack traces for the 3 heap allocations of each Wasm module source when running from Bindle. All three allocations happen right at startup:
- first, this is related to cloning the module source (when running from Bindle, a core component contains the actual bytes of the module source, so cloning it becomes expensive) (https://github.com/fermyon/spin/blob/469925b356758870705b0b07b02682491fcfdfad/crates/engine/src/lib.rs#L135-L136):
alloc::alloc::alloc::h541a9e04cf5fdc0c
alloc::alloc::Global::alloc_impl::h4c4d9c70eb057e52
_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::hfd91f8b7cc4a5842
alloc::raw_vec::RawVec$LT$T$C$A$GT$::allocate_in::h79767559d8cf9dbb
alloc::raw_vec::RawVec$LT$T$C$A$GT$::with_capacity_in::h4fb2d6c0fe1c85e7
alloc::vec::Vec$LT$T$C$A$GT$::with_capacity_in::h864a9d0ff8c13988
_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$::to_vec::h2df6bf7f0e171ea5
alloc::slice::hack::to_vec::h9956846035169e8a
alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec_in::h62d3afebc7e96389
_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h9a341bc7552379fc
_$LT$spin_manifest..ModuleSource$u20$as$u20$core..clone..Clone$GT$::clone::ha7acc08eafa20952
_$LT$spin_manifest..CoreComponent$u20$as$u20$core..clone..Clone$GT$::clone::h78c24d9a6327f395
spin_engine::Builder$LT$T$GT$::build::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hbf768ace320551c2
spin_engine::Builder$LT$T$GT$::build::_$u7b$$u7b$closure$u7d$$u7d$::hb20fb4e80940b7a6
spin_http_engine::HttpTrigger::new::_$u7b$$u7b$closure$u7d$$u7d$::hf23f1f5ed4b98c20
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
core::ops::function::FnOnce::call_once::h945d8dbe5199763c
std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9a74031b838d6d5d
std::rt::lang_start_internal::h358b6d58e23c88c7
main
start
- second, this is because `Router` owns a map of core components, and when pulling from Bindle each core component contains the module bytes, so the router owning the data results in cloning the core components (https://github.com/fermyon/spin/blob/main/crates/http/src/routes.rs#40); one way to avoid this clone is sketched after these stack traces:
_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::fold::hdfa911e275493235
_$LT$indexmap..map..IndexMap$LT$K$C$V$C$S$GT$$u20$as$u20$core..iter..traits..collect..FromIterator$LT$$LP$K$C$V$RP$$GT$$GT$::from_iter::hcd3735063c1d5516
spin_http_engine::routes::Router::build::h0d556652e880cdef
spin_http_engine::HttpTrigger::new::_$u7b$$u7b$closure$u7d$$u7d$::hf23f1f5ed4b98c20
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
core::ops::function::FnOnce::call_once::h945d8dbe5199763c
std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9a74031b838d6d5d
std::rt::lang_start_internal::h358b6d58e23c88c7
main
start
- finally, same issue with cloning the module bytes, this time right before starting the trigger, when preparing the context builder (https://github.com/fermyon/spin/blob/main/src/commands/up.rs#L140):
alloc::alloc::alloc::h541a9e04cf5fdc0c
alloc::alloc::Global::alloc_impl::h4c4d9c70eb057e52
_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::hfd91f8b7cc4a5842
alloc::raw_vec::RawVec$LT$T$C$A$GT$::allocate_in::h79767559d8cf9dbb
alloc::raw_vec::RawVec$LT$T$C$A$GT$::with_capacity_in::h4fb2d6c0fe1c85e7
alloc::vec::Vec$LT$T$C$A$GT$::with_capacity_in::h864a9d0ff8c13988
_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$::to_vec::h2df6bf7f0e171ea5
alloc::slice::hack::to_vec::h9956846035169e8a
alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec_in::h62d3afebc7e96389
_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h9a341bc7552379fc
_$LT$spin_manifest..ModuleSource$u20$as$u20$core..clone..Clone$GT$::clone::ha7acc08eafa20952
_$LT$spin_manifest..CoreComponent$u20$as$u20$core..clone..Clone$GT$::clone::h78c24d9a6327f395
_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$::to_vec::h4d0ef174b9ed8022
alloc::slice::hack::to_vec::he6b0a08f94435106
alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec_in::h63a19299e1e4be18
_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h07f481c3a6d87aae
_$LT$spin_manifest..Application$LT$T$GT$$u20$as$u20$core..clone..Clone$GT$::clone::haf4884707ed35218
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
core::ops::function::FnOnce::call_once::h945d8dbe5199763c
std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9a74031b838d6d5d
std::rt::lang_start_internal::h358b6d58e23c88c7
main
start
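One way to avoid the clone in `Router::build` (again only a sketch; the real `Router` and route types in `spin_http_engine` differ) is for the route table to store just the route pattern and a component identifier, and look the component up in shared state at request time instead of owning a copy of it:

```rust
use indexmap::IndexMap;

/// Hypothetical route table holding only what routing needs: the raw
/// route pattern mapped to a component id, rather than a clone of the
/// whole CoreComponent (and with it the Wasm module bytes).
struct Router {
    routes: IndexMap<String, String>,
}

impl Router {
    /// Build the table from (route, component id) pairs; nothing large
    /// is copied here.
    fn build<'a>(routes: impl IntoIterator<Item = (&'a str, &'a str)>) -> Self {
        Self {
            routes: routes
                .into_iter()
                .map(|(pattern, id)| (pattern.to_owned(), id.to_owned()))
                .collect(),
        }
    }
}
```

At request time, the matched component id can then be used to fetch the component (or its prepared instance) from shared application state.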
- Not caching compiled Wasm modules (applies to both modes of running Spin)
Not caching compiled modules results in an initial spike in memory (and latency for “cold starts”), both when running locally and when running from Bindle. The memory and latency impact from caching has not yet been measured.
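If we do add caching, Wasmtime already ships an on-disk compilation cache that could be enabled roughly like this (a sketch; the exact API and configuration file location depend on the Wasmtime version in use):

```rust
use anyhow::Result;
use wasmtime::{Config, Engine};

/// Build an Engine with Wasmtime's on-disk compilation cache enabled,
/// so repeated runs of the same modules skip recompilation at startup.
fn engine_with_cache() -> Result<Engine> {
    let mut config = Config::new();
    // Load Wasmtime's default cache configuration file if present
    // (the method name has varied across Wasmtime versions).
    config.cache_config_load_default()?;
    Engine::new(&config)
}
```

This would primarily help the "cold start" latency and the initial memory spike; it does not change steady-state memory use.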
If anyone wants to explore the memory profiling, I can share the two files that can be opened with Xcode's Instruments (generated with `cargo-instruments`), but they can also be generated by running the CLI repeatedly with different app sources and sending load for 10 seconds:
$ cargo instruments -t Allocations --release --time-limit 10000 --open -- up --server http://localhost:8080/v1 --bindle your-app/0.1.0
Once we discuss potential solutions, we can start opening individual issues and PRs.
About this issue
- State: closed
- Created 2 years ago
- Comments: 22 (22 by maintainers)
Commits related to this issue
- perf: reduce peak memory usage and open files when loading bindle This patch does two things: 1. Use `bindle::client::Client::get_parcel_stream` instead of `get_parcel`. The latter loads the whole ... — committed to dicej/spin by dicej 2 years ago
`get_parcel_stream` is working nicely – will make a PR soon.

I'm thinking we should limit concurrency anyway, just to be kind to the Bindle server. That should also help address the "too many open files" issue mentioned here.
Ah, apologies for misunderstanding.
It does feel like copies could be streamed from HTTP to the destination file without being fully loaded into memory. That would probably need an update to the Bindle client library, but that would be no bad thing.
(this is not to rule out capping concurrency instead/as well)
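For illustration, a streaming copy could look roughly like the following (a sketch that assumes the client hands back a stream of `Bytes` chunks; the function and parameter names are illustrative, not the actual Bindle client API):

```rust
use anyhow::Result;
use bytes::Bytes;
use futures::{Stream, StreamExt};
use tokio::{fs::File, io::AsyncWriteExt};

/// Write a parcel to `dest` chunk by chunk, so peak memory per asset is
/// one chunk rather than the whole (possibly multi-megabyte) file.
async fn copy_streaming(
    mut chunks: impl Stream<Item = Result<Bytes>> + Unpin,
    dest: &std::path::Path,
) -> Result<()> {
    let mut file = File::create(dest).await?;
    while let Some(chunk) = chunks.next().await {
        file.write_all(&chunk?).await?;
    }
    file.flush().await?;
    Ok(())
}
```

Combined with a concurrency cap, this would bound peak memory during asset preparation to roughly (chunk size × in-flight downloads).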
Quick update: I've updated my test to make sure the HTTP trigger component actually reads all 200 assets each time it is invoked. I've also run `siege` against it for a few minutes, and memory never exceeded 12 MB (for assets of about 9 KB each).

Then I tried increasing the size of each asset to 1 MB and hit it with siege for a minute. The memory was 205 MB that time, peaking within a second of startup (i.e. memory peaked before siege even started hitting it). Most of that was from:

200+ MB does seem unnecessarily high. The innermost functions in that stack trace appear to be compiler-generated Futures from async blocks and/or functions. I'm going to try again with a debug build to see if I can get some line numbers.