spin: High memory consumption when running a Spin application from Bindle
For Spin applications with a large number of components and a few hundred static assets, we have seen significantly higher memory allocation when running from Bindle compared to running locally. This is a meta-issue that attempts to track down the root causes of that.
TL;DR: there are three main causes of increased memory consumption when running a Spin application directly from Bindle (in no particular order):
- The Bindle loader attempting to pull all static assets in parallel.
- Expensive clones of core components (and applications).
- Not caching compiled Wasm modules (applies to both modes of running Spin).
- The Bindle loader attempting to pull all static assets in parallel
Each new Tokio thread allocates 2 MiB of stack memory by default.
When preparing a component from Bindle (i.e. preparing its assets), all components are handled in parallel, and within each component all parcels are again handled in parallel.
This results in N Tokio threads (where N is the total number of static assets in the application), so a spike of roughly N × 2 MiB of memory when pulling the static assets. This memory is allocated by default but, as far as I can see, not necessarily used.
(This also results in errors because of too many open files / connections on multiple OSes.)
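A possible mitigation (only a sketch, not how the loader is currently written) is to bound how many parcels are fetched at once, e.g. with `buffer_unordered` from the `futures` crate; `fetch_parcel` and the limit passed as `max_in_flight` below are hypothetical placeholders:

```rust
use anyhow::Result;
use futures::{stream, StreamExt, TryStreamExt};

/// Hypothetical stand-in for whatever downloads one parcel and writes it
/// to its destination on disk.
async fn fetch_parcel(parcel_id: String) -> Result<()> {
    let _ = parcel_id; // placeholder
    Ok(())
}

/// Pull all parcels with at most `max_in_flight` requests at a time,
/// instead of one task (and one 2 MiB thread for blocking DNS) per asset.
async fn fetch_all(parcel_ids: Vec<String>, max_in_flight: usize) -> Result<()> {
    stream::iter(parcel_ids)
        .map(fetch_parcel)
        .buffer_unordered(max_in_flight)
        .try_collect::<Vec<()>>()
        .await?;
    Ok(())
}
```

A modest limit (for example 16 or 32) would keep the memory spike and the number of open connections bounded regardless of how many assets the application has.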
Attached is the stack trace for preparing the application (there are ~200 Tokio threads that I can see spawned to handle static assets for Fermyon.com, which seems to roughly match the number of assets we have):
std::sys::unix::thread::Thread::new::h56395654af139df9
std::thread::Builder::spawn::hcf8ef13595f288d8
tokio::runtime::blocking::pool::Spawner::spawn::he31007656ba14ed0
tokio::runtime::handle::Handle::spawn_blocking::hb7a1d001cd691f95
_$LT$hyper..client..connect..dns..GaiResolver$u20$as$u20$tower_service..Service$LT$hyper..client..connect..dns..Name$GT$$GT$::call::h85c1e5c145e062be
bindle::client::Client$LT$T$GT$::get_parcel_request::_$u7b$$u7b$closure$u7d$$u7d$::hd2dbfcf845ca9e75
bindle::client::Client$LT$T$GT$::get_parcel::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hba859fa9b43a059e
bindle::client::Client$LT$T$GT$::get_parcel::_$u7b$$u7b$closure$u7d$$u7d$::h511ae0d5b4c503a9
spin_loader::bindle::utils::BindleReader::get_parcel::_$u7b$$u7b$closure$u7d$$u7d$::hb5541ecd08ef080e
spin_loader::bindle::assets::Copier::copy::_$u7b$$u7b$closure$u7d$$u7d$::hb02becfe90313a13
futures_util::stream::stream::StreamExt::poll_next_unpin::h6a1590c214a1659c
spin_loader::bindle::assets::Copier::copy_all::_$u7b$$u7b$closure$u7d$$u7d$::hde87af7685d826bd
spin_loader::bindle::assets::Copier::prepare::_$u7b$$u7b$closure$u7d$$u7d$::hdbb711c679042dea
spin_loader::bindle::assets::prepare_component::_$u7b$$u7b$closure$u7d$$u7d$::h07cd2dcbded2b56c
spin_loader::bindle::core::_$u7b$$u7b$closure$u7d$$u7d$::hb3927d3d00f58690
spin_loader::bindle::prepare::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h88ad1894adcc20dd
spin_loader::bindle::prepare::_$u7b$$u7b$closure$u7d$$u7d$::h0c451f7f4ce8ba1b
spin_loader::bindle::from_bindle::_$u7b$$u7b$closure$u7d$$u7d$::h1602d377e64702b6
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
main
start
The impact of the total stack allocations alone when running the same application from Bindle vs. running locally: ~430 MiB vs. ~190 MiB (these are total allocations, not the actual memory in use).
- Expensive clones of core components (and applications)
Another area with a significant memory difference when running from Bindle vs. running locally is the total heap allocation: ~32 MiB vs. ~1 MiB. That seems to come entirely from cloning either `CoreComponent` or `Application`. The difference when running from Bindle is that the `ModuleSource` in each component actually contains the bytes of the module, so cloning it is expensive.
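One possible mitigation (a sketch only; the variants below are a simplified stand-in, not the actual `spin_manifest::ModuleSource` definition) is to keep the module bytes behind an `Arc`, so cloning a component copies a pointer rather than the whole Wasm binary:

```rust
use std::path::PathBuf;
use std::sync::Arc;

/// Hypothetical simplified stand-in for spin_manifest::ModuleSource.
/// Keeping the Bindle-sourced bytes behind an Arc makes Clone a
/// reference-count bump instead of a copy of the whole Wasm binary
/// every time a CoreComponent or Application is cloned.
#[derive(Clone)]
enum ModuleSource {
    /// Module referenced by a local file path; already cheap to clone.
    FileReference(PathBuf),
    /// Module bytes pulled from Bindle, shared rather than duplicated.
    Buffer(Arc<Vec<u8>>, String),
}
```

Alternatively, the builder and trigger could take the components by reference (or the whole `Application` behind an `Arc`) instead of cloning them.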
Stack traces for the 3 heap allocations of each Wasm module source when running from Bindle. All three allocations happen right at startup:
- first, this is related to cloning the module source (when running from Bindle, a core component contains the actual bytes of the module source, so cloning it becomes expensive) (https://github.com/fermyon/spin/blob/469925b356758870705b0b07b02682491fcfdfad/crates/engine/src/lib.rs#L135-L136):
alloc::alloc::alloc::h541a9e04cf5fdc0c
alloc::alloc::Global::alloc_impl::h4c4d9c70eb057e52
_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::hfd91f8b7cc4a5842
alloc::raw_vec::RawVec$LT$T$C$A$GT$::allocate_in::h79767559d8cf9dbb
alloc::raw_vec::RawVec$LT$T$C$A$GT$::with_capacity_in::h4fb2d6c0fe1c85e7
alloc::vec::Vec$LT$T$C$A$GT$::with_capacity_in::h864a9d0ff8c13988
_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$::to_vec::h2df6bf7f0e171ea5
alloc::slice::hack::to_vec::h9956846035169e8a
alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec_in::h62d3afebc7e96389
_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h9a341bc7552379fc
_$LT$spin_manifest..ModuleSource$u20$as$u20$core..clone..Clone$GT$::clone::ha7acc08eafa20952
_$LT$spin_manifest..CoreComponent$u20$as$u20$core..clone..Clone$GT$::clone::h78c24d9a6327f395
spin_engine::Builder$LT$T$GT$::build::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hbf768ace320551c2
spin_engine::Builder$LT$T$GT$::build::_$u7b$$u7b$closure$u7d$$u7d$::hb20fb4e80940b7a6
spin_http_engine::HttpTrigger::new::_$u7b$$u7b$closure$u7d$$u7d$::hf23f1f5ed4b98c20
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
core::ops::function::FnOnce::call_once::h945d8dbe5199763c
std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9a74031b838d6d5d
std::rt::lang_start_internal::h358b6d58e23c88c7
main
start
- second, this is because `Router` owns a map of core components, and when pulling from Bindle each core component contains the module bytes, so the router owning the data results in cloning the core components (https://github.com/fermyon/spin/blob/main/crates/http/src/routes.rs#40); one way to avoid this clone is sketched after these stack traces:
_$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::fold::hdfa911e275493235
_$LT$indexmap..map..IndexMap$LT$K$C$V$C$S$GT$$u20$as$u20$core..iter..traits..collect..FromIterator$LT$$LP$K$C$V$RP$$GT$$GT$::from_iter::hcd3735063c1d5516
spin_http_engine::routes::Router::build::h0d556652e880cdef
spin_http_engine::HttpTrigger::new::_$u7b$$u7b$closure$u7d$$u7d$::hf23f1f5ed4b98c20
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
core::ops::function::FnOnce::call_once::h945d8dbe5199763c
std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9a74031b838d6d5d
std::rt::lang_start_internal::h358b6d58e23c88c7
main
start
- finally, same issue with cloning the module bytes, this time right before starting the trigger, when preparing the context builder (https://github.com/fermyon/spin/blob/main/src/commands/up.rs#L140):
alloc::alloc::alloc::h541a9e04cf5fdc0c
alloc::alloc::Global::alloc_impl::h4c4d9c70eb057e52
_$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::hfd91f8b7cc4a5842
alloc::raw_vec::RawVec$LT$T$C$A$GT$::allocate_in::h79767559d8cf9dbb
alloc::raw_vec::RawVec$LT$T$C$A$GT$::with_capacity_in::h4fb2d6c0fe1c85e7
alloc::vec::Vec$LT$T$C$A$GT$::with_capacity_in::h864a9d0ff8c13988
_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$::to_vec::h2df6bf7f0e171ea5
alloc::slice::hack::to_vec::h9956846035169e8a
alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec_in::h62d3afebc7e96389
_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h9a341bc7552379fc
_$LT$spin_manifest..ModuleSource$u20$as$u20$core..clone..Clone$GT$::clone::ha7acc08eafa20952
_$LT$spin_manifest..CoreComponent$u20$as$u20$core..clone..Clone$GT$::clone::h78c24d9a6327f395
_$LT$T$u20$as$u20$alloc..slice..hack..ConvertVec$GT$::to_vec::h4d0ef174b9ed8022
alloc::slice::hack::to_vec::he6b0a08f94435106
alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec_in::h63a19299e1e4be18
_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as$u20$core..clone..Clone$GT$::clone::h07f481c3a6d87aae
_$LT$spin_manifest..Application$LT$T$GT$$u20$as$u20$core..clone..Clone$GT$::clone::haf4884707ed35218
spin_cli::commands::up::UpCommand::run::_$u7b$$u7b$closure$u7d$$u7d$::h1a6fcbb0cca40242
tokio::park::thread::CachedParkThread::block_on::_$u7b$$u7b$closure$u7d$$u7d$::h04fc57da866dcd3a
tokio::runtime::enter::Enter::block_on::h41e73e4fe1a42bf7
tokio::runtime::Runtime::block_on::h4fcba5ea71860675
spin::main::h21ccb5f8c0ccfb14
core::ops::function::FnOnce::call_once::h945d8dbe5199763c
std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h9a74031b838d6d5d
std::rt::lang_start_internal::h358b6d58e23c88c7
main
start
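One way to avoid the clone in `Router::build` (again only a sketch; the real `Router` and route types in `spin_http_engine` differ) is for the route table to store just the route pattern and a component identifier, and look the component up in shared state at request time instead of owning a copy of it:

```rust
use indexmap::IndexMap;

/// Hypothetical route table holding only what routing needs: the raw
/// route pattern mapped to a component id, rather than a clone of the
/// whole CoreComponent (and with it the Wasm module bytes).
struct Router {
    routes: IndexMap<String, String>,
}

impl Router {
    /// Build the table from (route, component id) pairs; nothing large
    /// is copied here.
    fn build<'a>(routes: impl IntoIterator<Item = (&'a str, &'a str)>) -> Self {
        Self {
            routes: routes
                .into_iter()
                .map(|(pattern, id)| (pattern.to_owned(), id.to_owned()))
                .collect(),
        }
    }
}
```

At request time, the matched component id can then be used to fetch the component (or its prepared instance) from shared application state.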
- Not caching compiled Wasm modules (applies to both modes of running Spin)
Not caching compiled modules results in an initial spike in memory (and latency for “cold starts”), both when running locally and when running from Bindle. The memory and latency impact from caching has not yet been measured.
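If we do add caching, Wasmtime already ships an on-disk compilation cache that could be enabled roughly like this (a sketch; the exact API and configuration file location depend on the Wasmtime version in use):

```rust
use anyhow::Result;
use wasmtime::{Config, Engine};

/// Build an Engine with Wasmtime's on-disk compilation cache enabled,
/// so repeated runs of the same modules skip recompilation at startup.
fn engine_with_cache() -> Result<Engine> {
    let mut config = Config::new();
    // Load Wasmtime's default cache configuration file if present
    // (the method name has varied across Wasmtime versions).
    config.cache_config_load_default()?;
    Engine::new(&config)
}
```

This would primarily help the "cold start" latency and the initial memory spike; it does not change steady-state memory use.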
If anyone wants to explore the memory profiling, I can share the two files that can be opened with Xcode's Instruments (generated with `cargo-instruments`), but they can also be generated by running the CLI repeatedly with different app sources and sending load for 10 seconds:
$ cargo instruments -t Allocations --release --time-limit 10000 --open -- up --server http://localhost:8080/v1 --bindle your-app/0.1.0
Once we discuss potential solutions, we can start opening individual issues and PRs.
About this issue
- State: closed
- Created 2 years ago
- Comments: 22 (22 by maintainers)
Commits related to this issue
- perf: reduce peak memory usage and open files when loading bindle This patch does two things: 1. Use `bindle::client::Client::get_parcel_stream` instead of `get_parcel`. The latter loads the whole ... — committed to dicej/spin by dicej 2 years ago
`get_parcel_stream` is working nicely – will make a PR soon.

I'm thinking we should limit concurrency anyway, just to be kind to the Bindle server. That should also help address the "too many open files" issue mentioned here.
Ah, apologies for misunderstanding.
It does feel like copies could be streamed from HTTP to the destination file without being fully loaded into memory. That would probably need an update to the Bindle client library, but that would be no bad thing.
(this is not to rule out capping concurrency instead/as well)
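For illustration, a streaming copy could look roughly like the following (a sketch that assumes the client hands back a stream of `Bytes` chunks; the function and parameter names are illustrative, not the actual Bindle client API):

```rust
use anyhow::Result;
use bytes::Bytes;
use futures::{Stream, StreamExt};
use tokio::{fs::File, io::AsyncWriteExt};

/// Write a parcel to `dest` chunk by chunk, so peak memory per asset is
/// one chunk rather than the whole (possibly multi-megabyte) file.
async fn copy_streaming(
    mut chunks: impl Stream<Item = Result<Bytes>> + Unpin,
    dest: &std::path::Path,
) -> Result<()> {
    let mut file = File::create(dest).await?;
    while let Some(chunk) = chunks.next().await {
        file.write_all(&chunk?).await?;
    }
    file.flush().await?;
    Ok(())
}
```

Combined with a concurrency cap, this would bound peak memory during asset preparation to roughly (chunk size × in-flight downloads).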
Quick update: I've updated my test to make sure the HTTP trigger component actually reads all 200 assets each time it is invoked. I've also run `siege` against it for a few minutes, and memory never exceeded 12 MB (for assets of about 9 KB each).

Then I tried increasing the size of each asset to 1 MB and hit it with siege for a minute. The memory was 205 MB that time, peaking within a second of startup (i.e. memory peaked before siege even started hitting it). Most of that was from:

200+ MB does seem unnecessarily high. The innermost functions in that stack trace appear to be compiler-generated Futures from async blocks and/or functions. I'm going to try again with a debug build to see if I can get some line numbers.