bergamot-translator: Loading time is really slow with large thread count once again
This was identified as a bergamot-translator issue in https://github.com/XapaJIaMnu/translateLocally/issues/76.
A nice solution may involve sharing model memory across worker threads (avoiding the intgemm/shortlist preprocessing placing data in the graph [needs verification]). This memory would be owned by TranslationModel. Everything transient would remain in the workspace, with the workspace attached to the worker. This issue is closely related to #257.
A temporary workaround provided by @jelmervdl is:
```diff
diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp
index 9d2eb0cdb73526584d53e5cc2e32facfffc9650e..753b500fea4629fde1452b67f76d5862185a1df8 100644
--- a/src/translator/translation_model.cpp
+++ b/src/translator/translation_model.cpp
@@ -45,8 +45,15 @@ TranslationModel::TranslationModel(const Config &options, MemoryBundle &&memory
     }
   }
 
+  std::vector<std::future<void>> loadCalls;
+  loadCalls.resize(replicas);
+
   for (size_t idx = 0; idx < replicas; idx++) {
-    loadBackend(idx);
+    loadCalls[idx] = std::async(&TranslationModel::loadBackend, this, idx);
+  }
+
+  for (auto &&loadCall : loadCalls) {
+    loadCall.wait();
   }
 }
 
```
I'm unsure about putting std::thread or std::async within TranslationModel; the threading and delegation should ideally live within Service. As part of resolving this, we should ideally check in something on var through BRT which checks that model-loading speeds remain unaffected hereafter.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 17 (12 by maintainers)
Commits related to this issue
- Proposed quick fix for #293 parallel model loading — committed to browsermt/bergamot-translator by jelmervdl 2 years ago
We should put energy into solving the underlying problem by loading the model once and sharing the memory across threads, rather than kludges on top.
On January 3, 2022 2:06:30 PM UTC, Nikolay Bogoychev @.***> wrote:
It’s a bit of a misnomer in this case, maybe. The problem right now is that all workers are initialised sequentially on the main thread. Delaying that initialisation until it's necessary is just an easy way to move the initialisation into the worker threads. The worker threads can then all individually do their own initialisation, so the initialisation can be done in parallel.