transformers: [model_utils] very slow model instantiation
For some reason I’m noticing a very slow model instantiation time.
For example, to load shleifer/distill-mbart-en-ro-12-4 it takes:
- 21 secs to instantiate the model
- 0.5 secs to `torch.load` its weights
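To make the numbers concrete, here is a minimal timing sketch along those lines (not the exact code from this report; the checkpoint path is a placeholder):

```python
import time

import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

mname = "shleifer/distill-mbart-en-ro-12-4"  # model id as given above
config = AutoConfig.from_pretrained(mname)

# phase 1: build the model object from its config (this runs the weight init)
t0 = time.time()
model = AutoModelForSeq2SeqLM.from_config(config)
print(f"instantiate: {time.time() - t0:.1f} secs")

# phase 2: torch.load of the checkpoint file (path is a placeholder)
t0 = time.time()
state_dict = torch.load("/path/to/pytorch_model.bin", map_location="cpu")
print(f"torch.load:  {time.time() - t0:.1f} secs")
```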
If I'm not changing how the model is created and just want to quickly fast-forward to the area I'm debugging, how could these slow parts be cached rather than rebuilt anew again and again?
But it also looks like we are doing a completely wasteful `init_weights` operation, which immediately gets overwritten with the pretrained model weights (https://github.com/huggingface/transformers/issues/9205#issuecomment-748741195) - at least for the pre-trained model use case.
(I initially made a mistake and thought that it was `torch.load` that had the issue, but it's `cls(config, *model_args, **model_kwargs)` - thank you, @sgugger - so this post has been edited to reflect reality. If you're joining later you can skip the comments up to https://github.com/huggingface/transformers/issues/9205#issuecomment-748722644 and continue from there.)
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 21 (14 by maintainers)
I'm happy to add such a feature. It should be feasible to only initialize those layers that are not in the saved `.pt` file. It's on my To-Do list, but I still don't think I'll be able to take a look within the next 2-3 weeks - sorry 😕 If you find some time for this, it would be great.
@patrickvonplaten, @sgugger, @LysandreJik - could we please revisit this? Working on making t5-11b train was painful - it was taking a really, really long time to init the model, just to drop it and replace it with the pre-trained weights. Transformers is mainly about pre-trained models, so perhaps this can be made configurable somehow?
We know when a pretrained model is loaded, so why not propagate that information and let the model know it’s being loaded in pre-trained mode, so that it could skip any weight inits that are going to be replaced anyway?
And while we are at it, I don’t suppose there is a way to involve more than one CPU core in loading the model? I guess that would be a question for pytorch.
Thank you!
I totally get that it's not high priority, since most people don't care about a slow start when they run non-stop for hours - it only affects people who need a quick start, which is the case when debugging something or, as I suggested, for the demo function on the model pages, which takes a really long time to load.
In the case of BART, its deterministic segments (e.g. the sinusoidal positional embeddings) do their init internally, so as a proof of concept it's enough to just monkeypatch the weight init away:
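The original snippet isn't reproduced here; a rough reconstruction of that kind of monkeypatch could look like this (the BART base-class name has changed across `transformers` versions, hence the fallback):

```python
import transformers.models.bart.modeling_bart as modeling_bart


def _skip_init(self, module):
    # no-op: skip the transformers-level random init; the pretrained
    # checkpoint weights will replace these parameters anyway
    pass


# older releases call the base class PretrainedBartModel, newer ones BartPretrainedModel
bart_base = getattr(modeling_bart, "BartPretrainedModel", None) or getattr(
    modeling_bart, "PretrainedBartModel"
)
bart_base._init_weights = _skip_init
```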
With that patch, the command I was benchmarking goes from 25 secs to 8 secs, and the instantiation itself goes from 22 secs to 5 secs.
There are a few `uniform_` calls left which account for an extra 2.3 secs; if those are shaved off too, we should be down to 2-3 secs (from 22!). I quickly checked that the model still functions normally - same scores - well, I did just one finetune_trainer run.
One way is to solve this as @patrickvonplaten suggested, and I'm also thinking of changing the design a bit, so that each model has a normal `init_weights` and an `init_weights_pretrained` - then it's very clear to the developer what goes where, and we simply invoke one or the other depending on the context. Then it's just a matter of choosing how to signal the context.

Profiling the model instantiation code shows that `_init_weights` is where some 75% of that slowdown happens. So we are completely wasting time doing the weight init, since we immediately replace those weights (with the exception of `SinusoidalPositionalEmbedding`, which does not get loaded from the pretrained model). If you prefer the visual version: (image not reproduced here).
Chances are that model init needs to be made context-aware, so that it doesn't init weights that will be immediately replaced. Thoughts?

That would make `transformers` so much faster to start! (e.g. think of the model pages website, which takes forever to load a model.)

The profiling was done with:
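The actual profiling invocation was elided above; a `cProfile`-based equivalent would be something along these lines:

```python
import cProfile
import pstats

from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("shleifer/distill-mbart-en-ro-12-4")

prof = cProfile.Profile()
prof.enable()
AutoModelForSeq2SeqLM.from_config(config)  # instantiation only, no weight loading
prof.disable()

# look for _init_weights and the normal_/uniform_ calls near the top of the listing
pstats.Stats(prof).sort_stats("cumulative").print_stats(25)
```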
I'm in the same boat as @stas00. I understand that the code needs to maintain wide compatibility across the ocean of models, but people need a working workaround before an elegant solution comes into existence. I believe that as huggingface slowly graduates from the pure research field, more and more people are being hurt by the tremendous model initialization time. Hoping for a change.
Hello @AyeshaSarwar,

could you please use the forum https://discuss.huggingface.co/ instead for such questions? We don't support Flask compatibility in `transformers`. Please keep in mind that the issues are mainly used for issues related to just `transformers`. Thanks!
Yeah, Patrick's suggestion is probably the best, though I'm not sure it can easily be achieved in the current API. Note that this is only a one-time slowdown at the beginning of training, so I don't think this should be high priority.

If we see a significant gain in loading time, maybe it's worth exploring a way to only apply `init_weights` on the missing layers. Not sure how easy it would be to implement though… Maybe an `init_weights` function arg in `__init__` might make sense:
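A minimal sketch of that idea, assuming a hypothetical model class (this is not the actual `transformers` API; names like `MyPreTrainedModel` are illustrative):

```python
from torch import nn


class MyPreTrainedModel(nn.Module):
    def __init__(self, config, init_weights: bool = True):
        super().__init__()
        self.proj = nn.Linear(config.d_model, config.d_model)
        if init_weights:
            # expensive random init, pointless if a checkpoint is loaded right after
            self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()


# a from_pretrained-like code path could then do:
#   model = cls(config, init_weights=False)
#   missing, unexpected = model.load_state_dict(state_dict, strict=False)
#   # ...and run _init_weights only on the modules that were actually missing
```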