transformers: [model_utils] very slow model instantiation

For some reason I’m noticing a very slow model instantiation time.

For example, loading sshleifer/distill-mbart-en-ro-12-4 takes:

  • 21 secs to instantiate the model
  • 0.5 secs to torch.load its weights.
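
Roughly, this is how the two phases can be timed separately (just a sketch - the weights path below is a placeholder for the locally cached checkpoint file, and AutoModelForSeq2SeqLM.from_config is used to get the instantiation without the download/load step):

import time
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM

mname = "sshleifer/distill-mbart-en-ro-12-4"
config = AutoConfig.from_pretrained(mname)

t0 = time.time()
model = AutoModelForSeq2SeqLM.from_config(config)  # instantiation, including init_weights
print(f"instantiate: {time.time() - t0:.1f} secs")

t0 = time.time()
# the path is a placeholder - point it at the locally cached pytorch_model.bin
state_dict = torch.load("/path/to/pytorch_model.bin", map_location="cpu")
print(f"torch.load:  {time.time() - t0:.1f} secs")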

If I’m not changing how the model is created and want to quickly fast-forward to the area I’m debugging, how could these slow parts be cached instead of being rebuilt anew again and again?

But it also looks like init_weights is a completely wasteful operation, since the initialized weights immediately get overwritten with the pretrained model weights (https://github.com/huggingface/transformers/issues/9205#issuecomment-748741195) (for the pre-trained model use case).

(I initially made a mistake and thought that it was torch.load that had the issue, but it’s cls(config, *model_args, **model_kwargs) - thank you, @sgugger - so this post has been edited to reflect reality. If you’re joining later, you can skip the comments up to https://github.com/huggingface/transformers/issues/9205#issuecomment-748722644 and continue from there.)

@patrickvonplaten, @sgugger, @LysandreJik

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 21 (14 by maintainers)

Most upvoted comments

I’m happy to add such a feature. It should be feasible to only initialize those layers that are not in the saved .pt file.

It’s on my to-do list, but I still don’t think I’ll be able to take a look within the next 2-3 weeks - sorry 😕 If you find some time for this, it would be great!

@patrickvonplaten, @sgugger, @LysandreJik - could we please revisit this? Working on making t5-11b train was painful - it was taking a really, really long time to init the model, just to drop those weights and replace them with the pre-trained ones. Transformers is mainly about pre-trained models, so perhaps this could be made configurable somehow?

We know when a pretrained model is loaded, so why not propagate that information and let the model know it’s being loaded in pre-trained mode, so that it could skip any weight inits that are going to be replaced anyway?

And while we are at it, I don’t suppose there is a way to involve more than one CPU core in loading the model? I guess that would be a question for pytorch.

Thank you!

I totally get that it’s not high priority, since most people don’t care about a slow start when they then run non-stop for hours. It only affects people who need a quick start - which is the case when debugging something, or, as I suggested, the demo function on the model pages, which takes a really long time to load.

In the case of BART, its deterministic segments (e.g. the sinusoidal positional embeddings) do their init internally, so as a proof of concept it’s enough to just monkeypatch init_weights:

        # modeling_utils.py::from_pretrained
        # save the original method so it can be restored afterwards
        init_weights_orig = PreTrainedModel.init_weights

        def init_weights_pretrained(self):
            # skip the expensive random init - the pretrained weights will
            # overwrite it anyway; keep only the deterministic parts
            # self.apply(self._init_weights)
            if self.config.pruned_heads:
                self.prune_heads(self.config.pruned_heads)
            self.tie_weights()

        # temporarily swap in the cheap init, instantiate, then restore the original
        PreTrainedModel.init_weights = init_weights_pretrained
        model = cls(config, *model_args, **model_kwargs)
        PreTrainedModel.init_weights = init_weights_orig

and this command:

PYTHONPATH=../../src USE_TF=0 time python -c 'from transformers import AutoModelForSeq2SeqLM; AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distill-mbart-en-ro-12-4")'

goes from 25 secs to 8 secs. The instantiation alone goes from 22 secs to 5 secs.

There are a few uniform_ calls left, which account for an extra 2.3 secs; if those are shaved off too, we should be down to 2-3 secs (from 22!).

I quickly checked that the model still functions normally - same scores - though I only did one finetune_trainer run.

One way is to solve this as @patrickvonplaten suggested. I’m also thinking of changing the design a bit, so that each model has a normal init_weights and an init_weights_pretrained - then it’s very clear to the developer what goes where, and we simply invoke one or the other depending on the context. Then it’s just a matter of choosing how to signal the context.
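
Purely to illustrate the shape of that split, here is a toy standalone example (not the actual transformers API - the class, the pretrained flag and the method bodies are made up):

from torch import nn

class ToyModel(nn.Module):
    def __init__(self, hidden=1024, layers=12, pretrained=False):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(layers))
        # the context is signalled here; in transformers it would be set by from_pretrained
        if pretrained:
            self.init_weights_pretrained()
        else:
            self.init_weights()

    def init_weights(self):
        # full random init - only needed when training from scratch
        for block in self.blocks:
            nn.init.normal_(block.weight, std=0.02)
            nn.init.zeros_(block.bias)

    def init_weights_pretrained(self):
        # nothing to randomize - the checkpoint will overwrite these weights anyway;
        # only deterministic work (weight tying, head pruning, fixed buffers) would go here
        pass

model = ToyModel(pretrained=True)  # fast path: skips the extra init pass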

So, profiling the model instantiation code, it can be seen that _init_weights is where some 75% of that slowdown happens:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      354   18.942    0.054   18.942    0.054 {method 'normal_' of 'torch._C._TensorBase' objects}
      225    2.286    0.010    2.286    0.010 {method 'uniform_' of 'torch._C._TensorBase' objects}

So we are completely wasting time doing the weight init, since we immediately replace those weights (with the exception of SinusoidalPositionalEmbedding, which does not get loaded from the pretrained model).

If you prefer the visual version:

[call graph screenshot (callgraph.svg)]

Chances are the model init needs to be made context-aware, so that it doesn’t init weights that will be immediately replaced. Thoughts?

That would make transformers so much faster to start! (e.g. think of the model pages website, which takes forever to load a model).

The profiling was done with:

# prep
pip install graphviz gprof2dot
cat <<EOT > prog
from transformers import AutoModelForSeq2SeqLM
AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distill-mbart-en-ro-12-4")
EOT

# text profile
USE_TF=0 PYTHONPATH=src python -m cProfile -s tottime prog > profile.txt
head -10 profile.txt

# visual profile
USE_TF=0 PYTHONPATH=src python -m cProfile -o profile.pstats prog
gprof2dot -f pstats profile.pstats |  dot -Tsvg -o callgraph.svg
display callgraph.svg

I’m in the same boat as @stas00. I understand that the code needs to maintain wide compatibility across the ocean of models, but people need a working workaround before an elegant solution is born into reality. I believe that as huggingface slowly graduates from the pure research field, more and more people are being hurt by the tremendous model initialization time. Hoping for a change.

Hello @AyeshaSarwar,

could you please use the forum (https://discuss.huggingface.co/) for such questions instead? We don’t support Flask compatibility in transformers. Please keep in mind that GitHub issues are mainly meant for issues related to the transformers library itself.

Thanks

Yeah, Patrick’s suggestion is probably the best, though I’m not sure it can easily be achieved in the current API. Note that this is only a one-time slowdown at the beginning of training, so I don’t think this should be high priority.

If we see a significant gain in loading time, maybe it’s worth exploring a way to only apply init_weights to the missing layers. Not sure how easy it would be to implement, though…

Maybe an init_weights argument in __init__ might make sense:

model = cls(config, init_weights=False, *model_args, **model_kwargs)  # don't call init_weights, but initialize all weights to zero because it's much faster
# load weights into model and get missing layers
# init missing layers
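
To flesh out the last two comment lines, a rough sketch of what that could look like (load_pretrained_fast is a hypothetical helper; it relies on load_state_dict(strict=False) reporting the missing keys and on the model’s per-module _init_weights hook, and assumes the checkpoint keys match the model’s):

import torch

def load_pretrained_fast(model, weights_path):
    # model was constructed without the expensive init_weights pass
    state_dict = torch.load(weights_path, map_location="cpu")
    load_result = model.load_state_dict(state_dict, strict=False)

    # re-init only the modules that own at least one missing parameter or buffer
    missing_prefixes = {key.rsplit(".", 1)[0] for key in load_result.missing_keys}
    for name, module in model.named_modules():
        if name in missing_prefixes:
            model._init_weights(module)  # the same per-model init hook used by init_weights
    return model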