parallelformers: AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.
Hello, first of all, congratulations on this amazing project. It's simple, efficient, and versatile. Very useful.
In some cases one has several GPUs but not enough CPU RAM to parallelize the model.
When I load the model on GPU and then parallelize, I get the error below:
AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.
It doesn’t stop the script, but it seems that the parallelization fails.
My question is: is it possible to load the initial model on GPU instead of CPU (even if it's less memory-efficient), or is that not possible at all?
Thanks!
About this issue
- State: closed
- Created 3 years ago
- Comments: 29 (16 by maintainers)
Thanks a lot @hyunwoongko. I will try the above and close this issue. I think my request goes beyond the scope of parallelformers. Thanks again!
It works great! Thanks for the quick addition! 🥇
Thanks again for the great work, that’s very useful.
I updated! Please upgrade the library using `pip install parallelformers --upgrade`; the basic flow after upgrading is sketched below. Release notes: https://github.com/tunib-ai/parallelformers/releases/tag/v1.2

We have discussed solving this problem several times; here is that discussion:
- https://github.com/pytorch/pytorch/issues/64327
- https://github.com/huggingface/transformers/issues/13548
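For reference, a minimal sketch of the basic parallelformers flow after upgrading, following the README-style usage (the checkpoint name, GPU count, and generation settings here are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# placeholder checkpoint; load on CPU first, as parallelformers expects
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# shard the model across 2 GPUs with fp16 weights
parallelize(model, num_gpus=2, fp16=True, verbose="detail")

# no manual .cuda() calls are needed after parallelization
inputs = tokenizer("Parallelformers is", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=15)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```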
This issue should be solved on the PyTorch side, 😦 not the Transformers side.
On the other hand, on the DeepSpeed side there is code designed so that the partitioned model can be loaded directly onto the GPUs (`deepspeed.zero.Init`). I don't know much about the internal implementation, but it would be good to refer to it: https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models
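As a rough illustration (this is not parallelformers code, and it assumes a script launched under a distributed runtime such as the `deepspeed` launcher), the pattern from the linked docs looks like this; the model is a placeholder:

```python
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

# construct the model inside deepspeed.zero.Init so that parameters are
# partitioned across the data-parallel group as they are created, rather
# than being fully materialized on CPU first
config = AutoConfig.from_pretrained("gpt2")  # placeholder model
with deepspeed.zero.Init():
    model = AutoModelForCausalLM.from_config(config)
```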
How about using Transformers' `low_cpu_mem_usage` if you run out of CPU memory? The trade-off is slower loading. I recommend the code below to you; I think it's the best way to keep CPU memory low.
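A minimal sketch of that combination, assuming a placeholder checkpoint and two GPUs (`low_cpu_mem_usage` is a `from_pretrained` option in Transformers):

```python
from transformers import AutoModelForCausalLM
from parallelformers import parallelize

# low_cpu_mem_usage=True keeps peak CPU RAM at roughly one copy of the
# weights instead of two, at the cost of slower loading
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neo-2.7B",  # placeholder checkpoint
    low_cpu_mem_usage=True,
)

# then shard the CPU-resident model across the GPUs as usual
parallelize(model, num_gpus=2, fp16=True)
```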