fairseq: m2m: generate OOMs on v100
I ran the downloads and the documented “generate on a v100” command:
fairseq-generate \
data_bin \
--batch-size 1 \
--path 12b_last_checkpoint.pt \
--fixed-dictionary model_dict.128k.txt \
-s de -t fr \
--remove-bpe 'sentencepiece' \
--beam 5 \
--task translation_multi_simple_epoch \
--lang-pairs language_pairs.txt \
--decoder-langtok --encoder-langtok src \
--gen-subset test \
--fp16 \
--dataset-impl mmap \
--distributed-world-size 1 --distributed-no-spawn \
--pipeline-model-parallel \
--pipeline-chunks 1 \
--pipeline-encoder-balance '[26]' \
--pipeline-encoder-devices '[0]' \
--pipeline-decoder-balance '[1,24,1]' \
--pipeline-decoder-devices '[0,1,0]' > gen_out
on a V100 with torch 1.5, and it OOMed.
fairscale==0.0.3
fairseq # pip install -e . from source at 9b0611e6
torch==1.5.1+cu101
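As a quick sanity check on the pipeline arguments above, here is a minimal sketch assuming standard fairscale Pipe semantics (one device entry per pipeline partition, and each balance list summing to the number of pipeline modules on that side; the helper name is made up):

# Sketch: check that the pipeline partitioning arguments are internally consistent.
def check_pipeline_args(balance, devices, expected_modules):
    # One device per partition, and the partitions must cover every module.
    assert len(balance) == len(devices), "need one device entry per partition"
    assert sum(balance) == expected_modules, "balance must sum to the module count"

# Values from the command above: 26 pipeline modules on each side.
check_pipeline_args(balance=[26], devices=[0], expected_modules=26)               # encoder
check_pipeline_args(balance=[1, 24, 1], devices=[0, 1, 0], expected_modules=26)   # decoder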
Questions
- Has this command worked for others?
- Does anyone have a working generate command that takes advantage of multiple GPUs?
cc: @shruti-bh
Thanks in advance!
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 22 (15 by maintainers)
Commits related to this issue
- Get 12B M2M-100 model generation to work correctly on exactly 2 32gb gpus (#1366) Summary: # What does this PR do? Addresses https://github.com/pytorch/fairseq/issues/2772 where external users can't ... — committed to facebookresearch/fairseq by shruti-bh 4 years ago
I will try to get these models and commands in by end of this week or early next week!
that worked! (on 6debe291) Thanks! Also, if you are planning on changing the .pt file, it would be awesome if you could remove the optimizer states. They are 70GB, and I think fairseq-generate will work without them.

@sshleifer I cannot reproduce this on my end yet. Have you pulled the latest master of fairseq? When I added the new model checkpoints, I also needed to make some code changes to ensure that everything worked correctly on top of the new dataclass configs that were recently added to fairseq. Note that the "model_cfg" argument exists in load_state_dict() of PipelineParallelTransformerModel() in the latest master: https://github.com/pytorch/fairseq/blob/master/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py#L323

@sshleifer - The README (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) now contains checkpoints that work with 4 16GB GPUs, along with the pipeline arguments to use at generation time. Let me know if this works on your end. I also removed the optimizer states, so the checkpoint is now ~48GB, as @mjpost mentioned.
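A quick way to check whether a local fairseq checkout already has the model_cfg change mentioned above is to inspect the method signature (just a sketch; it only looks at the signature):

import inspect
from fairseq.model_parallel.models.pipeline_parallel_transformer.model import (
    PipelineParallelTransformerModel,
)

# If "model_cfg" is missing here, the local checkout predates the checkpoint/dataclass changes.
params = inspect.signature(PipelineParallelTransformerModel.load_state_dict).parameters
print("model_cfg" in params)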
@damyana79 I added checkpoints that should work with 6 12GB GPUs, along with the associated pipeline arguments to use at generation time. We will look into adding CPU generation as well.
@mjpost I can confirm that the tokenizers were only used for computing BLEU by tokenizing the hypotheses and references. For preprocessing the training or validation data, we do not apply tokenizers (ref: https://github.com/pytorch/fairseq/tree/master/examples/m2m_100#introduction)
I can confirm fairseq-generate works after removing the optimizer states (script); the reduced model is 48 GB.

Yes, or 4 16GB V100 GPUs if possible.
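For reference, a minimal sketch of what stripping the optimizer states from the checkpoint could look like (assuming the states live under fairseq's usual last_optimizer_state / optimizer_history keys; the output filename is hypothetical):

import torch

# Load the full checkpoint on CPU; the 12B checkpoint is large, so this needs a lot of host RAM.
state = torch.load("12b_last_checkpoint.pt", map_location="cpu")

# Drop the optimizer-related entries (assumed fairseq checkpoint key names).
for key in ("last_optimizer_state", "optimizer_history"):
    state.pop(key, None)

# Save the reduced, model-only checkpoint under a new name.
torch.save(state, "12b_model_only.pt")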