fairseq: M2M-100: generate OOMs on V100

I ran the download steps and the documented “generate on a V100” command:

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_checkpoint.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[26]' \
    --pipeline-encoder-devices '[0]' \
    --pipeline-decoder-balance '[1,24,1]' \
    --pipeline-decoder-devices '[0,1,0]' > gen_out

on a V100 with torch 1.5 and hit an OOM. My environment:

fairscale==0.0.3
fairseq # pip install -e . from source at 9b0611e6
torch==1.5.1+cu101
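
For reference, a rough back-of-the-envelope check (my own arithmetic, not a number from the README) suggests the fp16 weights of the 12B model barely fit across the two devices that the pipeline arguments above use, which would explain the OOM on 16GB cards:

    # Rough memory estimate (illustrative only): 12B parameters in fp16
    # is ~2 bytes per parameter for the weights alone.
    params = 12e9
    weight_gib = params * 2 / 2**30      # ~22 GiB of fp16 weights in total
    per_gpu_gib = weight_gib / 2         # the devices lists above only use GPUs 0 and 1
    print(f"weights ≈ {weight_gib:.0f} GiB total, ≈ {per_gpu_gib:.0f} GiB per GPU "
          f"before activations, beam search state, and CUDA overhead")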

Questions

  1. Has this command worked for others?
  2. Does anyone have a working generate command that takes advantage of multiple GPUs?

cc: @shruti-bh

Thanks in advance!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 22 (15 by maintainers)

Most upvoted comments

I will try to get these models and commands in by end of this week or early next week!

That worked (on 6debe291). Thanks!

Also, if you are planning on changing the .pt file, it would be awesome if you could remove the optimizer states. They are ~70GB, and I think fairseq-generate will work without them.
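
In case it helps, here is a minimal sketch of the kind of stripping I mean, assuming the optimizer state sits under the usual fairseq last_optimizer_state key (I have not checked the exact layout of this particular checkpoint):

    # Hypothetical stripping script: drop the optimizer state from a fairseq checkpoint.
    # Needs enough CPU RAM to hold the full checkpoint in memory.
    import torch

    ckpt = torch.load("12b_last_checkpoint.pt", map_location="cpu")
    ckpt.pop("last_optimizer_state", None)   # remove the optimizer state if present
    torch.save(ckpt, "12b_last_checkpoint.stripped.pt")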

@sshleifer I cannot reproduce this on my end yet. Have you pulled the latest master of fairseq? When I added the new model checkpoints, I also needed to make some code changes to ensure everything works correctly on top of the new dataclass configs that were recently added to fairseq. Note that the “model_cfg” argument exists in the load_state_dict() of PipelineParallelTransformerModel() in the latest master: https://github.com/pytorch/fairseq/blob/master/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py#L323

@sshleifer - The README (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) now contains checkpoints that work with 4 16GB GPUs, along with the pipeline arguments to use at generation time. Let me know if this works on your end. I also removed the optimizer states, so the checkpoint is now ~48GB, as @mjpost mentioned.

@damyana79 I added checkpoints that should work with 6 12GB GPUs, along with the pipeline arguments to use at generation time. We will look into adding CPU generation as well.

@shruti-bh can you confirm that the tokenizers were only used for evaluation, and not in preprocessing of the training data? So the SPM model was applied to raw text? We’re doing some sanity checking of the model and want to make sure we have this important detail right.

@mjpost I can confirm that the tokenizers were only used for computing BLEU by tokenizing the hypotheses and references. For preprocessing the training or validation data, we do not apply tokenizers (ref: https://github.com/pytorch/fairseq/tree/master/examples/m2m_100#introduction)
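
To make that split concrete, here is a minimal illustration (not the official eval pipeline), assuming the spm.128k.model file from the README download:

    # Sketch of the preprocessing vs. evaluation split described above (illustrative only).
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm.128k.model")

    # Preprocessing: SPM is applied directly to raw, untokenized text before binarization.
    pieces = sp.encode("Das ist ein Test.", out_type=str)

    # Generation: --remove-bpe 'sentencepiece' turns model output back into raw text.
    detok = sp.decode(pieces)

    # Evaluation: a tokenizer is applied only to hypotheses and references when computing BLEU.
    print(pieces, detok)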

I can confirm fairseq-generate works after removing the optimizer states (script); the reduced model is 48 GB.

Do you mean you would like inference to work across 8 16GB V100 GPUs?

Yes, or 4 16GB V100 GPUs if possible.