fairseq: M2M-100: generate OOMs on V100

I ran the download steps and the documented “generate on a V100” command:

fairseq-generate \
    data_bin \
    --batch-size 1 \
    --path 12b_last_checkpoint.pt \
    --fixed-dictionary model_dict.128k.txt \
    -s de -t fr \
    --remove-bpe 'sentencepiece' \
    --beam 5 \
    --task translation_multi_simple_epoch \
    --lang-pairs language_pairs.txt \
    --decoder-langtok --encoder-langtok src \
    --gen-subset test \
    --fp16 \
    --dataset-impl mmap \
    --distributed-world-size 1 --distributed-no-spawn \
    --pipeline-model-parallel \
    --pipeline-chunks 1 \
    --pipeline-encoder-balance '[26]' \
    --pipeline-encoder-devices '[0]' \
    --pipeline-decoder-balance '[1,24,1]' \
    --pipeline-decoder-devices '[0,1,0]' > gen_out

on a V100 with torch 1.5 and hit an OOM. My environment:

fairscale==0.0.3
fairseq # pip install -e . from source at 9b0611e6
torch==1.5.1+cu101
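
For reference, a rough back-of-the-envelope check (my own arithmetic, not a number from the README) suggests the fp16 weights of the 12B model barely fit across the two devices that the pipeline arguments above use, which would explain the OOM on 16GB cards:

    # Rough memory estimate (illustrative only): 12B parameters in fp16
    # is ~2 bytes per parameter for the weights alone.
    params = 12e9
    weight_gib = params * 2 / 2**30      # ~22 GiB of fp16 weights in total
    per_gpu_gib = weight_gib / 2         # the devices lists above only use GPUs 0 and 1
    print(f"weights ≈ {weight_gib:.0f} GiB total, ≈ {per_gpu_gib:.0f} GiB per GPU "
          f"before activations, beam search state, and CUDA overhead")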

Questions

  1. Has this command worked for others?
  2. Does anyone have a working generate command that takes advantage of multiple GPUs?

cc: @shruti-bh

Thanks in advance!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 22 (15 by maintainers)

Most upvoted comments

I will try to get these models and commands in by end of this week or early next week!

That worked (on 6debe291). Thanks!

Also, if you are planning on changing the .pt file, it would be awesome if you could remove the optimizer states. They are ~70GB, and I think fairseq-generate will work without them.
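
In case it helps, here is a minimal sketch of the kind of stripping I mean, assuming the optimizer state sits under the usual fairseq last_optimizer_state key (I have not checked the exact layout of this particular checkpoint):

    # Hypothetical stripping script: drop the optimizer state from a fairseq checkpoint.
    # Needs enough CPU RAM to hold the full checkpoint in memory.
    import torch

    ckpt = torch.load("12b_last_checkpoint.pt", map_location="cpu")
    ckpt.pop("last_optimizer_state", None)   # remove the optimizer state if present
    torch.save(ckpt, "12b_last_checkpoint.stripped.pt")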

@sshleifer I cannot reproduce this on my end yet. Have you pulled the latest master of fairseq? When I added the new model checkpoints, I also needed to make some code changes to ensure everything works correctly on top of the new dataclass configs that were recently added to fairseq. Note that the “model_cfg” argument exists in the load_state_dict() of PipelineParallelTransformerModel() in the latest master: https://github.com/pytorch/fairseq/blob/master/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py#L323

@sshleifer - The README (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) now contains checkpoints that work with 4 16GB GPUs, along with the pipeline arguments to use at generation time. Let me know if this works on your end. I also removed the optimizer states, so the checkpoint is now ~48GB, as @mjpost mentioned.

@damyana79 I added checkpoints that should work with 6 12GB GPUs, along with the pipeline arguments to use at generation time. We will look into adding CPU generation as well.

@shruti-bh can you confirm that the tokenizers were only used for evaluation, and not in preprocessing of the training data? So the SPM model was applied to raw text? We’re doing some sanity checking of the model and want to make sure we have this important detail right.

@mjpost I can confirm that the tokenizers were only used for computing BLEU by tokenizing the hypotheses and references. For preprocessing the training or validation data, we do not apply tokenizers (ref: https://github.com/pytorch/fairseq/tree/master/examples/m2m_100#introduction)
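
To make that split concrete, here is a minimal illustration (not the official eval pipeline), assuming the spm.128k.model file from the README download:

    # Sketch of the preprocessing vs. evaluation split described above (illustrative only).
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm.128k.model")

    # Preprocessing: SPM is applied directly to raw, untokenized text before binarization.
    pieces = sp.encode("Das ist ein Test.", out_type=str)

    # Generation: --remove-bpe 'sentencepiece' turns model output back into raw text.
    detok = sp.decode(pieces)

    # Evaluation: a tokenizer is applied only to hypotheses and references when computing BLEU.
    print(pieces, detok)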

I can confirm fairseq-generate works after removing the optimizer states (script); the reduced model is 48 GB.

Do you mean you would like inference to work across 8 16GB V100 GPUs?

Yes, or 4 16GB V100 GPUs if possible.