DeepSpeed: zero3 hangs in inference

Training works with zero3, but when I then run inference by calling the DeepSpeed engine's forward(), it works on a very small sample, yet with a just slightly bigger sample it hangs at 100% GPU utilization:

Thread 0x00007f57caf71740 (most recent call first):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/cuda/streams.py", line 95 in synchronize
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 490 in _synchronize_communication
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 406 in fetch_sub_module
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1139 in pre_sub_module_forward_function
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1071 in _pre_forward_module_hook
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881 in _call_impl
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 451 in project
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 474 in forward
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 540 in forward
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 633 in forward
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 954 in forward
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 1505 in forward
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 893 in forward
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 872 in _call_impl
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer_seq2seq.py", line 185 in prediction_step
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer.py", line 1800 in prediction_loop
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer.py", line 1647 in evaluate
  File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer_seq2seq.py", line 74 in evaluate
  File "examples/seq2seq/run_seq2seq.py", line 607 in main
  File "examples/seq2seq/run_seq2seq.py", line 655 in <module>

The trace is from faulthandler, so please read it in reverse (most recent call first).

I’m not sure if you have inference tests; maybe this can be reproduced with just model.eval()?
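
Stripped down, the failing call pattern is roughly the following (a minimal sketch with hypothetical names, not the actual run_seq2seq.py invocation):

# Minimal sketch of the failing call pattern (hypothetical snippet, not the
# actual run_seq2seq.py script); launched via e.g.: deepspeed --num_gpus=2 repro.py
import torch
import deepspeed
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# ds_config.json holds the config pasted below
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
engine.module.eval()

batch = tokenizer(
    ["translate English to German: The house is wonderful."],
    return_tensors="pt",
).to(engine.device)

with torch.no_grad():
    # every forward goes through the ZeRO-3 pre-forward hooks that all-gather
    # the partitioned parameters; with a slightly bigger eval sample this is
    # where it hangs
    out = engine(**batch, decoder_input_ids=batch["input_ids"])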

Config:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "cpu_offload_use_pin_memory" : true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e8,
        "stage3_prefetch_bucket_size": 2e5,
        "stage3_param_persitance_threshold": 1e5,
        "reduce_bucket_size": 3e6,
        "prefetch_bucket_size": 3e6,
        "sub_group_size": 1e6
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

Thanks.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 15 (12 by maintainers)

Most upvoted comments

Thank you @stas00 for digging into this. I am glad you were able to get to the core of the problem.

I understood the problem.

During inference we have an option to generate predictions which can then be scored against BLEU, etc. Different sequences may take a different number of forward passes to complete this task.

So when one GPU finishes generating its predictions sooner than the others (say a stopping criterion decided it was done at a length of 10 tokens while the others weren’t, with a max_length of 15), the other GPUs are now stuck waiting for the first GPU to continue running forward, which it will never do.

This makes sense. This is pretty much what I was expecting as well. Since ZeRO-3 is a single program multiple data (SPMD) approach to parallelism with coordinated data movement, all processes must be running the same program, in this case the forward pass on the model, for it to work correctly.

To ensure that this is so, I hacked the code to keep the while loop running until it hit max_length on all GPUs, repeating the last forward call, and the problem went away.
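
Schematically, the hack amounts to something like this (a simplified, decoder-only-style sketch of the idea, not the actual transformers generate code; greedy_generate_zero3 is a hypothetical name):

# Simplified sketch of the workaround (hypothetical helper, not the actual
# transformers generate code), shown in decoder-only style for brevity.
import torch

def greedy_generate_zero3(model, input_ids, max_length, eos_token_id):
    finished = False
    for _ in range(max_length - input_ids.shape[1]):
        # Under ZeRO-3 this forward participates in the collective parameter
        # all-gathers, so every rank must keep running it on every step.
        outputs = model(input_ids=input_ids)
        if finished:
            # This rank already hit EOS: re-run the last valid input as a
            # dummy forward and throw the result away until max_length.
            continue
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        finished = bool((next_token == eos_token_id).all())
    return input_ids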

I am not at all sure this hack will be acceptable as:

  1. The code where we run the generate loop is quite a few call frames away from the trainer and, as mentioned earlier, doesn’t know anything about such special circumstances or that it’s running under deepspeed (or fairscale), since it was designed to work with any model.

I agree that the hack is limiting but I have a slightly different view on the “designed to work with any model” part. It seems that the code is actually designed to work only with single GPU models, and is limited in that sense. As long as the model is single GPU, it will work, but it will not work with any multi-GPU model regardless of whether it is ZeRO-3 or model parallel (tensor slicing) or pipeline parallel, since each of them requires some form of special treatment that is inherent in the parallelism itself. For example, model parallelism would require the data loader to give the same sample to all GPUs, and pipeline parallelism would require the data loader to give samples only to the first stage GPU.

A potential solution here could be to extend the code to support multi-GPU inference by allowing for adaptable variations based on the type of parallelism being used.

  2. It wastes resources running dumb forward calls and throwing the results away.

This, I think, can be mitigated to the point where the wasted resources are minimal. Two potential solutions:

  • If the generate code can support a batch size > 1, then run with a large batch size, all running for max_len. During inference a larger batch will in general give better throughput, and with a large batch size the probability of getting a long sequence generated increases, so the expected wasted resources go down. A large batch size will also significantly reduce the communication overhead of ZeRO-3.
  • If batch size > 1 is not supported, run all the generation for all the samples one after another until you are done with all the samples before doing anything else. As you noticed, as long as each process is running a forward on something, it will run fine. There will still be some wasted resources at the very end due to differences in the total number of generated tokens across the queries, but this will be much less than running fake forwards for each query; one possible way to coordinate that tail end is sketched below.
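
One possible way to coordinate the tail end of that second option (a rough sketch only, assuming torch.distributed is already initialized; run_until_all_ranks_done, step_fn and dummy_step_fn are hypothetical names) is to have every rank perform one matched all-reduce per forward step and keep the finished ranks on dummy forwards:

# Rough sketch: keep all ranks stepping together until every rank has
# drained its queue (assumes torch.distributed is initialized).
import torch
import torch.distributed as dist

def run_until_all_ranks_done(step_fn, dummy_step_fn, device):
    this_rank_done = False
    while True:
        if not this_rank_done:
            # one real generation forward step; step_fn returns True while
            # this rank still has samples/tokens left to generate
            this_rank_done = not step_fn()
        else:
            # dummy forward on the last valid input, result is discarded
            dummy_step_fn()
        # every rank executes this all-reduce exactly once per forward step,
        # so the extra collective stays matched across ranks as well
        flag = torch.tensor(0.0 if this_rank_done else 1.0, device=device)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        if flag.item() == 0.0:  # no rank has any work left
            return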

But I will totally understand if you don’t have any brilliant ideas on how to overcome this hurdle, and we will find some way around this.

Thank you @stas00! I have opened a new issue and tagged you here: https://github.com/huggingface/transformers/issues/16688

Thank you for the recipe, @samyam!

I think it’s safest to reuse the last valid input, to ensure that all forward passes of sub-layers get to run in case there is some condition on the data.

I don’t think that in our particular setup we could do the synchronization in the epoch-level loop.

I will save it for later, as for now we really want to make the training fast and efficient, and inference possible. We want inference so that we can quickly eval the outcome of training.