accelerate: The more GPUs I use, the slower the training speed.
I am trying to train the bert-base-uncased model on NVIDIA 3080 GPUs. Strangely, the time spent on one step grows sharply with the number of GPUs, and the total time with multiple GPUs ends up similar to a single GPU. I ran the sample code provided at this link directly and the problem still occurs. By the way, I have run the transformers `Trainer` with multiple GPUs on this machine, and there the time per step only increases a little with distributed training.
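For context, the training loop in Accelerate's examples follows this general pattern; the sketch below is not the exact script (which is only linked above), and it assumes `model`, `optimizer`, `train_dataloader`, and `gradient_accumulation_steps` are defined elsewhere:

```python
# Minimal sketch of an Accelerate training loop with manual gradient accumulation.
# Not the exact example script; model, optimizer, train_dataloader and
# gradient_accumulation_steps are assumed to be defined already.
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss / gradient_accumulation_steps
    accelerator.backward(loss)  # replaces loss.backward(); handles cross-process gradient sync
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```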
The CUDA version shown by nvidia-smi is 11.4 and the environment is:
- `transformers` version: 4.11.3
- Platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
- Python version: 3.7.6
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes, distributed (launched with `torch.distributed.launch`)
The relevant outputs on two GPUs are:
FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
cuda:0
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False
cuda:1
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False
..........................
10/28/2021 20:22:28 - INFO - __main__ - ***** Running training *****
10/28/2021 20:22:28 - INFO - __main__ - Num examples = 4627
10/28/2021 20:22:28 - INFO - __main__ - Num Epochs = 3
10/28/2021 20:22:28 - INFO - __main__ - Instantaneous batch size per device = 2
10/28/2021 20:22:28 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 32
10/28/2021 20:22:28 - INFO - __main__ - Gradient Accumulation steps = 8
10/28/2021 20:22:28 - INFO - __main__ - Total optimization steps = 435
0%|▏ | 1/435 [00:11<1:24:51, 11.73s/it]
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
32%|███████████████████████████████▌ | 140/435 [02:52<05:42, 1.16s/it]
The output on a single GPU:
10/28/2021 20:26:47 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Use FP16 precision: False
.......................
10/28/2021 20:27:49 - INFO - __main__ - ***** Running training *****
10/28/2021 20:27:49 - INFO - __main__ - Num examples = 4627
10/28/2021 20:27:49 - INFO - __main__ - Num Epochs = 3
10/28/2021 20:27:49 - INFO - __main__ - Instantaneous batch size per device = 2
10/28/2021 20:27:49 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
10/28/2021 20:27:49 - INFO - __main__ - Gradient Accumulation steps = 8
10/28/2021 20:27:49 - INFO - __main__ - Total optimization steps = 870
4%|███▉ | 35/870 [00:17<06:34, 2.12it/s]
The highlights are that the time per step increases sharply with distributed training, while the total training time is similar in the two settings.
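To make the comparison concrete, here is a rough back-of-the-envelope throughput estimate from the two progress bars, assuming each progress-bar iteration corresponds to one optimization step:

```python
# Rough throughput estimate from the progress bars above, assuming each
# tqdm iteration is one optimization step (per-device batch 2, accumulation 8).
per_device_batch, grad_accum = 2, 8

# Two GPUs: effective batch 2 * 2 * 8 = 32 samples per step at ~1.16 s/step
two_gpu_throughput = (per_device_batch * 2 * grad_accum) / 1.16   # ~27.6 samples/s

# One GPU: effective batch 2 * 1 * 8 = 16 samples per step at ~2.12 steps/s
one_gpu_throughput = (per_device_batch * 1 * grad_accum) * 2.12   # ~33.9 samples/s

print(f"{two_gpu_throughput:.1f} vs {one_gpu_throughput:.1f} samples/s")
```

If that reading of the progress bars is right, two GPUs are not processing samples any faster than one.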
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 15
I took your suggestion and the total training time shortened to half!
Are they correct? I am not sure, so I used the evaluation perplexity to check. I changed `gradient_accumulation_steps` to make sure the batch size is the same and set the random seed. But I find that the evaluation perplexities differ between a single GPU and two GPUs. Is this normal?

Is it sensibly different? The shuffling will be done the same between one or two GPUs, Accelerate makes sure of that, but I'm less sure of the random masking part (which is outside of Accelerate) and any other randomness that might occur.
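One way to pin down the data-side randomness is to seed everything before the datasets and dataloaders are built; a minimal sketch, assuming the random masking comes from a data collator created after this call (`accelerate.utils.set_seed` is used here, but `transformers.set_seed` behaves the same way):

```python
# Sketch: seed everything before building datasets, the data collator and
# dataloaders, so the random masking is reproducible from run to run.
from accelerate.utils import set_seed

set_seed(42)  # seeds random, numpy and torch (CPU and CUDA)

# ... build the tokenizer, datasets, data collator
# (e.g. DataCollatorForLanguageModeling) and dataloaders after this point
```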
Glad to see the training speed is better! I’ll work on something today and early next week to include something easy in Accelerate. Would love if you could test it when it’s ready!
I was mentioning that issue indeed! Thanks for digging it up 😃
Accelerate and the Trainer use exactly the same code behind the scenes (I wrote both 😅): `torch.distributed`, so there shouldn't be any differences. You mention DDP, so does that mean you tested your code with both Accelerate and vanilla DDP and got the same slowdown? The one difference I can think of is that the `Trainer` uses `find_unused_parameters=True` when defining the `DistributedDataParallel` model (by default), and I think PyTorch uses the opposite. Could you try? You can pass along a list of `kwargs_handlers` to the `Accelerator` object to do this with Accelerate (the DDP one here).
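A minimal sketch of that suggestion, assuming a recent version of Accelerate:

```python
# Sketch: have Accelerate build DistributedDataParallel with
# find_unused_parameters=True, matching the Trainer's default.
from accelerate import Accelerator, DistributedDataParallelKwargs

ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# Then prepare as usual:
# model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
```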