accelerate: The more GPUs I use, the slower the training speed.

I am trying to train the bert-base-uncased model on NVIDIA 3080 GPUs. The strange thing is that the time spent on one step grows sharply with the number of GPUs, while the total time with multiple GPUs is similar to a single GPU. I directly ran the sample code provided at this link and the problem still occurs. BTW, I have run transformers.Trainer with multiple GPUs on this machine, and there the time per step only increases a little in distributed training.

The CUDA version shown by nvidia-smi is 11.4 and the environment is:

  • transformers version: 4.11.3
  • Platform: Linux-5.11.0-38-generic-x86_64-with-debian-bullseye-sid
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

The relevant output on two GPUs is:

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
cuda:0
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Use FP16 precision: False

cuda:1
10/28/2021 20:21:55 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Use FP16 precision: False

..........................

10/28/2021 20:22:28 - INFO - __main__ - ***** Running training *****
10/28/2021 20:22:28 - INFO - __main__ -   Num examples = 4627
10/28/2021 20:22:28 - INFO - __main__ -   Num Epochs = 3
10/28/2021 20:22:28 - INFO - __main__ -   Instantaneous batch size per device = 2
10/28/2021 20:22:28 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 32
10/28/2021 20:22:28 - INFO - __main__ -   Gradient Accumulation steps = 8
10/28/2021 20:22:28 - INFO - __main__ -   Total optimization steps = 435
  0%|▏                                                                                                 | 1/435 [00:11<1:24:51, 11.73s/it]
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
10/28/2021 20:22:40 - INFO - root - Reducer buckets have been rebuilt in this iteration.
 32%|███████████████████████████████▌                                                                  | 140/435 [02:52<05:42,  1.16s/it]

The output on a single GPU is:

10/28/2021 20:26:47 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Use FP16 precision: False

.......................

10/28/2021 20:27:49 - INFO - __main__ - ***** Running training *****
10/28/2021 20:27:49 - INFO - __main__ -   Num examples = 4627
10/28/2021 20:27:49 - INFO - __main__ -   Num Epochs = 3
10/28/2021 20:27:49 - INFO - __main__ -   Instantaneous batch size per device = 2
10/28/2021 20:27:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
10/28/2021 20:27:49 - INFO - __main__ -   Gradient Accumulation steps = 8
10/28/2021 20:27:49 - INFO - __main__ -   Total optimization steps = 870
  4%|███▉                                                                                               | 35/870 [00:17<06:34,  2.12it/s]



The highlights are that the time per step increases sharply in distributed training and that the total time is similar in the two settings.
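Rough arithmetic from the two progress bars above (a back-of-the-envelope estimate; the per-step times are read off the logs and extrapolated to the full run):

    # Two GPUs: 435 optimization steps at ~1.16 s/step, 32 examples per step.
    # One GPU:  870 optimization steps at ~2.12 steps/s, 16 examples per step.
    two_gpu_total_s = 435 * 1.16    # ~505 s for the whole run
    one_gpu_total_s = 870 / 2.12    # ~410 s for the whole run
    print(two_gpu_total_s, one_gpu_total_s)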

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15

Most upvoted comments

I took your suggestion and found that the total training time dropped to half!

    for epoch in range(args.num_train_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
-            outputs = model(**batch)
-            loss = outputs.loss
-            loss = loss / args.gradient_accumulation_steps
-            accelerator.backward(loss)
+            if (step + 1) % args.gradient_accumulation_steps != 0:
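+                # no_sync() skips the DDP gradient all-reduce on accumulation-only
+                # micro-batches; gradients are synchronized once on the final one.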
+                with model.no_sync():
+                    outputs = model(**batch)
+                    loss = outputs.loss
+                    loss = loss / args.gradient_accumulation_steps
+                    accelerator.backward(loss)
+            else:
+                outputs = model(**batch)
+                loss = outputs.loss
+                loss = loss / args.gradient_accumulation_steps
+                accelerator.backward(loss)

            if (step + 1) % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
                progress_bar.update(1)
                completed_steps += 1

Is this correct? I am not sure, so I used the evaluation perplexity to check: I changed gradient_accumulation_steps so that the batch size stays the same and set the random seed. But the evaluation perplexities are different between a single GPU and two GPUs. Is this normal?

Is it noticeably different? The shuffling will be done the same way with one or two GPUs, Accelerate makes sure of that, but I'm less sure about the random masking part (which is outside of Accelerate) and any other randomness that might occur.
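For what it's worth, a minimal sketch of pinning down that extra randomness, assuming the script takes a seed argument, using the set_seed helper from transformers (which seeds Python, NumPy and PyTorch, including CUDA):

    from transformers import set_seed

    # Seed every process so that collator-level randomness such as MLM masking
    # is reproducible. Note this still does not guarantee identical masking
    # between a 1-GPU and a 2-GPU run, since each process consumes the random
    # stream differently.
    set_seed(args.seed)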

Glad to see the training speed is better! I'll work on something today and early next week to include an easy way to do this in Accelerate. Would love it if you could test it when it's ready!

I was mentioning that issue indeed! Thanks for digging it up 😃

Accelerate and the Trainer use exactly the same code behind the scenes (I wrote both 😅): torch.distributed, so there shouldn't be any differences. You mention DDP, so does that mean you tested your code with both Accelerate and vanilla DDP and got the same slowdown? The one difference I can think of is that the Trainer uses find_unused_parameters=True when defining the DistributedDataParallel model (by default), and I think PyTorch defaults to the opposite. Could you try that?
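A minimal sketch of that experiment with vanilla DDP, assuming model and local_rank are already set up as in the example script (only the flag is being varied):

    from torch.nn.parallel import DistributedDataParallel as DDP

    # Wrap the model the way the Trainer does by default (flag on), then repeat
    # the timing with find_unused_parameters=False (PyTorch's default) to see
    # whether the flag explains the per-step gap.
    ddp_model = DDP(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,
    )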

You can pass along a list of kwargs_handlers to the Accelerator object to do this with Accelerate (the DDP one here).
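A minimal sketch of that, assuming a recent enough accelerate release where DistributedDataParallelKwargs is exposed at the top level:

    from accelerate import Accelerator, DistributedDataParallelKwargs

    # Forward find_unused_parameters to the underlying DistributedDataParallel
    # wrapper, mirroring the Trainer's default.
    ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
    accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])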