transformers: Excessive GPU-GPU communication with GPT2 making multi-GPU training slow?
Summary: on a multi-GPU system, training GPT2 seems to scale poorly unless a very fast GPU-GPU interconnect like NVLink is available. In particular, without NVLink using two GPUs is slower than using just one GPU.
Environment info
- `transformers` version: 4.1.1
- Platform: Linux-5.8.0-rc7-custom-x86_64-with-glibc2.29
- Python version: 3.8.5
- PyTorch version (GPU?): 1.8.0.dev20201214+cu110 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No?
- Hardware: 2 x NVIDIA RTX 3090 w/NVLink
Who can help
Maybe @LysandreJik or @patrickvonplaten ?
Information
Model I am using (Bert, XLNet …): GPT2
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The script is a pretty basic example of training a medium-size GPT2 from scratch; it's available here: https://panda.moyix.net/~moyix/train_csrc.py
The dataset and tokenized vocab:
- Dataset: https://panda.moyix.net/~moyix/plainsrc_all.txt.gz (718M, gzipped)
- Vocab: https://panda.moyix.net/~moyix/csrc_vocab.tar.gz
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
Training a GPT2 language model on C source code.
To reproduce
Run with only one GPU: `CUDA_VISIBLE_DEVICES=0 python train_csrc.py`
Run with two GPUs, NVLink disabled: `NCCL_P2P_DISABLE=1 python train_csrc.py`
Run with two GPUs and NVLink enabled: `python train_csrc.py`
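For reference, `train_csrc.py` is essentially a standard `Trainer` setup. The sketch below is only an approximation of that kind of script (the paths, model size, and hyperparameters are illustrative, not copied from the actual script):

```python
# Rough sketch of training a GPT-2 from scratch with the Trainer API.
# All paths and hyperparameters here are illustrative placeholders.
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
                          TextDataset, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("./csrc_vocab")    # assumed extracted vocab dir

config = GPT2Config(vocab_size=tokenizer.vocab_size, n_positions=1024)
model = GPT2LMHeadModel(config)                                  # trained from scratch

train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="plainsrc_all.txt",        # the (un-gzipped) dataset
                            block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="./gpt2-csrc",
                         per_device_train_batch_size=4,          # illustrative
                         num_train_epochs=1,
                         logging_steps=100)

# With multiple visible GPUs and no distributed launcher, Trainer wraps the model
# in torch.nn.DataParallel, which is where the per-step GPU-GPU traffic comes from.
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```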
Here is some benchmarking I did with my dataset on transformers 3.3.1 and 4.1.1 (note the difference in ETA is just because 3.3.1 only seems to report the ETA for the current epoch):
Version | NVLINK | GPUs | ETA | Perf |
---|---|---|---|---|
4.1.1 | Yes | 2GPU | 419:52:28 | 1.94it/s |
4.1.1 | No | 2GPU | 1025:06:27 | 1.26s/it |
4.1.1 | N/A | 1GPU | 599:14:57 | 2.72it/s |
3.3.1 | Yes | 2GPU | 83:46:51 | 1.94it/s |
3.3.1 | No | 2GPU | 204:54:22 | 1.26s/it |
3.3.1 | N/A | 1GPU | 119:02:34 | 2.73it/s |
You can see that using two GPUs is actually slower than using a single GPU, unless NVLink is available (599 hours for 1 GPU vs 1025 hours for two GPUs). So presumably there is a large amount of GPU-GPU communication going on?
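A back-of-the-envelope reading of those numbers (this assumes the reported it/s is the per-process step rate and that each step covers twice the samples when two GPUs are used, which is consistent with the ETAs):

```python
# Rough scaling-efficiency estimate, using the figures from the table above.
it_per_s_1gpu = 2.72
it_per_s_2gpu_nvlink = 1.94
it_per_s_2gpu_no_nvlink = 1 / 1.26          # reported as 1.26 s/it

ideal = 2 * it_per_s_1gpu                   # perfect linear scaling target
for name, rate in [("w/ NVLink", it_per_s_2gpu_nvlink),
                   ("w/o NVLink", it_per_s_2gpu_no_nvlink)]:
    effective = 2 * rate                    # each step covers 2x the samples on 2 GPUs
    print(f"2 GPUs {name}: {effective / ideal:.0%} scaling efficiency")

# Prints roughly 71% w/ NVLink and 29% w/o, which lines up with the ETAs
# (419 vs 599 vs 1025 hours) and points at per-step GPU-GPU traffic as the culprit.
```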
Expected behavior
Scaling should be roughly linear with the number of GPUs. Unfortunately I am not very familiar with the implementation details of GPT2 in Huggingface, but others report roughly linear scaling with Transformer models like BERT so it should work here as well: https://towardsdatascience.com/training-bert-at-a-university-eedcf940c754
Although I have a system with NVLink at home, this issue is still affecting me because I would like to be able to run this on the university HPC cluster, where most nodes do not have NVLink.
About this issue
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 27 (15 by maintainers)
ok, a quick hack to add ratios relative to 1gpu, so now it's easier to see the comparison. So I added a new `runtime ratio` column, and each group of 4 rows gets recalculated w.r.t. its first runtime entry (the 1gpu run).
edit: someone asked to explain the ratio, and why the runtime is faster for DDP even though samples per second is smaller.
Here is a puzzle to solve:
Will one cake eater finish the cake faster than two of them?
(the answer is after the table, so you don’t see it right away)
and the answer to the puzzle posted at the beginning of this comment: the 2 cake eaters will finish the cake faster together despite the slowdown, because each of them only has half a cake to eat!
Same here: each GPU in the DDP assembly runs slower because of the gradient syncing, but since each only has to consume half the samples, the assembly as a whole trains faster.
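In numbers (made-up throughput figures, purely to illustrate the cake arithmetic):

```python
# Illustrative numbers only: per-GPU throughput drops under DDP because of gradient
# syncing, but each GPU only has to consume half the samples, so wall time still drops.
total_samples = 1_000_000
single_gpu_rate = 100       # samples/sec on 1 GPU (made up)
ddp_per_gpu_rate = 70       # samples/sec per GPU under DDP, slower due to syncing (made up)

time_1gpu = total_samples / single_gpu_rate             # 10,000 s
time_2gpu = (total_samples / 2) / ddp_per_gpu_rate      # ~7,143 s

print(f"1 GPU : {time_1gpu:,.0f} s")
print(f"2 GPUs: {time_2gpu:,.0f} s  <- faster overall despite the lower per-GPU rate")
```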
Further, this benchmark is just for 2 GPUs.
Going from 1 GPU to 2 GPUs, you introduce the synchronization overhead, so you lose some per-GPU performance in exchange for splitting the work.
When you go from 2 GPUs to 4 GPUs (on the same node), it's pure performance doubling, i.e. 4 GPUs will be disproportionately faster than 2 GPUs relative to the 1-GPU-to-2-GPU jump.
So the additional GPUs add this overhead, but they also parallelize the processing, so the overhead becomes almost negligible as the number of GPUs grows.
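A tiny toy model of that argument (all numbers invented; it assumes the per-step sync cost stays roughly flat once you are doing DDP on a single node at all):

```python
# Toy model: step time = compute + (sync overhead if more than 1 GPU).
compute_per_step = 1.00      # seconds of forward/backward per step (invented)
sync_per_step = 0.40         # seconds of gradient syncing per step when n_gpus > 1 (invented)
steps_for_1gpu = 10_000      # steps a single GPU needs to cover the dataset

baseline = steps_for_1gpu * compute_per_step
for n_gpus in (1, 2, 4):
    step_time = compute_per_step + (sync_per_step if n_gpus > 1 else 0.0)
    wall_time = (steps_for_1gpu / n_gpus) * step_time
    print(f"{n_gpus} GPU(s): {wall_time:7.0f} s, speedup {baseline / wall_time:.2f}x")

# 1 -> 2 GPUs: less than 2x, because the sync overhead appears.
# 2 -> 4 GPUs: exactly 2x in this model, because that overhead is already being paid.
```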
The next problem is once you outgrow a single node. Then the issue is the inter-node interconnect, which can be blazing fast (InfiniBand) or super slow (an Ethernet hub). So to scale from 8 GPUs to 10 (with 8-GPU nodes), you first lose performance, because the inter-node connection is now the slow component that drags everything down. But as you add more nodes, that overhead again becomes less and less significant.
Of course, when working with multiple nodes one often uses parallelization techniques other than DDP, i.e. PP or TP (https://huggingface.co/transformers/parallelism.html#concepts), and there one generally performs TP only inside a node, and PP and DP across nodes.
It’d be amazing if someone re-did this table for 1, 2, 4 gpus, then 1, 2, 4 nodes.
OK, now we have some extensive benchmarks for the RTX8000 machine. This machine does not have NVLink, but it apparently can do P2P GPU-GPU communication via the PCI bus. However, this seems to be quite slow: slower, in fact, than disabling P2P altogether.
Here's `nvidia-smi topo -m`:
I used the script from before (slightly expanded) and set `max-steps` to 800 for the single-GPU case, 400 for two GPUs, and 200 for 4 GPUs. Here are the benchmarks (long!):
I managed to get some time on a node with 4x V100s. For the Large model, it gets 3.83s/it with an ETA of 1248:01:43 (!).
Here’s the output of p2pBandwidthLatencyTest on the V100 system:
And for comparison, here’s the dual 3090 w/NVLINK system:
OK, I got around to spending some more time with this today. I realized that the `run_language_modeling.py` script can do everything my script was doing, and it uses DDP by default. (Note: looking at the most recent version on git, I see that `run_language_modeling.py` has been replaced by `run_clm.py`. However, after trying to upgrade transformers to that version, it seems to no longer use the GPU, for reasons I don't have time to debug.)
So now I'm just using that, with:
For single GPU I drop the `torch.distributed.launch` and use `CUDA_VISIBLE_DEVICES=1`; to disable NVLink I use `NCCL_P2P_DISABLE=1` as before. The `--block_size 128` argument is to match the default from my training script (without it I run out of GPU RAM).
Results:
So the general observation is that for block size 512, two GPUs without NVLink are about the same performance as a single GPU. For block size 128, two GPUs without NVLink are typically quite a bit slower than a single GPU.
It doesn’t seem like DistributedDataParallel helps with this issue, in other words.
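A hedged back-of-the-envelope for why a smaller `block_size` hurts relatively more: the gradient that has to be synchronized each step depends only on the parameter count, while the compute per step shrinks with the sequence length (the parameter count and bandwidth figures below are rough assumptions, not measurements):

```python
# Per-step communication is (roughly) a full gradient's worth of data, regardless of
# block_size; shrinking block_size shrinks the compute per step but not this cost.
n_params = 350e6                # assumed "medium" GPT-2-scale model
grad_bytes = n_params * 4       # fp32 gradients

links = {"PCIe (no P2P/NVLink)": 8e9,     # ~8 GB/s effective, rough assumption
         "NVLink": 50e9}                  # ~50 GB/s effective, rough assumption

for name, bw in links.items():
    print(f"{name}: ~{grad_bytes / bw:.2f} s per step just for gradient traffic")

# At block_size 128 the compute per step is small, so the fixed ~0.2 s of PCIe traffic
# per step is hard to hide; at block_size 512 there is more compute to overlap it with.
```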
According to this table NV4 means “Connection traversing a bonded set of 4 NVLinks”.
There are some more details in the GA102 whitepaper:
OK, so here is my benchmark with the same tool.
edit: my initial benchmark had a bug in it, as pointed out by @sgugger: one has to tweak `--max_steps` when changing the number of GPUs. I'm proposing to change that and to have a way to keep the dataset truncation fixed regardless of the number of GPUs used: https://github.com/huggingface/transformers/issues/9801. So for 1 GPU, I had to double `--max_steps` to get the same number of items. The rest of this comment has been updated to reflect the corrected state.
Hardware: 2x TITAN RTX, 24GB each + NVLink
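(Back to the `--max_steps` point: every optimization step consumes a batch per GPU, whether via DataParallel or DDP, so the step count has to scale inversely with the GPU count to keep the amount of data seen constant. The batch size below is illustrative.)

```python
# Each step consumes per_device_batch_size samples on every GPU, so max_steps must
# shrink as GPUs are added to keep the total number of samples seen the same.
per_device_batch_size = 4                        # illustrative value

for n_gpus, max_steps in [(1, 800), (2, 400), (4, 200)]:
    samples = max_steps * n_gpus * per_device_batch_size
    print(f"{n_gpus} GPU(s), max_steps={max_steps}: {samples} samples")

# All three configurations see the same 3,200 samples with this batch size.
```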
I get the same bus report w/ and w/o `NCCL_P2P_DISABLE=1` - I don't think `nvidia-smi` respects this env var:
But clearly the runtime is much slower w/o the NVLink, as the benchmark shows, so pytorch/cuda does respect it.
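That matches how the knob works: `NCCL_P2P_DISABLE` only tells NCCL not to use the P2P transport at runtime, it doesn't change the hardware topology that `nvidia-smi` reports. A quick sanity-check sketch for confirming the hardware capability from PyTorch:

```python
# Reports whether CUDA peer-to-peer access is possible between each pair of GPUs.
# This reflects hardware/driver capability and is unaffected by NCCL_P2P_DISABLE.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'possible' if ok else 'NOT possible'}")
```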
Analysis:
Here is the full benchmark code and outputs:
I think @sgugger has experience with multi-GPU and works on the example scripts, so pinging him!