pytorch-lightning: bfloat16 running 4x slower than fp32 (conv)

🐛 Bug

I’m training a hybrid ResNet-18 + Conformer model on A100 GPUs. I’ve trained with both fp16 and fp32 precision and things work as expected: fp16 uses less memory and runs faster than fp32. However, when I switch to bf16 via Trainer(precision='bf16'), training is dramatically slower than with fp32: from ~1.3 s/it to ~5.5 s/it. The only difference between the runs is the precision argument ('16' for fp16, '32' for fp32, 'bf16' for bfloat16). The bf16 slowdown occurs whether I use 1 GPU, 8 GPUs (single node), or 32 GPUs (multiple nodes).
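
For reference, the only change between runs is the precision argument passed to the Trainer. A minimal sketch (the other Trainer arguments here are illustrative, not my exact configuration):

from pytorch_lightning import Trainer

# Identical runs except for the precision argument: 16 (fp16), 32 (fp32), or "bf16"
trainer = Trainer(gpus=1, precision="bf16")
# trainer.fit(model)  # 'model' is the hybrid ResNet-18 + Conformer LightningModule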

The docs state

BFloat16 is also experimental and may not provide significant speedups or memory improvements, offering better numerical stability.

But shouldn’t it work faster than fp32 on A100 GPUs? Any help would be hugely appreciated.

Expected behavior

bf16 should run faster than fp32, and probably somewhat slower than fp16.

Environment

  • CUDA:
    • GPU:
      • A100-SXM4-40GB
    • available: True
    • version: 11.3
  • Packages:
    • numpy: 1.21.2
    • pyTorch_debug: False
    • pyTorch_version: 1.10.2
    • pytorch-lightning: 1.5.9
    • tqdm: 4.62.3
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.12

Additional context

Although speed is dramatically lower with bf16 than with fp32, the GPU memory allocation is lower (~55% vs ~95%), as expected.

cc @carmocca @justusschock @awaelchli @akihironitta @rohitgr7 @borda


Most upvoted comments

I ran the imagenet example using TORCH_CUDNN_V8_API_ENABLED=1 with PyTorch 1.12 as suggested in the forum post linked above, and recorded new values. The benchmark was run on a single A100 for 2500 iterations.

This fixes the performance regression I was seeing with imagenet bfloat16, and probably will with conformer as well!

Until the cuDNN v8 API becomes the default, I suggest using the environment variable to get the expected performance. bf16 is still slightly slower than fp16, but that might just come down to the underlying cuDNN optimizations.
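
For anyone who wants to try it, the flag can be set in the shell or in Python before torch is imported (a sketch; the script name is a placeholder):

# Option 1: set it in the shell when launching the training script
#   TORCH_CUDNN_V8_API_ENABLED=1 python train.py
# Option 2: set it in Python, before importing torch, so it is picked up early
import os
os.environ["TORCH_CUDNN_V8_API_ENABLED"] = "1"

import torch  # imported after the variable is set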

Pure PyTorch imagenet benchmark

I made 4 variants of the script: fp32, fp16 AMP with a GradScaler, bf16 autocast, and bf16 autocast with the model weights also cast via model.bfloat16(). The mixed-precision variants look roughly like the sketch below.
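
This is only a minimal illustration of the three mixed-precision variants; the model, optimizer, and data here are placeholders, not the imagenet benchmark code:

import torch
import torch.nn as nn

# Placeholder workload (an assumption, not the benchmark script)
model = nn.Conv2d(3, 8, 3).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
images = torch.randn(4, 3, 32, 32, device="cuda")

# "16": fp16 autocast + GradScaler
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(images).float().sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

# "bf16": bfloat16 autocast, no GradScaler needed
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(images).float().sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# "bf16 + model.bfloat16()": additionally cast the weights themselves
model_bf16 = model.bfloat16()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model_bf16(images).float().sum()
loss.backward()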

Precision                  Seconds per iteration (s)   nvidia-smi memory (MiB)
32                         0.112                       12054
16                         0.096                       6553
bf16                       0.101                       7082
bf16 + model.bfloat16()    0.101                       6183

PyTorch Lightning Imagenet benchmark

Precision                  Seconds per iteration (s)   nvidia-smi memory (MiB)
32                         0.115                       12218
16                         0.094                       7480
bf16                       0.101                       7310
bf16 + model.bfloat16()    0.100                       6634

Throughput could obviously be much better, since we’re not compute-bound at all, but this shows that the performance regression is considerably smaller.

Closing this issue for now. Can be re-opened if TORCH_CUDNN_V8_API_ENABLED=1 with PyTorch 1.12 does not fix the issue.

Here are my results from running a PyTorch Lightning minGPT script (here). This is a transformer model, run with the command below:

python min_gpt.py --accelerator gpu --devices 1 --precision X

Precision                  Iterations/s   nvidia-smi memory (MiB)
16                         2.26           30038
bf16                       2.24           30038
bf16 + model.bfloat16()    2.40           24550

I also tried torch.set_float32_matmul_precision('medium'), which gave roughly the same results.
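
For reference, that setting (available since PyTorch 1.12) is a one-liner; it controls the precision used for float32 matmuls:

import torch

# Allow float32 matmuls to use faster, lower-precision tensor-core math
torch.set_float32_matmul_precision("medium")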

We can see that performance is roughly the same when using AMP, but converting the model itself gives a small speedup and a significant memory drop, as expected.

It seems for transformer architectures BFloat16 works as expected.
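
For completeness, this is roughly how a "bf16 + model.bfloat16()" run can be set up; a minimal, self-contained sketch with a toy LightningModule standing in for minGPT (all names and hyperparameters here are placeholders, not the benchmark code):

import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Toy transformer LightningModule, only for illustration
class TinyTransformer(pl.LightningModule):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(64, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.head(self.encoder(x)).mean(dim=1)
        return nn.functional.cross_entropy(logits.float(), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=3e-4)

data = TensorDataset(torch.randn(256, 16, 64), torch.randint(0, 10, (256,)))
model = TinyTransformer().bfloat16()  # cast weights/buffers to bfloat16 up front
trainer = pl.Trainer(accelerator="gpu", devices=1, precision="bf16", max_epochs=1)
trainer.fit(model, DataLoader(data, batch_size=32))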

I found this forum post and this comment describing slowdowns when using Conv2d + Bfloat16.

The model in the OP’s post is conv-based, and the referenced issue could explain why we see a 4x slowdown.

Next steps are to re-run the imagenet comparison, and optionally the xformers example for @blefaudeux.

Hi, I was finally able to find some time to benchmark the possible precisions (apologies for the delay).

Precision   PyTorch (iters/s)   PyTorch Lightning (iters/s)
float16     3.92                4.27
float32     3.77                4.22
bfloat16    4.08                2.5

Note:

  1. Tested on: ImageNet dataset
  2. Scripts used:
  3. Tested over ~2500 iterations; numbers are averaged.
  4. Environment details:
    • PyTorch: 1.11.0
    • PyTorch Lightning: 1.6.0
    • CUDA Version: 11.3 (with PyTorch); (NVIDIA GeForce RTX 3090)
    • DDP: No (trained on a single device/gpu)

Analysis:

  • From the numbers, it looks like PyTorch Lightning is around 1.6 times slower than pure PyTorch for bfloat16 precision, while for the other precisions there doesn’t seem to be a big difference. PL is a little faster there, but I assume the numbers would be about equal if tested for more than one epoch.
  • Also, I agree that bf16 is running slower than fp32 in PyTorch Lightning, which is not expected.

We’ll try investigating why performance is slower for bf16 in PyTorch Lightning and will get back to this issue once we have some numbers. 😃 Thank you for your patience and for reporting this issue to us!

cc: @carmocca @SeanNaren @Borda for visibility! 😃

ping @SeanNaren @ananthsub, any news on that?

ok, synced offline and seems that it’s not a priority 😦

I’m going to revisit this. I do wonder if it’s a PyTorch thing or a PyTorch Lightning thing. If anyone has any related issues they could point me to would help! If not, going to redo the benchmarks with PyTorch 1.12, and try a different model.

That’s great, thanks so much @SeanNaren! The memory benefits are huge; even at a similar speed it’s still very useful. Cc @matthieutphr @quentindrx

Hello, I’m hitting the same problem with PL 1.6.4 on an A100 with CUDA 11.6 and PyTorch 1.12.0.

Working with @krshrimali to try to investigate!

I excluded Lightning and used the PyTorch imagenet example to gather the results below on an A100, running for more than 2500 iterations.

Precision        Seconds per iteration (s)
FP32             0.103
FP32 (no TF32)   0.160
FP16             0.079
BFloat16         0.415

I also tried the latest NGC container, which improved results marginally but still showed a massive slowdown for BFloat16. I also tried resnet101 and still saw the slowdown.

I also tried adding torch.cuda.synchronize() before measurements were taken, and it made a negligible difference.
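
For context, this is a typical way to time GPU iterations with explicit synchronization; a sketch with a placeholder conv workload, not the actual benchmark loop:

import time
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3).cuda()               # placeholder workload
x = torch.randn(64, 3, 224, 224, device="cuda")

torch.cuda.synchronize()                         # finish any queued work first
start = time.time()
for _ in range(100):
    model(x)
torch.cuda.synchronize()                         # wait for all kernels to complete
print(f"{(time.time() - start) / 100:.4f} s/iter")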

I think it will be worth checking other model types with a minimal script, to see whether we get the same slowdown without PyTorch Lightning, before investigating further.

Note: "no TF32" means adding this:

# The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
torch.backends.cuda.matmul.allow_tf32 = False

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = False