accelerate: Could there be a bug in mixed precision?
When I use torch 1.6.0 and accelerate 0.3.0 and set mixed precision to yes in the accelerate config, nothing happens (training still runs in full precision). If I set Accelerator(fp16=True) in the code, then AMP is triggered, but the loss becomes inf right away.
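For reference, a minimal sketch of the setup described above, assuming a toy model and dataset (the real model, optimizer, and data are placeholders, and a CUDA GPU is assumed):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins so the loop runs end to end; the actual training code is linked below.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,))), batch_size=8
)

accelerator = Accelerator(fp16=True)  # same flag as in the report; the config-file setting alone had no effect
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # Accelerate is expected to apply loss scaling here when fp16 is on
    optimizer.step()
```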
But if I use the PyTorch way (i.e. use autocast in the code myself), training is normal and AMP is enabled.
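A matching sketch of that manual approach with the torch 1.6.0 torch.cuda.amp API, on the same kind of toy setup (names and shapes are again placeholders):

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,))), batch_size=8
)
scaler = GradScaler()

for inputs, targets in dataloader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    with autocast():  # forward pass runs in mixed precision
        loss = nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()  # the loss is scaled exactly once, here
    scaler.step(optimizer)
    scaler.update()
```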
So I wonder if there is a possible bug in accelerate.
My environment is a single 2080 Ti on a local machine. The code with this problem is here.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 24
I was able to investigate this more and I think I found the problem. The PR above should fix the issue; would you mind giving it a try?
Thanks for the analysis and the example you provided. I’ll try to dig more into the differences tomorrow.
Hi @sgugger, thanks for the quick reply! Unfortunately I didn't have time to build a proper minimal working example yet, but I managed to adapt the CV example to a segmentation task while minimizing changes to the code; here it is. In my case this unfortunately reproduces the problem.
I apologize for the use of the custom dataset and decoder; however, if you check the code there's nothing particularly weird about them, just standard PyTorch stuff. The dataset is also nothing out of the ordinary, as you can see here.
I tested the "standard" XEntropy with both reduction="sum" and reduction="mean": in the first case I get inf losses, in the latter nan (it converges as expected without fp16). Reading around, I suspect this has little to do with accelerate and is rather linked to underflow and log transformations in the loss (?). I'll try to adapt the same script to manual AMP and see if the same issue arises; otherwise I'll see what I can do to make it self-contained, so that it can be launched without too many configuration troubles.
Cheers!
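As a side note on the fp16 range concern raised above, here is a toy illustration (made-up shapes, not the original segmentation model) of how a summed per-pixel loss can exceed what half precision can represent, while the mean stays finite:

```python
import torch
from torch import nn

# Made-up segmentation-like shapes; the loss is forced into fp16 here just to make
# the range limit visible (autocast's own casting rules may keep the loss in fp32).
logits = torch.randn(4, 21, 256, 256, device="cuda", dtype=torch.float16)
targets = torch.randint(0, 21, (4, 256, 256), device="cuda")

loss_sum = nn.functional.cross_entropy(logits, targets, reduction="sum")
loss_mean = nn.functional.cross_entropy(logits, targets, reduction="mean")

# Roughly 260k per-pixel losses of ~3 nats each sum to ~8e5, far beyond the fp16
# maximum (~65504), so the sum prints as inf while the mean stays around 3.
print(loss_sum, torch.isfinite(loss_sum).item())
print(loss_mean, torch.isfinite(loss_mean).item())
```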
If you do get a simple reproducer, I’m happy to investigate more. I have just not been able to reproduce this error on my side.
@edornd My solution for now is switching back to torch amp.
Hi! I’m also experiencing this weird NaN issue on the loss with mixed precision activated. With standard “full precision” GPU training everything is fine and converges as expected, using a custom model and nn.CrossEntropyLoss.
Once I turn on FP16 (it doesn’t matter whether from the config or from the arguments), nothing crashes, but the loss stays nan. Training also becomes noticeably faster, but I kind of expected that given the AMP setting.
I’m writing to ask whether you have found a solution in the meantime. I don’t have a minimal working example either, but I’m working on uploading the code to a repository if required.
If it is of any help, I’m running on a linux machine with this package configuration:
You are not letting Accelerate handle mixed precision here; you are doing it yourself in your script: when the is_mixed_precision flag is True, you are also scaling the loss, which means it will be scaled twice.
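To make the effect concrete, here is a hypothetical reconstruction with two plain GradScaler objects (this is not Accelerate's internal code, just an illustration of how two independent scalings compound):

```python
import torch
from torch.cuda.amp import GradScaler

# Each scaler starts at a scale factor of 2**16 (65536 by default), so scaling the
# loss in the script AND inside the library multiplies it by roughly 2**32 before
# backward(), far outside the range gradient scaling is meant to operate in.
loss = torch.tensor(3.0, device="cuda")
library_scaler, script_scaler = GradScaler(), GradScaler()

scaled_once = library_scaler.scale(loss)         # stands in for the framework-side scaling
scaled_twice = script_scaler.scale(scaled_once)  # the extra scaling added in the script
print(scaled_once.item(), scaled_twice.item())   # ~3 * 65536 vs ~3 * 65536**2
```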