accelerate: v0.25 breaks my test cases - totally different training losses than v0.24
System Info
- `Accelerate` version: 0.24.1
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 51.00 GB
- GPU type: Tesla T4
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Accelerate v0.24.1: https://colab.research.google.com/drive/1VD4PXDhcx1t5dM2hYewyFK891GNfcELI?usp=sharing
Accelerate v0.25.0: https://colab.research.google.com/drive/1-5MdIOp0cM0scC-CdRZhh8OYhnGHqct4?usp=sharing
Expected behavior
During testing for https://github.com/unslothai/unsloth, the training losses for the first 7 steps on v0.24.1 should be:
| Step | Loss |
|---|---|
| 1 | 1.051200 |
| 2 | 1.317400 |
| 3 | 1.874600 |
| 4 | 1.451000 |
| 5 | 1.611200 |
| 6 | 1.385900 |
| 7 | 1.443000 |
But then on v0.25:
| Step | Loss |
|---|---|
| 1 | 1.691600 |
| 2 | 1.662400 |
| 3 | 1.790300 |
| 4 | 1.853500 |
| 5 | 1.421400 |
| 6 | 1.478500 |
| 7 | 1.437300 |
The Colab notebooks use seed 3407. I tried it on a Tesla T4 and an A100, and the difference in training losses persists.
I can't say exactly what's going on, but I narrowed it down to accelerate's new version:
- I checked whether bitsandbytes, transformers, datasets, peft, trl, and even Tim Dettmers' dataset changed - none did.
- Accelerate was updated 20 hours ago, and the Colab notebooks show that if you install the old version, the losses match again.
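For completeness, here is a minimal sketch (not taken from the notebooks themselves) of how the installed versions can be dumped in the Colab runtime to confirm that only `accelerate` differs between the two runs:

```python
# Minimal sketch: print the installed versions of the packages checked above,
# to confirm that only accelerate changed between the two Colab runs.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("accelerate", "bitsandbytes", "transformers", "datasets", "peft", "trl"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```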
Sorry @muellerzr, I didn't report back on the final loss - but to insert myself into the conversation between you and @Qubitium: since @Qubitium tested over multiple epochs, and the old behavior was not reshuffling across epochs (if I'm not mistaken), could that possibly be why 0.24's train loss is lower than 0.25's?
I'm not sure about the eval loss though. Sadly I don't have a beefy GPU, so running eval will be quite painful.
I can test the new PR as well!
@muellerzr Ohh noo, I doubt it's a performance degradation - you're comparing the final-epoch error, whilst I'm comparing the per-step error.
If you can in fact show that at the end of each epoch the F1 accuracy is the same or better, I'll trust your changes and just update my comparisons 😃
I normally compare step by step, since after every change I make to speed up Llama training I have to verify that the errors match - but I don't want to wait 5 hours for a final error, so I just compare the first 60 steps, for example.
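Roughly, the per-step check looks like this (a hypothetical sketch; `losses_a` and `losses_b` stand for the logged training losses of two runs and are not names from my actual test code):

```python
# Hypothetical sketch of the per-step comparison: given two lists of logged
# training losses, check that the first N steps agree within a tolerance.
import math

def losses_match(losses_a, losses_b, n_steps=60, tol=1e-4):
    """Return True if the first n_steps losses agree within an absolute tolerance."""
    for step, (a, b) in enumerate(zip(losses_a[:n_steps], losses_b[:n_steps]), start=1):
        if not math.isclose(a, b, abs_tol=tol):
            print(f"Step {step}: {a:.6f} vs {b:.6f} differ")
            return False
    return True

# With the two runs tabulated above, the very first step already differs.
print(losses_match([1.0512, 1.3174], [1.6916, 1.6624]))
```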
Then that’s how it should be and things are running properly.
What if you used <0.24.0? We hit a bug where the data was never being shuffled, so you were overfitting on the same samples being shown in the same order every epoch.
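As an illustration of what correct per-epoch reshuffling looks like (plain PyTorch's `DistributedSampler`, not Accelerate's internal code):

```python
# Illustration (plain PyTorch, not Accelerate's internals): a sampler only
# reshuffles across epochs if it is told which epoch it is on. Without the
# set_epoch() call, every epoch yields the samples in the same order, which
# is the overfitting pattern described above.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
# num_replicas/rank are given explicitly so no process group is needed for the demo.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True, seed=3407)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # comment this out and both epochs print the same order
    print(epoch, [batch[0].tolist() for batch in loader])
```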
@muellerzr Oh ok, interesting - sorry, I didn't get around to testing it (kinda forgot on my end, so many apologies!!). Happy New Year!
Hi @danielhanchen @Qubitium, after some thorough testing we've found that if you set the seed using `accelerate.utils.set_seed`, the difference in results should be negligible. I saw a difference of ~0.02% accuracy when training from scratch on the CV example. However, since this is needed, we're going to revert this behavior and keep the old one as the default - but at the end of the day this is just a different sampling technique with a pinch more setup.

Yes, this is the final train loss on 1 GPU over 6 epochs.
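For reference, the seeding helper mentioned above is part of Accelerate's public utilities; a minimal sketch (3407 is the seed from the notebooks above, the rest is illustrative):

```python
# Minimal sketch: seed Python, NumPy and PyTorch via Accelerate's helper before
# building the model and dataloaders, so per-step losses are comparable.
from accelerate.utils import set_seed

set_seed(3407)  # the seed used in the Colab notebooks above
```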
We also observed a huge difference in train/loss between 0.24 and 0.25.
Llama 2 13B native bf16 finetune using SFT and FlashAttention 2, with all training params equal other than the accelerate package:
@muellerzr I’ll do just that! I’ll report back to compare the entire training loss curves!
No worries at all! I also don't like that we radically shifted the performance of the library/framework though; it should be configurable in the end. Just so we can make sure, if you'd be willing to run your script again and verify that the performance does increase on your seed, that would be great. If it is a degradation instead, all the more reason. (And real data is better than something like our little NLP example 😃)
@muellerzr Waitt, so you mean v0.22, 0.23, and 0.24 were all handling the sampling incorrectly? Oh my! Great catch finding this!!!