accelerate: v0.25 breaks my test cases - totally different training losses than v0.24
System Info
- `Accelerate` version: 0.24.1
- Platform: Linux-5.15.120+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Numpy version: 1.23.5
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 51.00 GB
- GPU type: Tesla T4
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Accelerate v0.24.1: https://colab.research.google.com/drive/1VD4PXDhcx1t5dM2hYewyFK891GNfcELI?usp=sharing
Accelerate v0.25.0: https://colab.research.google.com/drive/1-5MdIOp0cM0scC-CdRZhh8OYhnGHqct4?usp=sharing
Expected behavior
During testing for https://github.com/unslothai/unsloth, the training losses for the first 7 steps on v0.24.1 should be:
| Step | Loss |
|---|---|
| 1 | 1.051200 |
| 2 | 1.317400 |
| 3 | 1.874600 |
| 4 | 1.451000 |
| 5 | 1.611200 |
| 6 | 1.385900 |
| 7 | 1.443000 |
But then on v0.25:
| Step | Loss |
|---|---|
| 1 | 1.691600 |
| 2 | 1.662400 |
| 3 | 1.790300 |
| 4 | 1.853500 |
| 5 | 1.421400 |
| 6 | 1.478500 |
| 7 | 1.437300 |
The Colab notebooks use seed 3407. I tried it on a Tesla T4 and an A100, and the difference in training losses persists.
I can't say exactly what's going on, but I narrowed it down to accelerate's new version:
- I checked whether bitsandbytes, transformers, datasets, peft, trl, and even Tim Dettmers' dataset changed - none did.
- Accelerate was updated 20 hours ago, and the Colab notebooks show that if you install the old version, the losses match again.
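For completeness, here is a minimal sketch (not taken from the notebooks themselves) of how the installed versions can be dumped in the Colab runtime to confirm that only `accelerate` differs between the two runs:

```python
# Minimal sketch: print the installed versions of the packages checked above,
# to confirm that only accelerate changed between the two Colab runs.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("accelerate", "bitsandbytes", "transformers", "datasets", "peft", "trl"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```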
Sorry @muellerzr, I didn't report back on the final loss - but to insert myself into the conversation between you and @Qubitium: since @Qubitium tested over multiple epochs, and the old behavior was not reshuffling across epochs (if I'm not mistaken), could that possibly be why 0.24's train loss is lower than 0.25's?
I'm not sure about the eval loss though. Sadly I don't have a beefy GPU, so running eval will be quite painful.
I can test the new PR as well!
@muellerzr Ohh noo, I doubt it's a performance degradation - you're comparing the final-epoch error, whilst I'm comparing the per-step error.
If you can in fact show that at the end of each epoch the F1 accuracy is the same or better, I'll trust your changes and just update my comparisons 😃
I normally compare step by step, since after every change I make to speed up Llama training I have to verify that the errors match - but I don't want to wait 5 hours for a final error, so I just compare the first 60 steps, for example.
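Roughly, the per-step check looks like this (a hypothetical sketch; `losses_a` and `losses_b` stand for the logged training losses of two runs and are not names from my actual test code):

```python
# Hypothetical sketch of the per-step comparison: given two lists of logged
# training losses, check that the first N steps agree within a tolerance.
import math

def losses_match(losses_a, losses_b, n_steps=60, tol=1e-4):
    """Return True if the first n_steps losses agree within an absolute tolerance."""
    for step, (a, b) in enumerate(zip(losses_a[:n_steps], losses_b[:n_steps]), start=1):
        if not math.isclose(a, b, abs_tol=tol):
            print(f"Step {step}: {a:.6f} vs {b:.6f} differ")
            return False
    return True

# With the two runs tabulated above, the very first step already differs.
print(losses_match([1.0512, 1.3174], [1.6916, 1.6624]))
```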
Then that’s how it should be and things are running properly.
What if you used <0.24.0? We hit a bug where the data was never being shuffled, so you were overfitting on the same samples being shown in the same order every epoch.
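As an illustration of what correct per-epoch reshuffling looks like (plain PyTorch's `DistributedSampler`, not Accelerate's internal code):

```python
# Illustration (plain PyTorch, not Accelerate's internals): a sampler only
# reshuffles across epochs if it is told which epoch it is on. Without the
# set_epoch() call, every epoch yields the samples in the same order, which
# is the overfitting pattern described above.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
# num_replicas/rank are given explicitly so no process group is needed for the demo.
sampler = DistributedSampler(dataset, num_replicas=1, rank=0, shuffle=True, seed=3407)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # comment this out and both epochs print the same order
    print(epoch, [batch[0].tolist() for batch in loader])
```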
@muellerzr Oh ok, interesting - sorry, I didn't get around to testing it (kinda forgot on my end, so many apologies!!). Happy New Year!
Hi @danielhanchen @Qubitium, after some thorough testing we've found that if you set the seed using `accelerate.utils.set_seed`, the difference in results should be negligible. I saw a difference of ~0.02% accuracy when training from scratch on the CV example. However, since this is needed, we're going to revert this behavior and keep the old one as the default - but at the end of the day this is just a different sampling technique with a pinch more setup.

Yes, this is the final train loss on 1 GPU over 6 epochs.
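For reference, the seeding helper mentioned above is part of Accelerate's public utilities; a minimal sketch (3407 is the seed from the notebooks above, the rest is illustrative):

```python
# Minimal sketch: seed Python, NumPy and PyTorch via Accelerate's helper before
# building the model and dataloaders, so per-step losses are comparable.
from accelerate.utils import set_seed

set_seed(3407)  # the seed used in the Colab notebooks above
```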
We also observed a huge difference in train/loss between 0.24 and 0.25.
Llama 2 13B native bf16 finetune using SFT and FlashAttention 2, with all training params equal other than the accelerate package:
@muellerzr I’ll do just that! I’ll report back to compare the entire training loss curves!
No worries at all! I also don't like that we radically shifted the performance of the library/framework though; it should be configurable in the end. Just so we can make sure, if you'd be willing to run your script again and verify that the performance does increase on your seed, that would be great. If it is a degradation instead, all the more reason. (And real data is better than something like our little NLP example 😃)
@muellerzr Waitt, so you mean v0.22, 0.23, and 0.24 were all handling the sampling incorrectly? Oh my! Great catch finding this!!!