transformers: auto_find_batch_size=True and eval_steps=ratio unexpected behavior
System Info
- transformers version: 4.30.1
- Platform: Linux-5.7.19-050719-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.15.1
- Safetensors version: 0.3.1
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I don’t have a full example that I can share, but I think this is a simple enough problem that one may not be needed.
I am using `TrainingArguments(auto_find_batch_size=True, eval_steps=0.1, per_device_train_batch_size=1024)`. With a batch size of 1024, I have 657 training steps, and the eval ratio appears to be resolved against that total, so evaluation happens every 66 steps.
However, auto_find_batch_size then lowers the batch size to 16, which corresponds to 83787 steps, but evaluation is still performed every 66 steps.
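For reference, here is a minimal sketch of the arguments involved (output_dir and evaluation_strategy are just placeholders; the real model and dataset come from my own script, which I can't share):

```python
from transformers import TrainingArguments

# Placeholder arguments mirroring my setup.
args = TrainingArguments(
    output_dir="out",
    auto_find_batch_size=True,        # retries with smaller batch sizes on OOM
    per_device_train_batch_size=1024,
    eval_steps=0.1,                   # intended as a ratio of total training steps
    evaluation_strategy="steps",
)
```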
Expected behavior
I expected the eval steps to be recomputed when the batch size was updated. In the example above, I expected evaluation to occur every ~8000 steps.
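Just to spell out the arithmetic with the numbers above (not claiming this is exactly how the Trainer rounds it):

```python
import math

print(math.ceil(0.1 * 657))    # 66   -> interval currently used, from the original step count
print(math.ceil(0.1 * 83787))  # 8379 -> interval I expected after the fallback to batch size 16
```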
About this issue
- State: closed
- Created a year ago
- Reactions: 1
- Comments: 26 (9 by maintainers)
Finally fixed on main 😄
Great! I’ll open a PR, thank you so much for your patience and clear bug report @edmcman
Yes, it is working for me too now!
(Edit: I forgot I added the exception for debugging 🤣)
Thanks, I will try this again. It’s possible I goofed and didn’t reload the new code or something when I thought I did.
Ping to keep fresh
Enjoy your holiday. If I have some spare time I’ll see if I can figure out what is still going wrong…
I’m sorry to report that I still think it is broken!
I can’t run it on colab because I’m out of free GPU usage, but I did upload it, and I think it should work if you have GPU access there:
https://colab.research.google.com/drive/1A-MzFHIbWtrtO4tjf2GROAdfAueEHidw?usp=sharing
Okay, I think it should be fixed now. Can you try again via the same branch?
I see the problem I think:
Since this actually modifies `args.eval_steps`, the ratio will be lost the first time we run this code. E.g., this will set `args.eval_steps` to 66 and lose the 0.1.
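To illustrate what I mean, here is a simplified standalone sketch of that pattern (not the actual Trainer code):

```python
import math
from dataclasses import dataclass


@dataclass
class Args:
    eval_steps: float = 0.1  # a ratio of total training steps


def resolve_eval_steps(args, max_steps):
    # Buggy pattern: overwrites args.eval_steps in place, so the original
    # ratio is discarded after the first call.
    if args.eval_steps is not None and args.eval_steps < 1:
        args.eval_steps = math.ceil(args.eval_steps * max_steps)
    return args.eval_steps


args = Args()
print(resolve_eval_steps(args, max_steps=657))    # 66: the ratio is consumed here
print(resolve_eval_steps(args, max_steps=83787))  # still 66: 66 >= 1, so it is never rescaled
```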
Looks like `max_steps` is not being updated.
With the patch, still evaling every 66 steps. Let me try to make a reproducer. It probably won’t be minimal though…
Let me try your patch first.