transformers: auto_find_batch_size=True and eval_steps=ratio unexpected behavior

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.7.19-050719-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I don’t have a full example that I can share, but I think this is a simple enough problem that one may not be needed.

I am using TrainingArguments(auto_find_batch_size=True, eval_steps=0.1, per_device_train_batch_size=1024). With a batch size of 1024, I have 657 training steps. The eval ratio appears to be resolved against this total, with evaluation happening every 66 steps (ceil(0.1 * 657) = 66).

However, auto_find_batch_size lowers the batch size to 16, giving a corresponding 83787 steps, yet evaluation is still performed every 66 steps.
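For reference, here is a minimal sketch of the relevant arguments (output_dir is a placeholder and the model/dataset wiring is omitted, so this is illustrative rather than a copy of my actual script):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="out",                  # placeholder
        evaluation_strategy="steps",
        eval_steps=0.1,                    # ratio of max_steps; resolves to 66 here
        per_device_train_batch_size=1024,  # halved on CUDA OOM by auto_find_batch_size
        auto_find_batch_size=True,
    )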

Expected behavior

I expected eval_steps to be recomputed when the batch size was updated. In the example above, I expected evaluation to occur every ceil(0.1 * 83787) = 8379 steps, i.e. roughly every 8000 steps.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 26 (9 by maintainers)

Most upvoted comments

Finally fixed on main 😄

Great! I’ll open a PR, thank you so much for your patience and clear bug report @edmcman

Yes, it is working for me too now!

(Edit: I forgot I added the exception for debugging 🤣)

Thanks, I will try this again. It’s possible I goofed and didn’t reload the new code or something when I thought I did.

Ping to keep fresh

Enjoy your holiday. If I have some spare time I’ll see if I can figure out what is still going wrong…

I’m sorry to report that I still think it is broken!

I can’t run it on Colab because I’m out of free GPU usage, but I did upload it, and I think it should work if you have GPU access there:

https://colab.research.google.com/drive/1A-MzFHIbWtrtO4tjf2GROAdfAueEHidw?usp=sharing

Okay, I think it should be fixed now. Can you try again via the same branch?

I think I see the problem:

        # A fractional eval_steps is resolved in place, overwriting the
        # ratio with an absolute step count:
        if args.eval_steps and args.eval_steps < 1:
            args.eval_steps = math.ceil(max_steps * args.eval_steps)

Since this assigns back to args.eval_steps, the ratio is lost the first time this code runs: args.eval_steps is set to 66 and the 0.1 is gone, so it can never be recomputed against the new max_steps.
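Just as a sketch of the direction (not necessarily the actual upstream fix): resolving the ratio into a local variable instead of mutating args would let a rerun recompute it against the new max_steps:

    import math

    # Keep the ratio in args.eval_steps and resolve it locally, so that a
    # second pass (after auto_find_batch_size shrinks the batch) recomputes
    # the absolute step count from the updated max_steps.
    eval_steps = args.eval_steps
    if eval_steps and eval_steps < 1:
        eval_steps = math.ceil(max_steps * eval_steps)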

Looks like max_steps is not being updated after the batch size changes.
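For context, max_steps for epoch-based training is derived from the batch size roughly like this (a simplification: single device, no gradient accumulation; names are illustrative):

    import math

    # max_steps scales inversely with the batch size, so it has to be
    # recomputed whenever auto_find_batch_size retries with a smaller batch.
    def compute_max_steps(num_train_samples, train_batch_size, num_train_epochs):
        steps_per_epoch = math.ceil(num_train_samples / train_batch_size)
        return num_train_epochs * steps_per_epoch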

With the patch, it’s still evaluating every 66 steps. Let me try to make a reproducer. It probably won’t be minimal, though…

Let me try your patch first.