transformers: auto_find_batch_size=True and eval_steps=ratio unexpected behavior

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.7.19-050719-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I don’t have a full example that I can share, but I think this is a simple enough problem that one may not be needed.

I am using TrainingArguments(auto_find_batch_size=True, eval_steps=0.1, per_device_train_batch_size=1024). With a batch size of 1024, I have 657 training steps. The eval ratio appears to be resolved against this total, with evaluation happening every 66 steps (ceil(0.1 * 657) = 66).

However, auto_find_batch_size lowers the batch size to 16, giving a corresponding 83787 steps, yet evaluation is still performed every 66 steps.
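For reference, here is a minimal sketch of the relevant arguments (output_dir is a placeholder and the model/dataset wiring is omitted, so this is illustrative rather than a copy of my actual script):

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="out",                  # placeholder
        evaluation_strategy="steps",
        eval_steps=0.1,                    # ratio of max_steps; resolves to 66 here
        per_device_train_batch_size=1024,  # halved on CUDA OOM by auto_find_batch_size
        auto_find_batch_size=True,
    )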

Expected behavior

I expected eval_steps to be recomputed when the batch size was updated. In the example above, I expected evaluation to occur every ceil(0.1 * 83787) = 8379 steps, i.e. roughly every 8000 steps.

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 26 (9 by maintainers)

Most upvoted comments

Finally fixed on main 😄

Great! I’ll open a PR, thank you so much for your patience and clear bug report @edmcman

Yes, it is working for me too now!

(Edit: I forgot I added the exception for debugging 🤣)

Thanks, I will try this again. It’s possible I goofed and didn’t reload the new code or something when I thought I did.

Ping to keep fresh

Enjoy your holiday. If I have some spare time I’ll see if I can figure out what is still going wrong…

I’m sorry to report that I still think it is broken!

I can’t run it on Colab because I’m out of free GPU usage, but I did upload it, and I think it should work if you have GPU access there:

https://colab.research.google.com/drive/1A-MzFHIbWtrtO4tjf2GROAdfAueEHidw?usp=sharing

Okay, I think it should be fixed now. Can you try again via the same branch?

I think I see the problem:

        # A fractional eval_steps is resolved in place, overwriting the
        # ratio with an absolute step count:
        if args.eval_steps and args.eval_steps < 1:
            args.eval_steps = math.ceil(max_steps * args.eval_steps)

Since this assigns back to args.eval_steps, the ratio is lost the first time this code runs: args.eval_steps is set to 66 and the 0.1 is gone, so it can never be recomputed against the new max_steps.
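Just as a sketch of the direction (not necessarily the actual upstream fix): resolving the ratio into a local variable instead of mutating args would let a rerun recompute it against the new max_steps:

    import math

    # Keep the ratio in args.eval_steps and resolve it locally, so that a
    # second pass (after auto_find_batch_size shrinks the batch) recomputes
    # the absolute step count from the updated max_steps.
    eval_steps = args.eval_steps
    if eval_steps and eval_steps < 1:
        eval_steps = math.ceil(max_steps * eval_steps)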

Looks like max_steps is not being updated after the batch size changes.
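For context, max_steps for epoch-based training is derived from the batch size roughly like this (a simplification: single device, no gradient accumulation; names are illustrative):

    import math

    # max_steps scales inversely with the batch size, so it has to be
    # recomputed whenever auto_find_batch_size retries with a smaller batch.
    def compute_max_steps(num_train_samples, train_batch_size, num_train_epochs):
        steps_per_epoch = math.ceil(num_train_samples / train_batch_size)
        return num_train_epochs * steps_per_epoch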

With the patch, it’s still evaluating every 66 steps. Let me try to make a reproducer. It probably won’t be minimal, though…

Let me try your patch first.