transformers: `resume_from_checkpoint` function fails because "There seems to be not a single sample in your epoch_iterator"

System Info

transformers version - 4.33.2

I’m using the Trainer API as follows, so that it pushes the latest checkpoint to the Hugging Face Hub each epoch:

from transformers import TrainingArguments, Trainer

new_model_name = "videomae-finetuned"
num_epochs = 50
batch_size = 8
steps_per_epoch = train_dataset.num_videos // batch_size

args = TrainingArguments(
    output_dir=new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,  # Only the last 2 checkpoints are saved. Older ones are deleted.
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    max_steps=steps_per_epoch * num_epochs, # Effectively duplicates `num_train_epochs`; the Trainer throws otherwise because the IterableDataset has no length.
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    hub_strategy="checkpoint",
    push_to_hub=True,
    num_train_epochs=num_epochs,
)
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)],
)
import traceback

try:
    results = trainer.train()
except RuntimeError as e:
    print(traceback.format_exc())

After about 25 epochs some exception occurs (never mind what). So I take the last checkpoint that was pushed to the Hugging Face Hub (from here, if it matters), put it on my Drive, and change the training code to this:

import pathlib
import traceback

try:
    results = trainer.train(resume_from_checkpoint=pathlib.Path(f"./drive/MyDrive/").joinpath("last-checkpoint"))
except RuntimeError as e:
    print(traceback.format_exc())

And rerun the whole notebook. Then, after some time (not immediately), it prints:

There seems to be not a single sample in your epoch_iterator, stopping training at step 5500! This is expected if you’re using an IterableDataset and set num_steps (12500) higher than the number of available samples.

And then it fails.

I do have an IterableDataset with 2000 training videos, I’m using a batch size of 8, and I want to run for 50 epochs, so I’m pretty sure 12500 is (2000 / 8) * 50. But I still don’t understand the message: why is it a problem that num_steps (12500) is higher than the number of samples (2000)?
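For reference, just spelling out the arithmetic, those numbers match the values computed for my TrainingArguments above:

num_videos = 2000
batch_size = 8
num_epochs = 50

steps_per_epoch = num_videos // batch_size   # 250
max_steps = steps_per_epoch * num_epochs     # 12500, i.e. the "num_steps" in the message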

Thank you!

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I can’t easily share a reproduction for my own code, but it is based on your guide, and I believe the issue will reproduce with that as well.

Expected behavior

Training continues from the same state it was in when it stopped.

About this issue

  • Original URL
  • State: open
  • Created 9 months ago
  • Reactions: 5
  • Comments: 16 (11 by maintainers)

Most upvoted comments

Hello, your iterable dataset should reiterate when reaching the end if the number of steps > the number of samples in the iterable dataset.
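Something along these lines, for example (a rough sketch only; RepeatingDataset is not an existing transformers/datasets class, just an illustration):

import torch

class RepeatingDataset(torch.utils.data.IterableDataset):
    """Cycle over a finite iterable dataset so it never runs out before max_steps."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        while True:  # restart from the beginning once the underlying iterator is exhausted
            yield from self.dataset

# train_dataset = RepeatingDataset(train_dataset)  # then pass this to the Trainer as usual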

I’m sorry, but this response doesn’t make sense, and this issue should not have been closed so prematurely. The max number of steps passed to the Trainer indicates the maximum number of steps over the entire training run. However, when resuming from a checkpoint, the run stops training as soon as the number of steps already trained exceeds the number of samples within a single epoch.

To clarify, you are technically correct that “your iterable dataset should reiterate when reaching the end.” However, the Trainer and/or IterableDataset classes should handle this, as they already do when not resuming from a checkpoint.

It is unclear why resuming from checkpoint causes them to fail to handle this. When not resuming from checkpoint, the training logic is as you expect: if you run out of samples in the current epoch but haven’t reached max steps yet, you just start a new epoch until you do reach max steps.

@pacman100 Hello, I am also facing the same issue as @Ubadub is reporting. Here is my code to reproduce the issue:

import os
import shutil
import transformers
import datasets


if os.path.exists("./output"):
    shutil.rmtree("./output")


def my_generator():
    for i in range(10):
        yield {"input_ids": [1000], "labels": [1000]}


# This dataset yields 10 examples only, but let's set max_steps=20.
dataset = datasets.IterableDataset.from_generator(my_generator)
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
args = transformers.TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    max_steps=20,
    save_steps=10,
    report_to="none",
)
trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
)
trainer.train()
# Trainer runs 20 steps, producing both checkpoint-10 and checkpoint-20.
assert os.path.exists("./output/checkpoint-10")
assert os.path.exists("./output/checkpoint-20")

# Now remove checkpoint-20 and resume training from checkpoint-10.
shutil.rmtree("./output/checkpoint-20")
trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
)
trainer.train(resume_from_checkpoint=True)
# This time, trainer does nothing. checkpoint-20 is not produced.
assert os.path.exists("./output/checkpoint-10")
assert not os.path.exists("./output/checkpoint-20")

output:

{'train_runtime': 20.8257, 'train_samples_per_second': 0.96, 'train_steps_per_second': 0.96, 'train_loss': 0.0, 'epoch': 1.5}
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:20<00:00,  1.04s/it]
There were missing keys in the checkpoint model loaded: ['lm_head.weight'].
  0%|                                                                                                                                                                            | 0/20 [00:00<?, ?it/s]
There seems to be not a single sample in your epoch_iterator, stopping training at step 10! This is expected if you're using an IterableDataset and set num_steps (20) higher than the number of available samples.
{'train_runtime': 0.0044, 'train_samples_per_second': 4513.401, 'train_steps_per_second': 4513.401, 'train_loss': 0.0, 'epoch': 0.5}
  0%|

When not resuming, the Trainer runs the full 20 steps. When resuming from a checkpoint, it stops at step 10 without training any further. This seems inconsistent.

As discussed in https://github.com/huggingface/transformers/issues/26635, I think the correct behavior, as suggested by the current documentation of max_steps, is for the Trainer to reiterate the dataset until 20 steps are executed, even if the dataset is finite and has fewer than 20 samples: https://github.com/huggingface/transformers/blob/95091e1582688c2ffd8342918f3eb0e3abeeb0c8/src/transformers/training_args.py#L236-L239
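As a stopgap in the repro above, I believe the empty epoch_iterator can be avoided by making the generator itself cycle, so the dataset never runs out before max_steps; with that change both the initial run and the resumed run should complete all 20 steps (though the epoch accounting changes):

import datasets

def my_cycling_generator():
    while True:  # never exhausts; the Trainer stops at max_steps instead
        for _ in range(10):
            yield {"input_ids": [1000], "labels": [1000]}

dataset = datasets.IterableDataset.from_generator(my_cycling_generator)
# Use this dataset for both Trainer instances above, including the one calling
# trainer.train(resume_from_checkpoint=True).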

I’m using Python v3.10.12, transformers==4.36.2, datasets==2.16.1, accelerate==0.26.0, torch==2.1.2.