transformers: resuming training with `resume_from_checkpoint` fails with "There seems to be not a single sample in your epoch_iterator"
System Info
transformers version - 4.33.2
I’m using the Trainer API as follows, so that it pushes the latest checkpoint to the Hugging Face Hub each epoch:
```python
from transformers import TrainingArguments, Trainer

new_model_name = "videomae-finetuned"
num_epochs = 50
batch_size = 8
steps_per_epoch = train_dataset.num_videos // batch_size

args = TrainingArguments(
    output_dir=new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,  # Only the last 2 checkpoints are saved. Older ones are deleted.
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    max_steps=steps_per_epoch * num_epochs,  # Duplication of `num_train_epochs` because it throws otherwise.
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    hub_strategy="checkpoint",
    push_to_hub=True,
    num_train_epochs=num_epochs,
)
```
```python
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)],
)
```
```python
import traceback

try:
    results = trainer.train()
except RuntimeError as e:
    print(traceback.format_exc())
```
And after about 25 epochs there’s some exception (never mind what). So I take the last checkpoint that was saved to the Hub (from here, if it matters), put it on my Drive, and change the training code to this:
```python
import pathlib
import traceback

try:
    results = trainer.train(
        resume_from_checkpoint=pathlib.Path("./drive/MyDrive/").joinpath("last-checkpoint")
    )
except RuntimeError as e:
    print(traceback.format_exc())
```
And rerun the whole notebook. Then, after some time (not immediately), it prints:
There seems to be not a single sample in your epoch_iterator, stopping training at step 5500! This is expected if you’re using an IterableDataset and set num_steps (12500) higher than the number of available samples.
And then it fails.
I do have an IterableDataset with 2000 training videos, and I’m using batch size 8 and want to run for 50 epochs, so I’m pretty sure 12500 is (2000/8)*50, but I still don’t understand the message. Why is it problematic that num_steps (12500) > number of samples (2000)?
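For reference, here is the arithmetic behind those numbers, using the figures from the question (the step-5500 comment is just a rough consistency check):

```python
num_videos = 2000
batch_size = 8
num_epochs = 50

steps_per_epoch = num_videos // batch_size   # 250 optimizer steps per epoch
max_steps = steps_per_epoch * num_epochs     # 250 * 50 = 12500, the num_steps in the warning

# The warning mentions stopping at step 5500, i.e. 5500 / 250 = 22 full epochs,
# which is roughly consistent with the exception hitting after "about 25 epochs".
```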
Thank you!
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I can’t really provide one for my code, but it is based on your guide, and I believe the issue will reproduce for that as well.
Expected behavior
Training continues from the same state in which it stopped before.
About this issue
- Original URL
- State: open
- Created 9 months ago
- Reactions: 5
- Comments: 16 (11 by maintainers)
I’m sorry, but this response doesn’t make sense, and this issue should not be marked as closed so prematurely. The max number of steps passed to the Trainer indicates the maximum number of steps over the entire training run. However, when resuming from checkpoint, the run will stop training if the number of samples within a single epoch is less than the number of steps.
To clarify, you are technically correct that “your iterable dataset should reiterate when reaching the end.” However, the `Trainer` and/or `IterableDataset` classes should handle this, as they already do when not resuming from checkpoint. It is unclear why resuming from checkpoint causes them to fail to handle this. When not resuming from checkpoint, the training logic is as you expect: if you run out of samples in the current epoch but haven’t reached max steps yet, you just start a new epoch until you do reach max steps.
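To make “reiterate” concrete, here is a minimal sketch (hypothetical names, not the videomae setup from the report) of a plain torch `IterableDataset` whose `__iter__` starts over every time a new epoch begins:

```python
import torch
from torch.utils.data import IterableDataset


class VideoIterableDataset(IterableDataset):
    """Hypothetical stand-in for the real dataset: every call to __iter__
    yields the samples from the beginning, so each new epoch gets a fresh pass."""

    def __init__(self, num_samples=2000):
        self.num_samples = num_samples

    def __iter__(self):
        for i in range(self.num_samples):
            # Placeholder tensors standing in for real video frames and labels.
            yield {"pixel_values": torch.zeros(3, 224, 224), "labels": i % 2}
```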
@pacman100 Hello, I am also facing the same issue as @Ubadub is reporting. Here is my code to reproduce the issue:
output:
When not resuming, the Trainer runs until step 20. When resuming from a checkpoint, it only tries to run until step 10. This seems inconsistent.
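The original reproduction script isn’t shown above, but a minimal hypothetical sketch consistent with that description (a finite iterable dataset of 10 samples, `max_steps=20`, a tiny made-up model, and illustrative paths) would look roughly like this:

```python
import torch
from torch.utils.data import IterableDataset
from transformers import PretrainedConfig, PreTrainedModel, Trainer, TrainingArguments


class TinyConfig(PretrainedConfig):
    model_type = "tiny-repro"


class TinyModel(PreTrainedModel):
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x=None, labels=None):
        logits = self.linear(x)
        loss = torch.nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}


class TenSamples(IterableDataset):
    # Finite iterable dataset: only 10 samples per pass, no __len__.
    def __iter__(self):
        for i in range(10):
            yield {"x": torch.randn(4), "labels": i % 2}


args = TrainingArguments(
    output_dir="repro",
    per_device_train_batch_size=1,
    max_steps=20,          # more steps than there are samples in one pass
    save_strategy="steps",
    save_steps=5,
    report_to="none",
)

# First run: reaches step 20, restarting the iterable dataset as needed.
Trainer(model=TinyModel(TinyConfig()), args=args, train_dataset=TenSamples()).train()

# Second run: per the behavior described above, resuming from the step-5 checkpoint
# stops at the end of the first pass (step 10) with the
# "not a single sample in your epoch_iterator" warning instead of reaching step 20.
Trainer(model=TinyModel(TinyConfig()), args=args, train_dataset=TenSamples()).train(
    resume_from_checkpoint="repro/checkpoint-5"
)
```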
As discussed in https://github.com/huggingface/transformers/issues/26635, I think the correct behavior suggested by the current documentation of `max_steps` is for the Trainer to reiterate the dataset until 20 steps are executed, even if the dataset is finite and smaller than 20: https://github.com/huggingface/transformers/blob/95091e1582688c2ffd8342918f3eb0e3abeeb0c8/src/transformers/training_args.py#L236-L239
I’m using Python v3.10.12, transformers==4.36.2, datasets==2.16.1, accelerate==0.26.0, torch==2.1.2.