transformers: Wrong checkpoint got deleted when use_mtime=True

System Info

transformers 4.25.1, Ubuntu 22.04 (in Docker)

Description

We have a job that checkpoints every 200 iterations and keeps at most 3 checkpoints. However, in an 1100-iteration job we found that only checkpoints 400/600/800 remained at the end. Checkpoint 200 was rotated out as expected, but checkpoint 1000 was missing as well.
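
For reference, a minimal sketch of the setup described above using standard `TrainingArguments` fields (the `output_dir` value and exact step counts are taken from this report; everything else is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    save_strategy="steps",
    save_steps=200,        # checkpoint every 200 iterations
    save_total_limit=3,    # keep at most 3 checkpoints
    max_steps=1100,        # the 1100-iteration job described above
)
```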

Looking into the log, we can see checkpoint-1000 got deleted immediately after saving. See the last line:

 91%|█████████ | 1000/1100 [...][INFO|trainer.py:2693] 2023-10-20 12:24:47,419 >> Saving model checkpoint to output/checkpoint-1000
[INFO|configuration_utils.py:447] 2023-10-20 12:24:49,472 >> Configuration saved in output/checkpoint-1000/config.json
[INFO|modeling_utils.py:1637] 2023-10-20 12:24:55,486 >> Model weights saved in output/checkpoint-1000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2023-10-20 12:24:55,799 >> tokenizer config file saved in output/checkpoint-1000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2023-10-20 12:24:56,111 >> Special tokens file saved in output/checkpoint-1000/special_tokens_map.json
[INFO|trainer.py:2771] 2023-10-20 12:28:06,058 >> Deleting older checkpoint [output/checkpoint-1000] due to args.save_total_limit

And this is likely the root cause: checkpoints are rotated with use_mtime=True, i.e. sorted by modification time.

The filesystem underneath our output dir is a FUSE filesystem backed by HTTP blob storage, so it is most likely not POSIX-compliant. The mtimes it reports are probably wrong, or perhaps simply 0.
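
To illustrate the failure mode, here is a simplified reconstruction of the rotation logic, not the actual Trainer code (the function names `sorted_checkpoints` and `rotate_checkpoints` are illustrative). When sorting by mtime, the oldest entries beyond `save_total_limit` are deleted; if the FUSE backend reports the same (or zero) mtime for every directory, the sort order is effectively arbitrary, and the freshly saved checkpoint can land at the front of the list and be deleted:

```python
import os
import re

def sorted_checkpoints(output_dir, use_mtime=True):
    # Collect checkpoint-* directories, then order them either by
    # filesystem mtime or by the step number in the directory name.
    paths = [
        os.path.join(output_dir, d)
        for d in os.listdir(output_dir)
        if re.fullmatch(r"checkpoint-\d+", d)
    ]
    if use_mtime:
        # On a non-POSIX FUSE mount, getmtime may return 0 (or some
        # constant) for every directory, making this order arbitrary.
        return sorted(paths, key=os.path.getmtime)
    # Name-based ordering: extract the step number from the dir name.
    return sorted(paths, key=lambda p: int(re.search(r"checkpoint-(\d+)", p).group(1)))

def rotate_checkpoints(output_dir, save_total_limit=3, use_mtime=True):
    ckpts = sorted_checkpoints(output_dir, use_mtime=use_mtime)
    # Everything except the newest `save_total_limit` entries is deleted.
    # With garbage mtimes, the just-saved checkpoint can end up here.
    for stale in ckpts[:-save_total_limit]:
        print(f"Deleting older checkpoint [{stale}] due to save_total_limit")
```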

The code was first introduced by @sgugger in:

The fix should be straightforward: set use_mtime to False so rotation relies on the numeric part of the directory name (sketched below). But we'd like to understand why it is currently True.
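
For concreteness, this is the name-based ordering being proposed (a sketch, not the actual patch; the helper `checkpoint_step` is hypothetical):

```python
import re

def checkpoint_step(path: str) -> int:
    # Extract the numeric suffix from a "checkpoint-<step>" directory name;
    # this ordering does not depend on filesystem metadata at all.
    return int(re.search(r"checkpoint-(\d+)", path).group(1))

dirs = ["output/checkpoint-1000", "output/checkpoint-200", "output/checkpoint-800"]
print(sorted(dirs, key=checkpoint_step))
# -> ['output/checkpoint-200', 'output/checkpoint-800', 'output/checkpoint-1000']
```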

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

Just a simple fix; let me send a PR.

#28364 might not fix it, but it is related.