transformers: Wrong checkpoint got deleted when use_mtime=True
System Info
- transformers 4.25.1
- Ubuntu 22.04 (in Docker)
Description
We have a job that checkpoints every 200 iterations and keeps at most 3 checkpoints. However, in an 1100-iteration job we found that only checkpoints 400/600/800 remained at the end. Checkpoint 200 was rotated out as expected, but checkpoint 1000 was missing as well.
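For reference, a minimal sketch of how such a job is typically configured (the exact arguments of our job may differ; output_dir and max_steps here are illustrative):

```python
from transformers import TrainingArguments

# Illustrative configuration matching the description above:
# save a checkpoint every 200 steps and keep at most 3 of them.
args = TrainingArguments(
    output_dir="output",
    max_steps=1100,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=3,
)
```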
Looking at the log, we can see that checkpoint-1000 was deleted immediately after being saved; see the last line:
91%|█████████ | 1000/1100 [...][INFO|trainer.py:2693] 2023-10-20 12:24:47,419 >> Saving model checkpoint to output/checkpoint-1000
[INFO|configuration_utils.py:447] 2023-10-20 12:24:49,472 >> Configuration saved in output/checkpoint-1000/config.json
[INFO|modeling_utils.py:1637] 2023-10-20 12:24:55,486 >> Model weights saved in output/checkpoint-1000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2023-10-20 12:24:55,799 >> tokenizer config file saved in output/checkpoint-1000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2023-10-20 12:24:56,111 >> Special tokens file saved in output/checkpoint-1000/special_tokens_map.json
[INFO|trainer.py:2771] 2023-10-20 12:28:06,058 >> Deleting older checkpoint [output/checkpoint-1000] due to args.save_total_limit
The checkpoint-rotation code that sorts checkpoints by mtime (use_mtime=True) is likely the root cause.
The filesystem underneath the output dir is a FUSE filesystem backed by HTTP blob storage, so it is most likely not POSIX-compliant. The mtime it reports is probably wrong, or perhaps simply 0.
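To illustrate (this is a simplified sketch, not the actual Trainer code): checkpoint rotation first orders the checkpoint directories, and with use_mtime=True that ordering depends entirely on filesystem timestamps.

```python
import os
import re
from pathlib import Path

def sorted_checkpoints(output_dir: str, use_mtime: bool = True):
    """Simplified sketch of how checkpoints are ordered before rotation.

    With use_mtime=True the order comes from os.path.getmtime(); if the
    FUSE/blob-storage layer reports a bogus mtime (e.g. 0) for the freshly
    written directory, checkpoint-1000 sorts as the "oldest" checkpoint.
    With use_mtime=False the order comes from the step number embedded in
    the directory name, which does not depend on the filesystem at all.
    """
    ordering = []
    for path in Path(output_dir).glob("checkpoint-*"):
        if not path.is_dir():
            continue
        if use_mtime:
            ordering.append((os.path.getmtime(path), str(path)))
        else:
            match = re.search(r"checkpoint-([0-9]+)", str(path))
            if match:
                ordering.append((int(match.group(1)), str(path)))
    return [p for _, p in sorted(ordering)]
```

Rotation then deletes everything except the last save_total_limit entries of that list, so whatever sorts first gets deleted first.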
The code was first introduced by @sgugger in:
The fix should be straightforward: set use_mtime to False so rotation relies on the numeric part of the checkpoint directory name. But we'd like to know why it is currently True.
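A small hypothetical reproduction, reusing the sorted_checkpoints sketch above, showing why numeric ordering is the safer choice when mtime cannot be trusted:

```python
import os
import tempfile

# Hypothetical scenario: the blob-storage FUSE layer reports mtime 0 for the
# newest checkpoint directory while the older ones keep sane timestamps.
with tempfile.TemporaryDirectory() as out:
    for step, mtime in [(400, 1_697_804_000), (600, 1_697_805_000),
                        (800, 1_697_806_000), (1000, 0)]:
        path = os.path.join(out, f"checkpoint-{step}")
        os.makedirs(path)
        os.utime(path, (mtime, mtime))

    # mtime ordering puts checkpoint-1000 first, so keeping the last 3
    # checkpoints deletes it right after it was saved:
    print(sorted_checkpoints(out, use_mtime=True))
    # numeric ordering deletes checkpoint-400 instead, as intended:
    print(sorted_checkpoints(out, use_mtime=False))
```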
About this issue
- Original URL
- State: closed
- Created 8 months ago
- Comments: 15 (10 by maintainers)
Commits related to this issue
- Do not use mtime for checkpoint rotation. Resolve https://github.com/huggingface/transformers/issues/26961 — committed to xkszltl/transformers by xkszltl 5 months ago
- Do not use mtime for checkpoint rotation. (#28862) Resolve https://github.com/huggingface/transformers/issues/26961 — committed to huggingface/transformers by xkszltl 5 months ago
Just a simple fix, let me send a PR.
#28364 might not fix it, but it is related.