transformers: Wrong checkpoint got deleted when use_mtime=True

System Info

transformers 4.25.1, Ubuntu 22.04 (in Docker)

Description

We have a job that checkpoints every 200 iterations and keeps at most 3 checkpoints. However, in an 1100-iteration job we found that only checkpoints 400/600/800 remained at the end. Checkpoint 200 was rotated out as expected, but checkpoint 1000 was missing as well.
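
For reference, a minimal sketch of the setup described above using standard `TrainingArguments` fields (the `output_dir` value and exact step counts are taken from this report; everything else is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    save_strategy="steps",
    save_steps=200,        # checkpoint every 200 iterations
    save_total_limit=3,    # keep at most 3 checkpoints
    max_steps=1100,        # the 1100-iteration job described above
)
```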

Looking into the log, we can see checkpoint-1000 got deleted immediately after saving. See the last line:

 91%|█████████ | 1000/1100 [...][INFO|trainer.py:2693] 2023-10-20 12:24:47,419 >> Saving model checkpoint to output/checkpoint-1000
[INFO|configuration_utils.py:447] 2023-10-20 12:24:49,472 >> Configuration saved in output/checkpoint-1000/config.json
[INFO|modeling_utils.py:1637] 2023-10-20 12:24:55,486 >> Model weights saved in output/checkpoint-1000/pytorch_model.bin
[INFO|tokenization_utils_base.py:2157] 2023-10-20 12:24:55,799 >> tokenizer config file saved in output/checkpoint-1000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2164] 2023-10-20 12:24:56,111 >> Special tokens file saved in output/checkpoint-1000/special_tokens_map.json
[INFO|trainer.py:2771] 2023-10-20 12:28:06,058 >> Deleting older checkpoint [output/checkpoint-1000] due to args.save_total_limit

And this is likely the root cause: checkpoints are rotated with use_mtime=True, i.e. sorted by modification time.

The filesystem underneath our output dir is a FUSE filesystem backed by HTTP blob storage, so it is most likely not POSIX-compliant. The mtimes it reports are probably wrong, or perhaps simply 0.
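
To illustrate the failure mode, here is a simplified reconstruction of the rotation logic, not the actual Trainer code (the function names `sorted_checkpoints` and `rotate_checkpoints` are illustrative). When sorting by mtime, the oldest entries beyond `save_total_limit` are deleted; if the FUSE backend reports the same (or zero) mtime for every directory, the sort order is effectively arbitrary, and the freshly saved checkpoint can land at the front of the list and be deleted:

```python
import os
import re

def sorted_checkpoints(output_dir, use_mtime=True):
    # Collect checkpoint-* directories, then order them either by
    # filesystem mtime or by the step number in the directory name.
    paths = [
        os.path.join(output_dir, d)
        for d in os.listdir(output_dir)
        if re.fullmatch(r"checkpoint-\d+", d)
    ]
    if use_mtime:
        # On a non-POSIX FUSE mount, getmtime may return 0 (or some
        # constant) for every directory, making this order arbitrary.
        return sorted(paths, key=os.path.getmtime)
    # Name-based ordering: extract the step number from the dir name.
    return sorted(paths, key=lambda p: int(re.search(r"checkpoint-(\d+)", p).group(1)))

def rotate_checkpoints(output_dir, save_total_limit=3, use_mtime=True):
    ckpts = sorted_checkpoints(output_dir, use_mtime=use_mtime)
    # Everything except the newest `save_total_limit` entries is deleted.
    # With garbage mtimes, the just-saved checkpoint can end up here.
    for stale in ckpts[:-save_total_limit]:
        print(f"Deleting older checkpoint [{stale}] due to save_total_limit")
```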

The code was first introduced by @sgugger in:

The fix should be straightforward: set use_mtime to False so rotation relies on the numeric part of the directory name (sketched below). But we'd like to understand why it is currently True.
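
For concreteness, this is the name-based ordering being proposed (a sketch, not the actual patch; the helper `checkpoint_step` is hypothetical):

```python
import re

def checkpoint_step(path: str) -> int:
    # Extract the numeric suffix from a "checkpoint-<step>" directory name;
    # this ordering does not depend on filesystem metadata at all.
    return int(re.search(r"checkpoint-(\d+)", path).group(1))

dirs = ["output/checkpoint-1000", "output/checkpoint-200", "output/checkpoint-800"]
print(sorted(dirs, key=checkpoint_step))
# -> ['output/checkpoint-200', 'output/checkpoint-800', 'output/checkpoint-1000']
```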

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

Just a simple fix; let me send a PR.

#28364 might not fix it, but it is related.