transformers: Save model checkpoint error during multi-GPU training
System Info
- transformers version: 4.36.0.dev0
- Platform: Linux-6.2.0-1017-azure-x86_64-with-glibc2.35
- Python version: 3.10.13
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.0
- Accelerate version: 0.24.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help?
@muellerzr and @pacman100 I found that when launching the example trainer code on multiple nodes, it raises a FileNotFoundError when saving the checkpoint. After debugging, I think the cause is in trainer.py L2382:
if staging_output_dir != output_dir:
    os.rename(staging_output_dir, output_dir)
When one process renames the folder, the other processes encounter the FileNotFoundError. Maybe the code could be modified like this to avoid the error:
if self.args.should_save and staging_output_dir != output_dir:
    os.rename(staging_output_dir, output_dir)
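To see why this fails, the race can be reproduced outside the Trainer. The sketch below is hypothetical (plain Python multiprocessing, not transformers code): several processes all attempt the unguarded rename, and every process after the winner raises FileNotFoundError, just like the ranks in a multi-node run.

import multiprocessing as mp
import os
import tempfile


def promote(staging_dir: str, final_dir: str, rank: int) -> None:
    # Mirrors the unguarded logic quoted above: every rank tries the rename.
    try:
        if staging_dir != final_dir:
            os.rename(staging_dir, final_dir)
        print(f"rank {rank}: rename succeeded")
    except FileNotFoundError as err:
        # Every rank except the one that won the race ends up here.
        print(f"rank {rank}: {err}")


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    staging = os.path.join(root, "tmp-checkpoint-49")
    final = os.path.join(root, "checkpoint-49")
    os.makedirs(staging)

    procs = [mp.Process(target=promote, args=(staging, final, r)) for r in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

Guarding the rename with self.args.should_save, as suggested above, removes the competing callers, but the other ranks then need a synchronization point before they rely on the renamed folder.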
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Run the MAE training code from the example folder.
Expected behavior
The checkpoint should be saved without raising a FileNotFoundError.
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Reactions: 1
- Comments: 47 (15 by maintainers)
Hi, @snowyday, @tblattner, and @muellerzr. I think main_process_first may be broken. I ran the trainer with 2 nodes x 8 V100 GPUs and DeepSpeed. When I turned on log_level=debug, I found that only one process entered the waiting mode, while all the other processes tried to save the checkpoint. The log from the process that waited:
A similar error has now occurred at L.2561 (89c6481).
I am experiencing this issue in a distributed training environment that utilizes a shared file system across 16 nodes, with each node equipped with 4 GPUs. I’m deploying the training using DeepSpeed’s OpenMPI launcher.
In this setup, I have observed scenarios where the cleanup command shutil.rmtree(staging_output_dir) at L.2561 in the code fails to execute due to the condition self.is_local_process_zero() not being met on the slave nodes. This is intended to “Clean up the remaining staging checkpoint folders on other nodes,” but it does not always work as expected.
[89c6481]
Although os.path.exists(staging_output_dir) is used for verification, it seems that staging_output_dir no longer exists by the time shutil.rmtree(staging_output_dir) is executed. It looks like a try-except block needs to be implemented here as well.

I encountered a similar error when using the trainer with DeepSpeed. The error occurs at the exact moment after if os.path.exists(staging_output_dir): is evaluated and another process finishes renaming. I had no other choice, so I resorted to using a try block to get around it.
transformers-4.37.0.dev0
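For reference, here is a minimal sketch of the kind of try/except workaround described in the two comments above. The helper names are illustrative only and this is not the upstream fix; it simply tolerates the staging folder disappearing between the existence check and the filesystem call, for both the rename path and the cleanup path.

import os
import shutil


def promote_staging_checkpoint(staging_output_dir: str, output_dir: str) -> None:
    # Rename path: another rank may rename the staging folder between the
    # os.path.exists() check and our own os.rename() call.
    if staging_output_dir == output_dir:
        return
    try:
        if os.path.exists(staging_output_dir):
            os.rename(staging_output_dir, output_dir)
    except FileNotFoundError:
        # Another process won the race; only re-raise if the final
        # checkpoint folder is genuinely missing.
        if not os.path.exists(output_dir):
            raise


def cleanup_staging_checkpoint(staging_output_dir: str) -> None:
    # Cleanup path on the other nodes: the folder may already have been
    # renamed or removed by the time shutil.rmtree() runs.
    try:
        shutil.rmtree(staging_output_dir)
    except FileNotFoundError:
        pass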
I also met the same problem in 4.38.2. Using 4.37.2 fixes this issue.
I’m not sure if it fails or not. From what I understand, the network attached storage node might not actually complete the operation before the next process comes to check if the path exists. It will complete, just not in the timeframe allowed (sometimes). But that outlines the core issue here.
My suggestion is to use something like this:

if self.args.distributed_state.is_local_main_process if self.args.save_on_each_node else self.args.distributed_state.is_main_process:

Then self.args.distributed_state.wait_for_everyone() to synchronize everyone afterwards. This would only use the main process if save_on_each_node is false, and otherwise only the local main processes, which I think is the intended behavior. The part I'm not sure of is whether the renamed file is used later downstream; that could introduce a race condition there…
It would be nice if we could have an fsync for the shared filesystem to ensure the rename actually completed.
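Putting the two suggestions together, a rough sketch could look like the following. This is an assumption about how it might be wired up, not a merged patch: distributed_state stands in for self.args.distributed_state (an accelerate PartialState), and the parent-directory fsync is a Linux/POSIX technique whose effect depends on the filesystem and mount options, as noted elsewhere in this thread.

import os

from accelerate import PartialState


def rename_and_sync(staging_output_dir: str, output_dir: str,
                    distributed_state: PartialState, save_on_each_node: bool) -> None:
    # Only one process per node (or one process overall) performs the rename,
    # matching the condition suggested above.
    should_rename = (
        distributed_state.is_local_main_process
        if save_on_each_node
        else distributed_state.is_main_process
    )
    if should_rename and staging_output_dir != output_dir:
        os.rename(staging_output_dir, output_dir)
        # Flush the parent directory entry so the rename is durable before
        # the other ranks continue (NFS and similar mounts may still lag).
        parent_fd = os.open(os.path.dirname(output_dir) or ".", os.O_RDONLY)
        try:
            os.fsync(parent_fd)
        finally:
            os.close(parent_fd)
    # Synchronize everyone afterwards so no rank touches the checkpoint path
    # before the rename has been issued.
    distributed_state.wait_for_everyone()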
I’ve checked main_process_first using the code snippet below:
Number of nodes: 3
Processes per node (GPUs): 4
Total: 12 processes
The node rankings appear to be correctly allocated, with node rank 0 going to node 1, node rank 4 to node 2, and node rank 8 to node 3; however, there are inaccuracies with the global rankings. In the context of a shared filesystem, if we proceed without waiting for the result from global rank 0, it could cause conflicts during the os.rename operation.
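The original snippet is not reproduced above, but a check along these lines can show which ranks enter main_process_first first. This sketch is an assumption, not the commenter's code: it presumes a torch.distributed launch with RANK/LOCAL_RANK set (e.g. torchrun or the DeepSpeed launcher) and a GPU per process for the nccl backend.

import os

import torch.distributed as dist
from transformers import TrainingArguments

dist.init_process_group(backend="nccl")  # use "gloo" on a CPU-only test box
args = TrainingArguments(output_dir="out", save_on_each_node=False)

global_rank = dist.get_rank()
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# With local=False, only global rank 0 should run the body first while every
# other rank waits at the barrier; the print order shows whether that holds.
with args.main_process_first(local=False, desc="rank-order check"):
    print(f"global_rank={global_rank} local_rank={local_rank} entered main_process_first")

If ranks other than global rank 0 print before rank 0 finishes, that matches the incorrect global ranking described above.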
@muellerzr @thundergolfer I still get the same issue when saving a checkpoint with the latest version of transformers, 4.36, and even with 4.37.0.dev0. I used three workers, each with two GPUs. I tried fine-tuning with checkpoints saved on shared storage and on non-shared storage, and in both cases I still got the same error:

FileNotFoundError: [Errno 2] No such file or directory: ‘model/tmp-checkpoint-49’ -> ‘model/checkpoint-49’

although model/checkpoint-49 is already created!

I’ve been using a try-except approach to bypass the issue, and it’s been working well for me. However, as xk-huang mentioned, it seems that the root cause is that self.args.main_process_first is not handling multi-node setups properly.
Any solutions? Facing the same issue on multi-node training using DeepSpeed.
Hi @snowyday - could you open a new issue, including all these details and linking to this issue? This way we can better track what’s been addressed and what’s a new issue
I can get a start on a PR. Not sure what the best methodology for running fsync on a rename operation is, but I’ll give it a shot.
FYI, we tested and also experienced this without a shared FS (accelerate/pdsh, simple two-node setup). Also, if we rely on a full fsync implementation in the checkpoint folder, it might be good to explicitly call that out in the docs, as not all filesystems/mount options will fail hard on “fake” fsync calls.

@thundergolfer
Rank 0 should pop up first, and the others should hang tight until the renaming wraps up. I should set args.save_on_each_node=False:

I encountered this issue with the trainer with the following command line. This was after recently updating transformers with pip install transformers --upgrade:
--save_strategy epoch --save_total_limit 1
transformers==4.36.2
Edit: One thing to note: this was with 2 nodes with 8x A100s per node. Looking at the code around the error, I have a feeling this was because I may have used local=True with main_process_first. Going to try disabling save_on_each_node.
Edit edit: Looks like it’s still not working even when setting save_on_each_node to false.
Here is the full command, launched from a Slurm sbatch job:
@hahmad2008 can you try doing either pip install transformers -U or reinstalling from git? From the line numbers, it’s not adding up that you’re using a version that includes the fix.