accelerate: No effect from InitProcessGroupKwargs timeout
System Info
- `Accelerate` version: 0.23.0
- Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
1. Follow the instructions from https://github.com/huggingface/alignment-handbook/tree/main/scripts and install the environment to run LoRA SFT training.
2. Change the timeout to 3 hours:
   accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
   and run the training.
3. Get a crash due to the timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[2023-12-09 08:46:08,664] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54784 closing signal SIGTERM
[2023-12-09 08:46:11,834] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 54785) of binary: /home/evgenii/.conda/envs/handbook/bin/python
Traceback (most recent call last):
File "/home/evgenii/.conda/envs/handbook/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 971, in launch_command
deepspeed_launcher(args)
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
distrib_run.run(args)
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
scripts/run_sft.py FAILED
Note that the timeout is still 1800 seconds (see also https://github.com/huggingface/alignment-handbook/issues/59).
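For reference, a minimal self-contained version of the timeout change from step 2; the imports and the variable name are added here for clarity, while the call itself is exactly what the reproduction uses:

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Request a 3-hour timeout for collectives instead of the 30-minute NCCL default.
kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))
accelerator = Accelerator(kwargs_handlers=[kwargs])
```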
Expected behavior
The timeout is increased, and there is no crash.
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 15 (1 by maintainers)
@muellerzr
`NCCL_ASYNC_ERROR_HANDLING` is set to `1` (by some of the libraries I use, I guess? I didn't set it). In fact, the function changed in this branch is called only twice in my code, both from `training_args` (https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L1871-L1873): once with `self.backend=nccl` and once with `self.backend=None`. So `InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))` can't even influence it? I've also tried to set `--ddp_timeout=10800` (this is what gets passed from `training_args`) in my command, and it is passed to this function only in the second call; I still get the 30-minute timeout in my code.

I see the exact issue: it's due to `SFTTrainer`, and it is not an accelerate issue (though it is accelerate adjacent). Can you open an issue in `trl` for this and ping me?

I don't have access to the machine currently. I'll update you when I can run stuff on it. I don't think there was any additional information there. From the logs, it's failing after uploading the checkpoint to the hub, i.e. somewhere around https://github.com/huggingface/alignment-handbook/blob/ff618a4d13a2c77cf97479fac8af2c576619062a/scripts/run_sft.py#L203-L205
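As a point of comparison, a minimal sketch of the `--ddp_timeout` route mentioned above; `ddp_timeout` is a real `TrainingArguments` field, while `output_dir` here is only a placeholder and not from the original report:

```python
from transformers import TrainingArguments

# Equivalent to passing --ddp_timeout=10800 on the command line: training_args
# converts this value to a timedelta when the process group is initialized.
training_args = TrainingArguments(
    output_dir="out",   # placeholder, not from the original report
    ddp_timeout=10800,  # 3 hours instead of the 1800-second default
)
```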