TTS: Address Already in Use Error When Training on 2 GPUs and Starting a New Job on the Remaining 2 GPUs
Describe the bug
Following the steps in the [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html), I run:
GlowTTS:
python3 -m trainer.distribute --script train.py --gpus "0,1"
Vocoder:
python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"
The second command fails with this exception: Address already in use
To Reproduce
- Download the dataset per this [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html).
- Run GlowTTS:
  python3 -m trainer.distribute --script train.py --gpus "0,1"
- Run Vocoder:
  python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"
Expected behavior
No response
Logs
Traceback (most recent call last):
File "train_vocoder.py", line 44, in <module>
TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
File "/apps/tts/Trainer/trainer/trainer.py", line 460, in __init__
self.config.distributed_url,
File "/apps/tts/Trainer/trainer/utils/distributed.py", line 62, in init_distributed
group_name=group_name,
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
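For context, here is a minimal standalone sketch (my own illustration, not the Trainer's code) of why the second job dies at startup: both `trainer.distribute` runs rendezvous on the same address (port 54321, per the error above), and only one process can host the c10d TCPStore server on that port.

```python
from datetime import timedelta
from torch.distributed import TCPStore

# First "job": hosts the c10d rendezvous server on port 54321 and succeeds.
first = TCPStore("localhost", 54321, world_size=1, is_master=True,
                 timeout=timedelta(seconds=10), wait_for_workers=False)

# Second "job": pointed at the same distributed_url, it tries to bind the same
# port and is expected to fail with
#   RuntimeError: ... (errno: 98 - Address already in use)
second = TCPStore("localhost", 54321, world_size=1, is_master=True,
                  timeout=timedelta(seconds=10), wait_for_workers=False)
```

As shown later in the thread, giving the second job its own port via `--coqpit.distributed_url "tcp://localhost:54322"` avoids the clash.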
Environment
{
"CUDA": {
"GPU": [
"NVIDIA A100-SXM4-40GB",
"NVIDIA A100-SXM4-40GB",
"NVIDIA A100-SXM4-40GB",
"NVIDIA A100-SXM4-40GB"
],
"available": true,
"version": "11.5"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.11.0+cu115",
"TTS": "0.7.0",
"numpy": "1.21.6"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.7.13",
"version": "#61~18.04.3-Ubuntu SMP Fri Oct 1 14:04:01 UTC 2021"
}
}
Additional context
First command starts with this log:
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=0']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=1']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=2']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=3']
> Using CUDA: True
> Number of GPUs: 4
Second command starts with this log:
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=0']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=1']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=2']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=3']
> Using CUDA: True
> Number of GPUs: 4
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
About this issue
- State: closed
- Created 2 years ago
- Comments: 16 (12 by maintainers)
@erogol @Dapwner
If you request GPU 0 in your list of GPUs, everything works correctly, as with my first command. Otherwise, the process on the first GPU id in your list is the one that becomes the master, and there is a mismatch between the GPU ID (which the currently released code uses) and the rank (rank 0 is treated as the master). This also explains the test @Dapwner ran: basically, the exception happens in distributed training whenever the GPU list starts at 1 or higher. Those runs are essentially master-less: no model files are ever stored by the master, and then this exception - FileNotFoundError - happens.
So, in my second example I used the GPU list 5,6,7:
/apps/tts/TTS # nohup python3 -m trainer.distribute --script train_hifigan_vocoder_en.py --gpus "5,6,7" --coqpit.distributed_url "tcp://localhost:54322" </dev/null > hifigan_en.log 2>&1 &
The rank is printed correctly in the logs at startup - GPU 5 gets rank 0. But later at runtime, once training is running on the GPUs, the current released code hands PyTorch the GPU ID instead of the rank, so the master is lost altogether, hence the ghost processes.
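To make the GPU-ID/rank mismatch concrete, here is a small illustrative sketch (hypothetical names, not the actual Trainer code):

```python
# With --gpus "5,6,7", the process on GPU 5 has rank 0 and should act as the master.
requested_gpus = [5, 6, 7]

for rank, gpu_id in enumerate(requested_gpus):
    # GPU-id based check (the behavior described above): with GPUs 5,6,7 no process
    # ever qualifies as master, so nothing is saved and FileNotFoundError follows.
    gpu_id_says_master = (gpu_id == 0)

    # Rank-based check: rank 0 is the master no matter which physical GPUs are used.
    rank_says_master = (rank == 0)

    print(f"rank={rank} gpu={gpu_id} "
          f"gpu_id_says_master={gpu_id_says_master} rank_says_master={rank_says_master}")
```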
@lexkoro @erogol @Dapwner
I ended up debugging this further at runtime and tested a fix. It turns out the current Trainer distributed.py identifies the current device rank from a GPU id taken from one of the environment variables. I opened a PR for the Trainer project; the proposed behavior uses torch.distributed.get_rank() instead.
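A minimal sketch of that proposed behavior as I read it (the helper name is illustrative, not the PR's actual code):

```python
import torch.distributed as dist

def current_rank() -> int:
    # Take the rank from the initialized process group instead of a GPU-id
    # environment variable, so rank 0 remains the master even when the GPU
    # list starts above 0.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0  # non-distributed / single-process fallback
```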
Thank you, @erogol. Is it the distributed_url parameter?