TTS: Address Already in Use Error When Training on 2 GPUs and Starting a New Job on the Remaining 2 GPUs

Describe the bug

Following the steps in the [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html), I started two training jobs:

GlowTTS: python3 -m trainer.distribute --script train.py --gpus "0,1"

Vocoder: python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"

The second command fails with this exception: Address already in use
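
Both jobs appear to bind the same default rendezvous port (54321 in the traceback below). A quick, non-TTS-specific way to confirm that the first run is still holding that port:

lsof -i :54321        # or: ss -ltnp | grep 54321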

To Reproduce

  1. Download the dataset per this [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html)

  2. Run GlowTTS: python3 -m trainer.distribute --script train.py --gpus "0,1"

  3. Run Vocoder: python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"
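
For context: the second job collides with the first one on the default rendezvous port. The --coqpit.distributed_url flag used later in this thread points a run at a different port; a sketch of applying it to the vocoder job (the port number 54322 is just an example):

python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3" --coqpit.distributed_url "tcp://localhost:54322"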

Expected behavior

No response

Logs

Traceback (most recent call last):
  File "train_vocoder.py", line 44, in <module>
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
  File "/apps/tts/Trainer/trainer/trainer.py", line 460, in __init__
    self.config.distributed_url,
  File "/apps/tts/Trainer/trainer/utils/distributed.py", line 62, in init_distributed
    group_name=group_name,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-SXM4-40GB",
            "NVIDIA A100-SXM4-40GB",
            "NVIDIA A100-SXM4-40GB",
            "NVIDIA A100-SXM4-40GB"
        ],
        "available": true,
        "version": "11.5"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu115",
        "TTS": "0.7.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#61~18.04.3-Ubuntu SMP Fri Oct 1 14:04:01 UTC 2021"
    }
}

Additional context

First command starts with this log:

['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=0']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=1']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=2']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=3']
 > Using CUDA: True
 > Number of GPUs: 4

Second command starts with this log:

['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=0']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=1']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=2']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=3']
 > Using CUDA: True
 > Number of GPUs: 4
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (12 by maintainers)

Most upvoted comments

@erogol @Dapwner

If GPU 0 is in your requested list of GPUs, it is used correctly, as with my first command. Otherwise, the first GPU id in your list is supposed to become the master, but there is a mismatch between the GPU ID (which the currently released code uses) and the rank (rank 0 is treated as the master). This also explains the test @Dapwner ran:

Interestingly, I have tried to run the code on our NVIDIA DGX Station (4 V100 GPUs) with the following GPUs: 0 + 1, 0 + 3, 1 + 3, 1 + 2, and it turned out that the error did not appear in the first two cases (when GPU 0 was included), but did occur in the latter two (when GPU 0 was not present).

Basically, the exception happens in distributed training whenever the GPU list starts at 1 or higher. The processes essentially run master-less: no model files are stored by a master, and then a FileNotFoundError is raised.

So, in my second example I used GPU list 5,6,7:

/apps/tts/TTS # nohup python3 -m trainer.distribute --script train_hifigan_vocoder_en.py --gpus "5,6,7" --coqpit.distributed_url "tcp://localhost:54322" </dev/null > hifigan_en.log 2>&1 &

The rank is printed correctly in the logs at first: GPU 5 gets rank 0. But once training actually runs on the GPUs, the currently released code hands PyTorch the GPU ID instead of the rank, so the run loses its master altogether, hence the ghost processes.

@lexkoro @erogol @Dapwner

I ended up debugging this more at runtime and tested a fix.

It turns out that the current Trainer distributed.py identifies the current device's rank by a GPU id taken from one of the env vars. I opened a PR for the Trainer project; the proposed behavior is to use torch.distributed.get_rank() instead.
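
For illustration only, a minimal sketch of the two behaviors; the function names are made up and this is not the actual Trainer code:

import torch.distributed as dist

def is_master_released(gpu_id: int) -> bool:
    # Sketch of the released behavior described above: the GPU id doubles as
    # the rank, so with --gpus "5,6,7" the values are 5, 6 and 7 and no
    # process ever qualifies as master (and no checkpoints get written).
    return gpu_id == 0

def is_master_proposed() -> bool:
    # Behavior proposed in the PR: ask the initialized process group, which
    # numbers processes 0..world_size-1 regardless of which GPUs were picked.
    return dist.get_rank() == 0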

Thank you, @erogol. Is it the distributed_url parameter?