transformers: Socket Timeout when using DDP

System Info

- `transformers` version: 4.17.0.dev0
- Platform: Linux-4.15.0-176-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyTorch version (GPU?): 1.8.2 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes (run_summarization.py script)

Who can help?

@patrickvonplaten @patil-suraj

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I’m constructing a dataset (in .parquet format) that is similar to the JSON format but has additional fields used to build a graph for each example in the dataset. When I train the model in DDP (distributed) mode, I get RuntimeError: Socket Timeout. Here is the full stack trace:

Running tokenizer on train dataset #0–#9:  21–24% | 6–7/29 [~26–30 min elapsed, ~280–300s/ba] (progress bars were interleaved with the traceback below)
Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 987, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 791, in main
    with training_args.main_process_first(desc="train dataset map pre-processing"):
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/transformers/training_args.py", line 1264, in main_process_first
    torch.distributed.barrier()
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2420, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: Socket Timeout
Killing subprocess 62044
Killing subprocess 62045
Traceback (most recent call last):
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/sajad/anaconda3/envs/myenv-py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main

Expected behavior

Running the preprocessing function on each training split.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 17 (3 by maintainers)

Most upvoted comments

Looks like the process gets killed by the 30-minute default timeout of torch.distributed.launch/run? (https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group)
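
For reference, a minimal sketch of the underlying PyTorch knob, assuming the process group is initialized manually with the env:// init method (torch.distributed.launch/run provides the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables); the example scripts set this up through the Trainer instead:

```python
# Sketch only: raise the process-group timeout above the 30-minute default.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",                       # assumption: NCCL backend for multi-GPU training
    timeout=datetime.timedelta(hours=3),  # assumption: 3 hours is enough for the tokenization
)
```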

I had the same problem: my job would be killed when using DDP because the mapping/tokenization took too long.

You are not using the `ddp_timeout` training argument to set a value higher than 30 minutes, so if you have a big dataset to preprocess, you hit this error. Set a larger value to avoid it, or preprocess your dataset in a non-distributed fashion first.
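
For illustration, a minimal sketch of raising that timeout programmatically, assuming a transformers version recent enough to expose `ddp_timeout` (it takes seconds, and the output directory below is a hypothetical placeholder); with the example scripts, the CLI equivalent would be passing `--ddp_timeout 7200`:

```python
# Sketch: raise the DDP barrier timeout from the 1800-second default to 2 hours.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="out",   # hypothetical output directory
    ddp_timeout=7200,   # seconds; default is 1800 (30 minutes)
)
```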

I have a similar task, and my torch.distributed.launch run gets interrupted due to the 30-minute timeout.

In my case, when I run the script normally with python run.py, the preprocessing gets cached, but when I run it under torch.distributed.launch the cache isn’t reused, so the entire preprocessing step runs again and times out.
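
One way to apply the “preprocess in a non-distributed fashion” advice above is to build the tokenized cache once in a single process and only launch the DDP job afterwards. A rough sketch with hypothetical file and column names; cache reuse also requires the later run to call .map() with identical preprocessing arguments and tokenizer:

```python
# Sketch: pre-tokenize in a single process so the datasets Arrow cache exists
# before the distributed run has to wait on the main_process_first barrier.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumption: any seq2seq checkpoint
raw = load_dataset("parquet", data_files={"train": "train.parquet"})  # hypothetical path

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)  # hypothetical column
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch["summary"], max_length=128, truncation=True)      # hypothetical column
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# .map() writes its result to the datasets cache; a later DDP run with the same
# preprocessing reloads the cached files instead of recomputing them under a barrier.
raw["train"].map(preprocess, batched=True, remove_columns=raw["train"].column_names)
```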

I hit the same error. I tried to pre-train with a 25GB Korean corpus using examples/run_clm.py. I haven’t tested it in an environment without DDP yet, but I think this problem is related to corpus size, because there was no problem with a small corpus. The process was killed at about 30,000–32,000 of 85,249. The tokenizer type is byte-level BPE.

I succeeded in pre-training without DDP. The tokenizer run finished fine, and after tokenizing I could reuse the cached data with DDP. I don’t know the root cause yet, but the problem seems to be related to DDP.

My English is not that great. Nevertheless I want to solve this problem.