MiniGPT-4: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
When I run this command:
torchrun --nproc-per-node 1 --master_port 25641 train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
this error occurs; how can I fix it?
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 81571) of binary: /home/tiger/miniconda3/envs/minigpt4/bin/python
Traceback (most recent call last):
File "/home/tiger/miniconda3/envs/minigpt4/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-05-19_16:43:27
host : n136-117-136.byted.org
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 81571)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
========================================================================================================================
The error you mentioned, torch.distributed.elastic.multiprocessing.errors.ChildFailedError, typically occurs when one of the child processes launched by torchrun hits an error and fails to run properly. It is difficult to pinpoint the exact cause from this message alone, but here are a few possible reasons and solutions to consider:
Resource allocation: Ensure that your system has enough resources (e.g., CPU, GPU, memory) to accommodate the requested number of child processes.
Data or code issues: Check if there are any data-related issues, such as corrupted or incompatible data. Also, review your code for any potential issues that could cause errors during training. Make sure your code is compatible with the version of PyTorch and other dependencies you are using.
Debugging the child process: Try to gather more information about the error in the child process. You can modify your code to catch and print the specific error message or traceback from the failed child process (a minimal sketch follows this list); this will help you narrow down the issue and provide more context for troubleshooting.
Updating PyTorch and dependencies: Make sure you are using the latest version of PyTorch and related dependencies. Check for any updates or bug fixes that may address the issue you’re facing. It’s also a good practice to ensure that all the dependencies in your environment are compatible with each other.
Check for known issues or bugs: Search online forums, issue trackers, or the official PyTorch documentation for any known issues related to the torch.distributed.elastic.multiprocessing module. It’s possible that the error you’re encountering is a known issue with an existing solution or workaround.
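For the debugging point above, the PyTorch elastic docs linked in the original traceback describe decorating the training entry point with @record so the child's real traceback is written to an error file instead of showing error_file: <N/A>. A minimal sketch, where the body of main() is a placeholder and not the actual MiniGPT-4 training code:

# Minimal sketch: wrap the script's entry point with @record so that when the
# child process raises, torch.distributed.elastic records the full traceback
# instead of reporting "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # Placeholder for the real training logic (e.g. parsing --cfg-path and
    # running the training loop in train.py).
    raise RuntimeError("example failure so the error file gets populated")

if __name__ == "__main__":
    main()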
Hi, I am having the same error while trying to train TrOCR on a multi-GPU, single-node setup. My problem is not RAM, as I have 1.8 TB of memory available, but I still face this error. I would also like to point out that the error in the quoted reply is not the same as the original one: the exit code there is -9, as opposed to 1 in the original. I am also getting -9 in my case and cannot find any reason for it. The error is thrown randomly at the start of some epoch. Please help me with any possible solutions if you can.
I found the same issue on the Hugging Face forums; it was caused by insufficient RAM. https://discuss.huggingface.co/t/torch-distributed-elastic-multiprocessing-errors-childfailederror/28242
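If insufficient RAM is the suspect (an exit code of -9 means the process was killed with SIGKILL, which on Linux is typically the OOM killer), one way to confirm is to log each rank's resident memory during training and watch whether it climbs until the kill. A rough sketch using the third-party psutil package, which is an assumption here and not a MiniGPT-4 dependency:

import os
import psutil

def log_memory(step):
    # Resident set size of this process, in GiB.
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    rank = os.environ.get("RANK", "0")
    print(f"[rank {rank}] step {step}: RSS = {rss_gib:.2f} GiB", flush=True)

# Call log_memory(step) every few hundred steps inside the training loop;
# a steadily climbing RSS right before the crash points at the OOM killer.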
Same error:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./run.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2023-09-22_18:11:04
host : xxx
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 1775061)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 1775061
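An exit code of -11 means the process received SIGSEGV. Enabling Python's built-in faulthandler at the very top of the training script will at least dump the Python-level traceback when the segfault happens, which helps tell a crash inside a native extension apart from one elsewhere. A minimal sketch:

import faulthandler
import sys

# Dump the traceback of every thread to stderr if the interpreter receives a
# fatal signal such as SIGSEGV (exit code -11 in the elastic failure report).
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... rest of the training script ...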
same error