pytorch-lightning: Help with understanding unknown 'c10::Error' thrown during DDP training
π Bug
This is not going to be a good and concise description, since I have no way of reproducing the error with 100% certainty. However, more and more I get these types of c10::Error instances thrown during training, both during the training phase and during the validation phase. I am running DDP on a single-node with 20 CPUs and 4 GPUs. The error is not located to any specific node.
I donβt think itβs necessarily a PyTorch Lightning issue, so I am unsure whether it is appropriate here, but maybe someone has had similar experiences?
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: initialization error
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fcc65f5177d in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7fcc661a2370 in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fcc65f3db1d in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x53f29a (0x7fcca3ab529a in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1241eb (0x55fadc0131eb in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #5: <unknown function> + 0x1243e6 (0x55fadc0133e6 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #6: _PyObject_GC_Malloc + 0x84 (0x55fadc0134d4 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #7: _PyObject_GC_New + 0xe (0x55fadc0134fe in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #8: PyFunction_NewWithQualName + 0x2e (0x55fadc0525be in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #9: _PyEval_EvalFrameDefault + 0x1918 (0x55fadc0bd9b8 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #10: _PyEval_EvalCodeWithName + 0xc30 (0x55fadc004160 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #11: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x416 (0x55fadc0bc4b6 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #13: _PyEval_EvalCodeWithName + 0x2f9 (0x55fadc003829 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #14: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #15: _PyEval_EvalFrameDefault + 0x416 (0x55fadc0bc4b6 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #16: _PyEval_EvalCodeWithName + 0xc30 (0x55fadc004160 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x4a89 (0x55fadc0c0b29 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #19: _PyEval_EvalCodeWithName + 0xc30 (0x55fadc004160 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #22: _PyFunction_FastCallDict + 0x10b (0x55fadc00485b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x1e4a (0x55fadc0bdeea in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #26: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #28: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #30: _PyFunction_FastCallDict + 0x10b (0x55fadc00485b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #31: _PyObject_Call_Prepend + 0x63 (0x55fadc0234d3 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #32: <unknown function> + 0x16bd5a (0x55fadc05ad5a in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #33: _PyObject_FastCallKeywords + 0x128 (0x55fadc05b968 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #34: _PyEval_EvalFrameDefault + 0x49e6 (0x55fadc0c0a86 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #35: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x4a89 (0x55fadc0c0b29 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #38: _PyEval_EvalFrameDefault + 0x4a89 (0x55fadc0c0b29 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #39: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #41: _PyEval_EvalCodeWithName + 0x5da (0x55fadc003b0a in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #42: _PyFunction_FastCallDict + 0x1d5 (0x55fadc004925 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #43: _PyObject_Call_Prepend + 0x63 (0x55fadc0234d3 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #44: <unknown function> + 0x16bd5a (0x55fadc05ad5a in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #45: _PyObject_FastCallKeywords + 0x128 (0x55fadc05b968 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x49e6 (0x55fadc0c0a86 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #47: _PyFunction_FastCallDict + 0x10b (0x55fadc00485b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #48: <unknown function> + 0x16349a (0x55fadc05249a in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #49: PyObject_GetIter + 0x16 (0x55fadc002d76 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #50: <unknown function> + 0x1b91dc (0x55fadc0a81dc in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #51: _PyMethodDef_RawFastCallKeywords + 0x1e4 (0x55fadc054884 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #52: _PyCFunction_FastCallKeywords + 0x21 (0x55fadc054a31 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x46f5 (0x55fadc0c0795 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #54: <unknown function> + 0x16cfb4 (0x55fadc05bfb4 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #55: <unknown function> + 0x177b6f (0x55fadc066b6f in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #56: <unknown function> + 0x16d575 (0x55fadc05c575 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #57: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x55fadc054789 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #58: _PyCFunction_FastCallKeywords + 0x21 (0x55fadc054a31 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #59: _PyEval_EvalFrameDefault + 0x46f5 (0x55fadc0c0795 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #60: <unknown function> + 0x16cfb4 (0x55fadc05bfb4 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #61: _PyEval_EvalFrameDefault + 0xa46 (0x55fadc0bcae6 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #62: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python)
frame #63: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python)
Traceback (most recent call last):
File "/home/users/alexno/sleep-staging/train.py", line 194, in <module>
run_training()
File "/home/users/alexno/sleep-staging/train.py", line 147, in run_training
trainer.fit(model, dm)
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train
results = self.train_or_test()
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 687, in run_training_batch
self.log_training_step_metrics(opt_closure_result, batch_callback_metrics, batch_log_metrics)
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 501, in log_training_step_metrics
self.trainer.logger_connector.add_progress_bar_metrics(step_pbar_metrics)
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 105, in add_progress_bar_metrics
v = v.item()
File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 51292) is killed by signal: Aborted.
Environment
* CUDA:
- GPU:
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- Tesla V100-SXM2-16GB
- available: True
- version: 10.2
* Packages:
- numpy: 1.19.2
- pyTorch_debug: False
- pyTorch_version: 1.6.0
- pytorch-lightning: 1.0.2
- tensorboard: 2.3.0
- tqdm: 4.51.0
* System:
- OS: Linux
- architecture:
- 64bit
-
- processor: x86_64
- python: 3.7.9
- version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
Additional context
Please let me know if this is more suited towards the PyTorch team.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17 (2 by maintainers)
Me too, but I was able to solve it by adding the
rank_zero_only=Trueoption mentioned in the comment here https://github.com/PyTorchLightning/pytorch-lightning/issues/8821#issuecomment-902402784 . Iβm also using ddp, and it has been occurring exactly as described here since I updated to 1.4. (currently 1.4.2)Seeing the same issue here with pytorch 1.9, cudatoolkit 11.0.3, pytorch-lightning 1.4.1
I have exactly same issue but only during validation and specifically with
ddpand not withdp. PL version is1.4.4I had exactly the same error and removing the log solved this for me too. Does anyone know why this would be?