pytorch-lightning: Help with understanding unknown 'c10::Error' thrown during DDP training

πŸ› Bug

This is not going to be a good and concise description, since I have no way of reproducing the error with 100% certainty. However, more and more I get these types of c10::Error instances thrown during training, both during the training phase and during the validation phase. I am running DDP on a single-node with 20 CPUs and 4 GPUs. The error is not located to any specific node.

I don’t think it’s necessarily a PyTorch Lightning issue, so I am unsure whether it is appropriate here, but maybe someone has had similar experiences?

terminate called after throwing an instance of 'c10::Error' 
  what():  CUDA error: initialization error 
Exception raised from insert_events at /opt/conda/conda-bld/pytorch_1595629427478/work/c10/cuda/CUDACachingAllocator.cpp:717 (most recent call first): 
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7fcc65f5177d in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libc10.so) 
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1130 (0x7fcc661a2370 in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) 
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fcc65f3db1d in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libc10.so) 
frame #3: <unknown function> + 0x53f29a (0x7fcca3ab529a in /home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/lib/libtorch_python.so) 
frame #4: <unknown function> + 0x1241eb (0x55fadc0131eb in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #5: <unknown function> + 0x1243e6 (0x55fadc0133e6 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #6: _PyObject_GC_Malloc + 0x84 (0x55fadc0134d4 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #7: _PyObject_GC_New + 0xe (0x55fadc0134fe in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #8: PyFunction_NewWithQualName + 0x2e (0x55fadc0525be in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #9: _PyEval_EvalFrameDefault + 0x1918 (0x55fadc0bd9b8 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #10: _PyEval_EvalCodeWithName + 0xc30 (0x55fadc004160 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #11: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #12: _PyEval_EvalFrameDefault + 0x416 (0x55fadc0bc4b6 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #13: _PyEval_EvalCodeWithName + 0x2f9 (0x55fadc003829 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #14: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #15: _PyEval_EvalFrameDefault + 0x416 (0x55fadc0bc4b6 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #16: _PyEval_EvalCodeWithName + 0xc30 (0x55fadc004160 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #17: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #18: _PyEval_EvalFrameDefault + 0x4a89 (0x55fadc0c0b29 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #19: _PyEval_EvalCodeWithName + 0xc30 (0x55fadc004160 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #20: _PyFunction_FastCallKeywords + 0x387 (0x55fadc054107 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #21: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #22: _PyFunction_FastCallDict + 0x10b (0x55fadc00485b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #23: _PyEval_EvalFrameDefault + 0x1e4a (0x55fadc0bdeea in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #25: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #26: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #27: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #28: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #29: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #30: _PyFunction_FastCallDict + 0x10b (0x55fadc00485b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #31: _PyObject_Call_Prepend + 0x63 (0x55fadc0234d3 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #32: <unknown function> + 0x16bd5a (0x55fadc05ad5a in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #33: _PyObject_FastCallKeywords + 0x128 (0x55fadc05b968 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #34: _PyEval_EvalFrameDefault + 0x49e6 (0x55fadc0c0a86 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #35: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #36: _PyEval_EvalFrameDefault + 0x4a89 (0x55fadc0c0b29 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #37: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #38: _PyEval_EvalFrameDefault + 0x4a89 (0x55fadc0c0b29 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #39: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #40: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #41: _PyEval_EvalCodeWithName + 0x5da (0x55fadc003b0a in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #42: _PyFunction_FastCallDict + 0x1d5 (0x55fadc004925 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #43: _PyObject_Call_Prepend + 0x63 (0x55fadc0234d3 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #44: <unknown function> + 0x16bd5a (0x55fadc05ad5a in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #45: _PyObject_FastCallKeywords + 0x128 (0x55fadc05b968 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #46: _PyEval_EvalFrameDefault + 0x49e6 (0x55fadc0c0a86 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #47: _PyFunction_FastCallDict + 0x10b (0x55fadc00485b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #48: <unknown function> + 0x16349a (0x55fadc05249a in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #49: PyObject_GetIter + 0x16 (0x55fadc002d76 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #50: <unknown function> + 0x1b91dc (0x55fadc0a81dc in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #51: _PyMethodDef_RawFastCallKeywords + 0x1e4 (0x55fadc054884 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #52: _PyCFunction_FastCallKeywords + 0x21 (0x55fadc054a31 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #53: _PyEval_EvalFrameDefault + 0x46f5 (0x55fadc0c0795 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #54: <unknown function> + 0x16cfb4 (0x55fadc05bfb4 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #55: <unknown function> + 0x177b6f (0x55fadc066b6f in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #56: <unknown function> + 0x16d575 (0x55fadc05c575 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #57: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x55fadc054789 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #58: _PyCFunction_FastCallKeywords + 0x21 (0x55fadc054a31 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #59: _PyEval_EvalFrameDefault + 0x46f5 (0x55fadc0c0795 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #60: <unknown function> + 0x16cfb4 (0x55fadc05bfb4 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #61: _PyEval_EvalFrameDefault + 0xa46 (0x55fadc0bcae6 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #62: _PyFunction_FastCallKeywords + 0xfb (0x55fadc053e7b in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
frame #63: _PyEval_EvalFrameDefault + 0x6a0 (0x55fadc0bc740 in /home/groups/mignot/miniconda3/envs/nml/bin/python) 
Traceback (most recent call last): 
  File "/home/users/alexno/sleep-staging/train.py", line 194, in <module> 
    run_training() 
  File "/home/users/alexno/sleep-staging/train.py", line 147, in run_training 
    trainer.fit(model, dm) 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit 
    results = self.accelerator_backend.train() 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train 
    results = self.ddp_train(process_idx=self.task_idx, model=model) 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train 
    results = self.train_or_test() 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test 
    results = self.trainer.train() 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 483, in train 
    self.train_loop.run_training_epoch() 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch 
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx) 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 687, in run_training_batch 
    self.log_training_step_metrics(opt_closure_result, batch_callback_metrics, batch_log_metrics) 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 501, in log_training_step_metrics 
    self.trainer.logger_connector.add_progress_bar_metrics(step_pbar_metrics) 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 105, in add_progress_bar_metrics 
    v = v.item() 
  File "/home/groups/mignot/miniconda3/envs/nml/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler 
    _error_if_any_worker_fails() 
RuntimeError: DataLoader worker (pid 51292) is killed by signal: Aborted.

Environment

* CUDA:
        - GPU:
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.2
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.0.2
        - tensorboard:       2.3.0
        - tqdm:              4.51.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - 
        - processor:         x86_64
        - python:            3.7.9
        - version:           #1 SMP Mon Jul 29 17:46:05 UTC 2019

Additional context

Please let me know if this is more suited towards the PyTorch team.

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 17 (2 by maintainers)

Most upvoted comments

Me too, but I was able to solve it by adding the rank_zero_only=True option mentioned in the comment here https://github.com/PyTorchLightning/pytorch-lightning/issues/8821#issuecomment-902402784 . I’m also using ddp, and it has been occurring exactly as described here since I updated to 1.4. (currently 1.4.2)

Seeing the same issue here with pytorch 1.9, cudatoolkit 11.0.3, pytorch-lightning 1.4.1

I have exactly same issue but only during validation and specifically with ddp and not with dp. PL version is 1.4.4

I had exactly the same error and removing the log solved this for me too. Does anyone know why this would be?