mmdetection: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The bug has not been fixed in the latest version.

Describe the bug

Distributed training crashes at the first iteration with RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one (full traceback below).

Reproduction

  1. What command or script did you run? I renamed the config from faster_rcnn_r50_fpn_1x.py to element.py and ran:
CUDA_VISIBLE_DEVICES=1,2,3 ./tools/dist_train.sh configs/element.py 3 --autoscale-lr
  2. Did you make any modifications on the code or config? Did you understand what you have modified? Only num_classes and work_dir in the config.

  3. What dataset did you use? My own dataset, built in the same format as VOC.

Environment

(The environment details were attached as a screenshot.)

  1. Please run python mmdet/utils/collect_env.py to collect the necessary environment information and paste it here.

  2. You may add additional information that may be helpful for locating the problem, such as

    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f92f4501441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f92f4500d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f92f4de983c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7f92f4ddf2bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7f92f484acfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7f92f8173830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)

Traceback (most recent call last):
  File "./tools/train.py", line 142, in <module>
    main()
  File "./tools/train.py", line 138, in main
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 102, in train_detector
    meta=meta)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 171, in _dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 371, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 275, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 75, in batch_processor
    losses = model(**data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fcaf0f72441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fcaf0f71d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7fcaf185a83c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7fcaf18502bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fcaf12bbcfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7fcaf4be4830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)

^CTraceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 228, in main
    process.wait()
  File "/usr/lib/python3.6/subprocess.py", line 1457, in wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.6/subprocess.py", line 1404, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 33 (3 by maintainers)

Most upvoted comments

I met the same issue, but I solved it. The reason is that in my model class I define an FPN module with 5 levels of output feature maps in the `__init__` function, but in the `forward` function I only use 4 of them. When I use all of them, the problem is solved. My tentative conclusion: you should use every output of each module in the `forward` function.
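
To make that concrete, here is a minimal, hypothetical sketch of the failure mode (the module and sizes are made up, not from the reporter's code). DDP registers a gradient hook for every parameter when the model is wrapped, so a branch that is registered but never used in `forward` leaves the reducer waiting:

import torch.nn as nn

class ToyNeck(nn.Module):
    """Hypothetical module reproducing the unused-branch problem."""

    def __init__(self):
        super().__init__()
        # Five levels are registered as parameters of the module...
        self.levels = nn.ModuleList(
            nn.Conv2d(8, 8, 3, padding=1) for _ in range(5))

    def forward(self, x):
        # ...but only the first four are used, so self.levels[4] never
        # produces a gradient and DDP's reducer raises the error above.
        return sum(conv(x) for conv in self.levels[:4])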

This was helpful. I encountered the same error message in a custom architecture. Here is how to solve it without changing the module: if you define 5 layers but only use the output of the 4th layer to calculate a specific loss, you can solve the problem by multiplying the output of the 5th layer by zero and adding it to the loss. This way, you trick PyTorch into believing that all parameters contribute to the loss. Problem solved. Deleting the 5th layer is not an option in my case, because I need the output of this layer in most training steps (but not all).

loss = your_loss_function(output_layer_4) + 0 * output_layer_5.mean()
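
As a usage sketch (the function, criterion, and tensor names are placeholders): the zero-weighted term contributes nothing to the loss value, but autograd still records a gradient of zero for the 5th layer's parameters, which is all the DDP reducer needs to see each iteration.

def compute_loss(output_layer_4, output_layer_5, target, criterion):
    loss = criterion(output_layer_4, target)
    # Dummy term: 0 * mean() keeps layer 5 in the autograd graph
    # without changing the value of the loss.
    return loss + 0 * output_layer_5.mean()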

@SystemErrorWang I am also facing the same problem. When I set find_unused_parameters = cfg.get('find_unused_parameters', True), the error disappeared, but my training process got stuck.
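
For context, a rough sketch of where that flag ends up in mmdetection's distributed training path (exact code differs across versions; cfg and model come from the surrounding training script):

import torch
from mmcv.parallel import MMDistributedDataParallel

# Read the switch from the config, defaulting to False.
find_unused_parameters = cfg.get('find_unused_parameters', False)
# The kwarg is forwarded to torch.nn.parallel.DistributedDataParallel.
model = MMDistributedDataParallel(
    model.cuda(),
    device_ids=[torch.cuda.current_device()],
    broadcast_buffers=False,
    find_unused_parameters=find_unused_parameters)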

I met the same issue, but I solved it. The reason is that in my model class I define an FPN module with 5 levels of output feature maps in the `__init__` function, but in the `forward` function I only use 4 of them. When I use all of them, the problem is solved. My tentative conclusion: you should use every output of each module in the `forward` function.

I am using the latest version of mmdetection, but it still shows this error. And when I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help solve it?

Traceback (most recent call last):
  File "./tools/train.py", line 161, in <module>
    main()
  File "./tools/train.py", line 157, in main
    meta=meta)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 179, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 383, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 282, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 74, in batch_processor
    losses = model(**data)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 464, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:514)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x14e1cc446193 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x731 (0x14e217e956f1 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0xa168ea (0x14e217e818ea in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x295a74 (0x14e217700a74 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x56198c3af004 in /home/madhav3101/torch-env/bin/python3)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x56198c3af121 in /home/madhav3101/torch-env/bin/python3)
frame #6: _PyEval_EvalFrameDefault + 0x532e (0x56198c40b40e in /home/madhav3101/torch-env/bin/python3)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #8: _PyFunction_FastCallDict + 0x3d8 (0x56198c34d1e8 in /home/madhav3101/torch-env/bin/python3)
frame #9: _PyObject_Call_Prepend + 0x63 (0x56198c363cb3 in /home/madhav3101/torch-env/bin/python3)
frame #10: PyObject_Call + 0x6e (0x56198c3587de in /home/madhav3101/torch-env/bin/python3)
frame #11: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #13: _PyFunction_FastCallDict + 0x3d8 (0x56198c34d1e8 in /home/madhav3101/torch-env/bin/python3)
frame #14: _PyObject_Call_Prepend + 0x63 (0x56198c363cb3 in /home/madhav3101/torch-env/bin/python3)
frame #15: <unknown function> + 0x170cca (0x56198c3a6cca in /home/madhav3101/torch-env/bin/python3)
frame #16: PyObject_Call + 0x6e (0x56198c3587de in /home/madhav3101/torch-env/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #19: _PyFunction_FastCallDict + 0x3d8 (0x56198c34d1e8 in /home/madhav3101/torch-env/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #22: _PyFunction_FastCallDict + 0x1d4 (0x56198c34cfe4 in /home/madhav3101/torch-env/bin/python3)
frame #23: _PyObject_Call_Prepend + 0x63 (0x56198c363cb3 in /home/madhav3101/torch-env/bin/python3)
frame #24: PyObject_Call + 0x6e (0x56198c3587de in /home/madhav3101/torch-env/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1e3e (0x56198c407f1e in /home/madhav3101/torch-env/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x56198c3ae337 in /home/madhav3101/torch-env/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x535 (0x56198c406615 in /home/madhav3101/torch-env/bin/python3)
frame #29: _PyEval_EvalCodeWithName + 0xba9 (0x56198c34c7c9 in /home/madhav3101/torch-env/bin/python3)
frame #30: _PyFunction_FastCallKeywords + 0x387 (0x56198c3ae337 in /home/madhav3101/torch-env/bin/python3)
frame #31: _PyEval_EvalFrameDefault + 0x14f5 (0x56198c4075d5 in /home/madhav3101/torch-env/bin/python3)
frame #32: _PyFunction_FastCallKeywords + 0xfb (0x56198c3ae0ab in /home/madhav3101/torch-env/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x6f6 (0x56198c4067d6 in /home/madhav3101/torch-env/bin/python3)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x56198c34bf19 in /home/madhav3101/torch-env/bin/python3)
frame #35: PyEval_EvalCodeEx + 0x44 (0x56198c34cdd4 in /home/madhav3101/torch-env/bin/python3)
frame #36: PyEval_EvalCode + 0x1c (0x56198c34cdfc in /home/madhav3101/torch-env/bin/python3)
frame #37: <unknown function> + 0x22f9e4 (0x56198c4659e4 in /home/madhav3101/torch-env/bin/python3)
frame #38: PyRun_FileExFlags + 0xa1 (0x56198c46fbd1 in /home/madhav3101/torch-env/bin/python3)
frame #39: PyRun_SimpleFileExFlags + 0x1c3 (0x56198c46fdc3 in /home/madhav3101/torch-env/bin/python3)
frame #40: <unknown function> + 0x23aedb (0x56198c470edb in /home/madhav3101/torch-env/bin/python3)
frame #41: _Py_UnixMain + 0x3c (0x56198c470fbc in /home/madhav3101/torch-env/bin/python3)
frame #42: __libc_start_main + 0xf0 (0x14e21e95f830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #43: <unknown function> + 0x1dfed2 (0x56198c415ed2 in /home/madhav3101/torch-env/bin/python3)

Traceback (most recent call last):
  File "./tools/train.py", line 161, in <module>
    main()
  File "./tools/train.py", line 157, in main
    meta=meta)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 179, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 383, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/mmcv/runner/runner.py", line 282, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/madhav3101/pytorch-codes/mmdetection_v2/mmdetection/mmdet/apis/train.py", line 74, in batch_processor
    losses = model(**data)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 464, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:514)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x1503c9261193 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x731 (0x150414cb06f1 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0xa168ea (0x150414c9c8ea in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x295a74 (0x15041451ba74 in /home/madhav3101/torch-env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x56157b7d7004 in /home/madhav3101/torch-env/bin/python3)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x56157b7d7121 in /home/madhav3101/torch-env/bin/python3)
frame #6: _PyEval_EvalFrameDefault + 0x532e (0x56157b83340e in /home/madhav3101/torch-env/bin/python3)
frame #7: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #8: _PyFunction_FastCallDict + 0x3d8 (0x56157b7751e8 in /home/madhav3101/torch-env/bin/python3)
frame #9: _PyObject_Call_Prepend + 0x63 (0x56157b78bcb3 in /home/madhav3101/torch-env/bin/python3)
frame #10: PyObject_Call + 0x6e (0x56157b7807de in /home/madhav3101/torch-env/bin/python3)
frame #11: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #12: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #13: _PyFunction_FastCallDict + 0x3d8 (0x56157b7751e8 in /home/madhav3101/torch-env/bin/python3)
frame #14: _PyObject_Call_Prepend + 0x63 (0x56157b78bcb3 in /home/madhav3101/torch-env/bin/python3)
frame #15: <unknown function> + 0x170cca (0x56157b7cecca in /home/madhav3101/torch-env/bin/python3)
frame #16: PyObject_Call + 0x6e (0x56157b7807de in /home/madhav3101/torch-env/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #19: _PyFunction_FastCallDict + 0x3d8 (0x56157b7751e8 in /home/madhav3101/torch-env/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #22: _PyFunction_FastCallDict + 0x1d4 (0x56157b774fe4 in /home/madhav3101/torch-env/bin/python3)
frame #23: _PyObject_Call_Prepend + 0x63 (0x56157b78bcb3 in /home/madhav3101/torch-env/bin/python3)
frame #24: PyObject_Call + 0x6e (0x56157b7807de in /home/madhav3101/torch-env/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1e3e (0x56157b82ff1e in /home/madhav3101/torch-env/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x56157b7d6337 in /home/madhav3101/torch-env/bin/python3)
frame #28: _PyEval_EvalFrameDefault + 0x535 (0x56157b82e615 in /home/madhav3101/torch-env/bin/python3)
frame #29: _PyEval_EvalCodeWithName + 0xba9 (0x56157b7747c9 in /home/madhav3101/torch-env/bin/python3)
frame #30: _PyFunction_FastCallKeywords + 0x387 (0x56157b7d6337 in /home/madhav3101/torch-env/bin/python3)
frame #31: _PyEval_EvalFrameDefault + 0x14f5 (0x56157b82f5d5 in /home/madhav3101/torch-env/bin/python3)
frame #32: _PyFunction_FastCallKeywords + 0xfb (0x56157b7d60ab in /home/madhav3101/torch-env/bin/python3)
frame #33: _PyEval_EvalFrameDefault + 0x6f6 (0x56157b82e7d6 in /home/madhav3101/torch-env/bin/python3)
frame #34: _PyEval_EvalCodeWithName + 0x2f9 (0x56157b773f19 in /home/madhav3101/torch-env/bin/python3)
frame #35: PyEval_EvalCodeEx + 0x44 (0x56157b774dd4 in /home/madhav3101/torch-env/bin/python3)
frame #36: PyEval_EvalCode + 0x1c (0x56157b774dfc in /home/madhav3101/torch-env/bin/python3)
frame #37: <unknown function> + 0x22f9e4 (0x56157b88d9e4 in /home/madhav3101/torch-env/bin/python3)
frame #38: PyRun_FileExFlags + 0xa1 (0x56157b897bd1 in /home/madhav3101/torch-env/bin/python3)
frame #39: PyRun_SimpleFileExFlags + 0x1c3 (0x56157b897dc3 in /home/madhav3101/torch-env/bin/python3)
frame #40: <unknown function> + 0x23aedb (0x56157b898edb in /home/madhav3101/torch-env/bin/python3)
frame #41: _Py_UnixMain + 0x3c (0x56157b898fbc in /home/madhav3101/torch-env/bin/python3)
frame #42: __libc_start_main + 0xf0 (0x15041b77a830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #43: <unknown function> + 0x1dfed2 (0x56157b83ded2 in /home/madhav3101/torch-env/bin/python3)

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "/home/madhav3101/miniconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/madhav3101/miniconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/madhav3101/torch-env/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/madhav3101/torch-env/bin/python3', '-u', './tools/train.py', '--local_rank=1', 'configs/dcn/db_cascade_mask_rcnn_r101_fpn_dconv_c3-c5_1x_coco.py', '--launcher', 'pytorch', '--work-dir', '/ssd_scratch/cvit/madhav/train_dataset/coco/logs/', '--gpus', '2']' returned non-zero exit status 1.

I also got the same problem. Even when I set the flag find_unused_parameters=True, the problem doesn’t disappear: training keeps freezing without any error logs. Also, I found that the code you published half a year ago worked really well, as @jajajajaja121 mentioned. I hope we can fix this issue soon.

I am getting the same error after adding the following code to fpn.py (I want to freeze the FPN weights):

def _freeze_stages(self):
    # Put every conv in eval mode and stop its gradient updates.
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

def train(self, mode=True):
    # Re-apply the freeze whenever the runner switches modes, since
    # .train() would otherwise flip the convs back to training mode.
    super(FPN, self).train(mode)
    if self.freeze_weights:
        self._freeze_stages()

Setting find_unused_parameters = True also solved my problem.

Freezing the layers during initialization, or before wrapping the model with MMDistributedDataParallel, will solve the issue!
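
A sketch of that ordering (the builder and attribute names are illustrative, not mmdetection API): DDP only registers parameters that require gradients at construction time, so anything frozen before wrapping is simply excluded from gradient reduction instead of being waited on.

import torch
from mmcv.parallel import MMDistributedDataParallel

model = build_my_detector()           # hypothetical builder
# Freeze *before* wrapping, not inside train()/forward().
for p in model.neck.parameters():
    p.requires_grad = False

model = MMDistributedDataParallel(
    model.cuda(), device_ids=[torch.cuda.current_device()])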

Thank you! Before I raised this issue, I had already tried adding the parameter as indicated in issue #2117. After I restarted, the error seemed to disappear; the log only showed the usual parameter mismatches. But my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage: no model was loaded. So after about half an hour of waiting, I gave up!
I hope you can pay some attention to this issue. By the way, I noticed that the code you published about half a year ago worked well. Is this something you can make use of? Thank you again!

@vincentwei0919 I have a similar problem

I am getting the same error after adding the following code to fpn.py (I want to freeze the FPN weights):

def _freeze_stages(self):
    # Put every conv in eval mode and stop its gradient updates.
    for m in self.modules():
        if isinstance(m, nn.Conv2d):
            m.eval()
            for p in m.parameters():
                p.requires_grad = False

def train(self, mode=True):
    # Re-apply the freeze whenever the runner switches modes, since
    # .train() would otherwise flip the convs back to training mode.
    super(FPN, self).train(mode)
    if self.freeze_weights:
        self._freeze_stages()

Setting find_unused_parameters = True also solved my problem.

Thank you! Before I raised this issue, I had already tried adding the parameter as indicated in issue #2117. After I restarted, the error seemed to disappear; the log only showed the usual parameter mismatches. But my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage: no model was loaded. So after about half an hour of waiting, I gave up! I hope you can pay some attention to this issue. By the way, I noticed that the code you published about half a year ago worked well. Is this something you can make use of? Thank you again!

I also ran into the same situation. Additionally, when I use non-distributed training but with two cards, it raises ValueError: All dicts must have the same number of keys.