mmdetection: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug
A clear and concise description of what the bug is.
Reproduction
- What command or script did you run? I changed the config name from faster_rcnn_r50_fpn_1x.py to element.py:
CUDA_VISIBLE_DEVICES=1,2,3 ./tools/dist_train.sh configs/element.py 3 --autoscale-lr
- Did you make any modifications to the code or config? Did you understand what you modified? Only num_classes and work_dir in the config.
- What dataset did you use? My own dataset, prepared in the same format as VOC.
Environment

- Please run python mmdet/utils/collect_env.py to collect the necessary environment information and paste it here.
- You may add additional information that may be helpful for locating the problem, such as:
  - How you installed PyTorch [e.g., pip, conda, source]
  - Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
Error traceback
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f92f4501441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f92f4500d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f92f4de983c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7f92f4ddf2bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7f92f484acfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7f92f8173830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)
Traceback (most recent call last):
File "./tools/train.py", line 142, in <module>
main()
File "./tools/train.py", line 138, in main
meta=meta)
File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 102, in train_detector
meta=meta)
File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 171, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 371, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/runner.py", line 275, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/detect/ww_detection/mmdetection_v2/mmdet/apis/train.py", line 75, in batch_processor
losses = model(**data)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 392, in forward
self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fcaf0f72441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fcaf0f71d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7fcaf185a83c in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7fcaf18502bd in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7fcaf12bbcfc in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #5: _PyCFunction_FastCallKeywords + 0x1ac (0x4b33ec in /usr/local/bin/python)
frame #6: /usr/local/bin/python() [0x544be8]
frame #7: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #8: /usr/local/bin/python() [0x544a85]
frame #9: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #10: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #11: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #12: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #13: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #14: /usr/local/bin/python() [0x544a85]
frame #15: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #16: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #17: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #18: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #19: /usr/local/bin/python() [0x4cf4bf]
frame #20: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #22: /usr/local/bin/python() [0x544a85]
frame #23: PyEval_EvalCodeEx + 0x3e (0x54599e in /usr/local/bin/python)
frame #24: /usr/local/bin/python() [0x489dd6]
frame #25: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #27: /usr/local/bin/python() [0x544a85]
frame #28: _PyFunction_FastCallDict + 0x12a (0x54d9aa in /usr/local/bin/python)
frame #29: _PyObject_FastCallDict + 0x1e0 (0x4570c0 in /usr/local/bin/python)
frame #30: _PyObject_Call_Prepend + 0xca (0x4571ba in /usr/local/bin/python)
frame #31: PyObject_Call + 0x5c (0x456d9c in /usr/local/bin/python)
frame #32: _PyEval_EvalFrameDefault + 0x2d2a (0x548b9a in /usr/local/bin/python)
frame #33: /usr/local/bin/python() [0x544a85]
frame #34: /usr/local/bin/python() [0x544d37]
frame #35: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #36: /usr/local/bin/python() [0x544a85]
frame #37: /usr/local/bin/python() [0x544d37]
frame #38: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #39: /usr/local/bin/python() [0x544a85]
frame #40: /usr/local/bin/python() [0x544d37]
frame #41: _PyEval_EvalFrameDefault + 0xc9b (0x546b0b in /usr/local/bin/python)
frame #42: /usr/local/bin/python() [0x5440e1]
frame #43: /usr/local/bin/python() [0x544f91]
frame #44: _PyEval_EvalFrameDefault + 0x102d (0x546e9d in /usr/local/bin/python)
frame #45: /usr/local/bin/python() [0x544a85]
frame #46: PyEval_EvalCode + 0x23 (0x545913 in /usr/local/bin/python)
frame #47: PyRun_FileExFlags + 0x16f (0x42b41f in /usr/local/bin/python)
frame #48: PyRun_SimpleFileExFlags + 0xec (0x42b64c in /usr/local/bin/python)
frame #49: Py_Main + 0xd85 (0x43fa15 in /usr/local/bin/python)
frame #50: main + 0x162 (0x421b62 in /usr/local/bin/python)
frame #51: __libc_start_main + 0xf0 (0x7fcaf4be4830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #52: _start + 0x29 (0x421c39 in /usr/local/bin/python)
^CTraceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 228, in main
process.wait()
File "/usr/lib/python3.6/subprocess.py", line 1457, in wait
(pid, sts) = self._try_wait(0)
File "/usr/lib/python3.6/subprocess.py", line 1404, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
root@83403c5335c7:mmdetection_v2# ^C
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here; that would be much appreciated!
This was helpful. I encountered the same error message in a custom architecture. Here is how you can solve it without changing the module: if you define five layers but only use the output of the fourth layer to compute a given loss, you can multiply the output of the fifth layer by zero and add it to the loss. This way, you trick PyTorch into believing that all parameters contribute to the loss. Problem solved. Deleting the fifth layer is not an option in my case, because I need its output in most training steps (but not all).
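A minimal sketch of that trick, with hypothetical module and branch names (not from mmdetection or the commenter's code): the unused branch is folded into the loss with a zero coefficient, so its parameters enter the autograd graph with zero gradients and DDP no longer flags them as unused.

```python
import torch
import torch.nn as nn

class FiveBranchHead(nn.Module):
    """Hypothetical module: five branches, but the loss normally uses one."""

    def __init__(self, dim=128):
        super().__init__()
        self.branches = nn.ModuleList([nn.Linear(dim, dim) for _ in range(5)])

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        loss = outs[3].mean()  # only the 4th branch drives the real loss
        # Fold the 5th branch in with weight 0: its parameters now appear in
        # the autograd graph (receiving zero gradients), so DDP's reducer
        # treats them as used and stops waiting for missing reductions.
        return loss + 0.0 * outs[4].sum()
```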
@SystemErrorWang I am also facing the same problem. When I set find_unused_parameters = cfg.get('find_unused_parameters', True), the error disappeared, but my training process got stuck.
I met the same issue, but I solved it. The reason is that in my model class I define an FPN module with 5 levels of output feature maps in the init function, but in the forward function I use only 4 of them. When I use all of them, the problem is solved. My conclusion: you should use every output of each module in the forward function.
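To illustrate the pattern this comment describes, here is a hypothetical sketch (the class and attribute names are illustrative, not mmdetection's actual FPN): a level whose output is computed but never returned keeps its parameters out of the autograd graph, which is exactly what DDP's reducer complains about.

```python
import torch.nn as nn

class TinyFPN(nn.Module):
    """Hypothetical 5-level neck; names are illustrative only."""

    def __init__(self, dim=256):
        super().__init__()
        self.levels = nn.ModuleList(
            [nn.Conv2d(dim, dim, 3, padding=1) for _ in range(5)])

    def forward(self, x):
        outs = [level(x) for level in self.levels]
        # BUG: returning only 4 of the 5 levels means self.levels[4]'s
        # parameters never contribute to the loss, so DDP waits forever for
        # their gradients. Returning tuple(outs) (all 5) avoids the error.
        return tuple(outs[:4])
```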
I am using the latest version of mmdetection, but it still shows the error. When I set find_unused_parameters = True, the error disappears but training freezes. Can anyone please help solve it?
I also got the same problem. Even when I set the flag find_unused_parameters=True, the problem doesn't disappear: training keeps freezing without any error logs. Also, I found that the code you published half a year ago worked really well, as @jajajajaja121 mentioned. I hope we can fix this issue soon.
Freezing the layers during initialization, or before distributing the model with MMDistributedDataParallel, will solve the issue! See the sketch below.
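A minimal sketch of that advice, assuming an mmdetection-style model whose FPN lives under model.neck (the attribute name is an assumption): freeze before wrapping, so the reducer never registers the frozen parameters and never waits for their gradients.

```python
import torch
from mmcv.parallel import MMDistributedDataParallel

def freeze_then_distribute(model: torch.nn.Module):
    # Freeze first: parameters with requires_grad=False at wrap time are
    # skipped by DDP's reducer. Freezing *after* wrapping leaves the reducer
    # waiting for gradients that never arrive, which triggers this error.
    for param in model.neck.parameters():  # 'neck' (the FPN) is an assumption
        param.requires_grad = False
    return MMDistributedDataParallel(
        model.cuda(), device_ids=[torch.cuda.current_device()])
```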
Thank you! Before I raised this issue, I tried adding the parameter as indicated by issue #2117. After I restarted, the error seemed to disappear; the log only showed the usual parameter mismatches. But my program just stopped and waited for something, or maybe it was processing something, I am not really sure. I also checked my GPU usage: no model was loaded. So after about half an hour of waiting, I gave up!
I hope you can pay some attention to this issue. By the way, I noticed that the code you published about half a year ago worked well. Is this a point you can make use of? Thank you again!
@vincentwei0919 I have a similar problem
I am getting the same error after adding the following code to fpn.py (I want to freeze the FPN weights):
Setting find_unused_parameters = True also solved my problem.
I also met the same situation. And when I use non-distributed training but with two cards, it raises ValueError: All dicts must have the same number of keys.
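For reference, a hedged sketch of how that flag can be wired up. In mmdetection v2-era code you can add find_unused_parameters = True to your config file and the training code forwards it roughly along these lines (the exact code varies by version, so treat this as an approximation, not the definitive implementation):

```python
import torch
from mmcv.parallel import MMDistributedDataParallel

def wrap_model(model, cfg):
    # Pulled from the config; defaults to False so DDP stays fast unless you
    # opt in with `find_unused_parameters = True` in your config file.
    find_unused_parameters = cfg.get('find_unused_parameters', False)
    return MMDistributedDataParallel(
        model.cuda(),
        device_ids=[torch.cuda.current_device()],
        broadcast_buffers=False,
        find_unused_parameters=find_unused_parameters)
```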