mmsegmentation: pred_label and label mismatch error occurs when using multiple nodes with multiple GPUs
I use "gpu_collect" since my multi-node machines have no shared storage. I run the code on multiple nodes with multiple GPUs. Training works fine, but an error occurs during evaluation.
mask: torch.Size([360, 530])
pred_label: torch.Size([256, 256])
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/opt/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 215, in after_train_iter
    self._do_evaluate(runner)
  File "/opt/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 96, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 311, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/opt/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 298, in eval_metrics
    reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 129, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 79, in intersect_and_union
    pred_label = pred_label[mask]
IndexError: The shape of the mask [360, 530] at index 0 does not match the shape of the indexed tensor [256, 256] at index 0
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=7', 'configs/swin/upernet_swin_tiny_patch4_window7_512x512_160k_ade20k_pretrain_224x224_1K.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
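The IndexError is raised by boolean-mask indexing: the mask has the shape of the ground-truth label (360 x 530), while the prediction it is applied to is 256 x 256, so the two do not describe the same image at the same resolution. One way to narrow this down before the metric code runs is to compare shapes pair by pair. The following is only a minimal diagnostic sketch, assuming predictions and labels are available as NumPy arrays. `find_shape_mismatches` is a hypothetical helper, not part of mmsegmentation or mmcv.

```python
import numpy as np


def find_shape_mismatches(preds, labels):
    """Return (index, pred_shape, label_shape) for every pair whose shapes differ.

    Hypothetical helper for debugging; not an mmsegmentation API.
    """
    mismatches = []
    for idx, (pred, label) in enumerate(zip(preds, labels)):
        if pred.shape != label.shape:
            mismatches.append((idx, pred.shape, label.shape))
    return mismatches


# Toy data: the second pair reuses the shapes reported in the log above.
preds = [np.zeros((4, 6), dtype=np.int64), np.zeros((256, 256), dtype=np.int64)]
labels = [np.zeros((4, 6), dtype=np.int64), np.zeros((360, 530), dtype=np.int64)]

for idx, pred_shape, label_shape in find_shape_mismatches(preds, labels):
    print(f"sample {idx}: pred {pred_shape} vs label {label_shape}")
```

If only some pairs differ, that would be consistent with predictions being matched to the wrong labels. If every pair differs, the predictions were more likely never resized back to the label resolution.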
I guess this error is caused by an order mismatch between pred_label and label, as discussed in PR 522. However, the fix from that PR does not seem to work here.
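To illustrate how such an order mismatch could arise: in distributed evaluation each rank predicts only its own shard of the dataset, and the step that merges the gathered results has to undo the sharding exactly. The sketch below is a pure-Python simulation, not the actual mmcv code. It assumes the merge is a round-robin over ranks, which restores dataset order only if the shards were interleaved (rank r sees samples r, r + world_size, ...). The names `shard_interleaved`, `shard_contiguous`, and `merge_round_robin` are hypothetical.

```python
def shard_interleaved(indices, world_size):
    """Rank r gets indices r, r + world_size, ...; pad by repeating the head."""
    pad = (-len(indices)) % world_size
    padded = list(indices) + list(indices[:pad])
    return [padded[r::world_size] for r in range(world_size)]


def shard_contiguous(indices, world_size):
    """Rank r gets one contiguous block of the dataset."""
    per_rank = -(-len(indices) // world_size)  # ceiling division
    return [list(indices[r * per_rank:(r + 1) * per_rank])
            for r in range(world_size)]


def merge_round_robin(parts, size):
    """Take one result from each rank in turn, then drop the padding."""
    merged = []
    for group in zip(*parts):
        merged.extend(group)
    return merged[:size]


dataset = list(range(6))  # sample indices stand in for predictions
world_size = 2

good = merge_round_robin(shard_interleaved(dataset, world_size), len(dataset))
bad = merge_round_robin(shard_contiguous(dataset, world_size), len(dataset))

print("interleaved shards:", good)  # matches dataset order
print("contiguous shards: ", bad)   # predictions no longer line up with labels
```

In the second case the merged list is a permutation of the dataset, so prediction i is compared with a label that belongs to a different image, which would produce exactly the kind of shape mismatch shown in the log whenever the images differ in size.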
About this issue
- State: closed
- Created 3 years ago
- Comments: 15
It works! Great!
Thanks for your hard work!