mmsegmentation: pred_label and label mismatch errors occur when using multi-node multi-gpus

I use `gpu_collect` since my multi-node machines have no shared storage. When I run the code on multiple nodes with multiple GPUs, training works fine, but an error occurs during evaluation.

mask: torch.Size([360, 530])
pred_label: torch.Size([256, 256])
Traceback (most recent call last):
  File "tools/train.py", line 166, in <module>
    main()
  File "tools/train.py", line 162, in main
    meta=meta)
  File "/opt/mmsegmentation/mmseg/apis/train.py", line 116, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 66, in train
    self.call_hook('after_train_iter')
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 215, in after_train_iter
    self._do_evaluate(runner)
  File "/opt/mmsegmentation/mmseg/core/evaluation/eval_hooks.py", line 96, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/.local/lib/python3.7/site-packages/mmcv/runner/hooks/evaluation.py", line 311, in evaluate
    results, logger=runner.logger, **self.eval_kwargs)
  File "/opt/mmsegmentation/mmseg/datasets/custom.py", line 344, in evaluate
    reduce_zero_label=self.reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 298, in eval_metrics
    reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 129, in total_intersect_and_union
    label_map, reduce_zero_label)
  File "/opt/mmsegmentation/mmseg/core/evaluation/metrics.py", line 79, in intersect_and_union
    pred_label = pred_label[mask]
IndexError: The shape of the mask [360, 530] at index 0 does not match the shape of the indexed tensor [256, 256] at index 0
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'tools/train.py', '--local_rank=7', 'configs/swin/upernet_swin_tiny_patch4_window7_512x512_160k_ade20k_pretrain_224x224_1K.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

I suspect this error is caused by an ordering mismatch between `pred_label` and `label` after the results are gathered across ranks, as discussed in PR #522. But that PR does not seem to fix it.
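To illustrate the suspected ordering problem: with a distributed sampler, rank r typically processes samples r, r+world_size, r+2*world_size, and so on. If the gathered predictions are simply concatenated rank by rank, their order no longer matches the dataset order, so prediction i is compared against the wrong ground-truth mask — and the shapes mismatch whenever images have different sizes, exactly as in the traceback above. The sketch below is hypothetical (not mmseg's actual gather code; `gather_in_dataset_order` is an illustrative helper, not an mmcv API) and shows how carrying the sample index along lets the results be restored to dataset order:

```python
# Hypothetical sketch of the ordering issue, not mmseg's actual code.
# Each rank records (sample_idx, pred) pairs; sorting the merged list
# by sample_idx restores dataset order before metric computation.

def gather_in_dataset_order(per_rank_results):
    """per_rank_results[r] is the list of (sample_idx, pred) pairs
    produced by rank r. Returns preds sorted back into dataset order."""
    merged = [pair for rank_results in per_rank_results for pair in rank_results]
    merged.sort(key=lambda pair: pair[0])  # reorder by original sample index
    return [pred for _, pred in merged]

# Simulate 8 samples on 2 ranks: rank 0 gets 0,2,4,6; rank 1 gets 1,3,5,7.
world_size = 2
per_rank = [
    [(i, f"pred_{i}") for i in range(0, 8, world_size)],  # rank 0
    [(i, f"pred_{i}") for i in range(1, 8, world_size)],  # rank 1
]
ordered = gather_in_dataset_order(per_rank)
print(ordered)
# Naive per-rank concatenation would instead interleave incorrectly:
# pred_0, pred_2, pred_4, pred_6, pred_1, ...
```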

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15

Most upvoted comments

Hi @PeizeSun, please give #780 a try. Looking forward to your feedback.

It works! Great!

Thanks for your hard work!