MONAI: import cv2 + nvidia/pytorch:22.09-py3 + DistributedDataParallel. (FIND was unable to find an engine)
EDIT: the bug is reproducible in the newest nvidia/pytorch:22.09-py3 Docker container, but is not reproducible in older containers (older PyTorch/cuDNN)
Something in MetaTensor makes DistributedDataParallel fail (this is in addition to this bug https://github.com/Project-MONAI/MONAI/issues/5283)
For example, this code fails:
```python
import torch.distributed as dist
import torch
from monai.data import MetaTensor
# from monai.config.type_definitions import NdarrayTensor
from torch.cuda.amp import autocast

torch.autograd.set_detect_anomaly(True)


def main():
    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))


def main_worker(rank, ngpus_per_node):
    print(f"rank {rank}")
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True

    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=False)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)
    print("Done.", out.shape)


if __name__ == "__main__":
    main()
```
with the following error:
```
-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/mnt/amproj/Code/automl/tasks/hecktor22/autoconfig_segresnet/test_monai.py", line 29, in main_worker
    out = model(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: FIND was unable to find an engine to execute this computation
```
The MetaTensor is actually never used or initialized here, but something in it (or in its imports) makes the code fail. Since we import MetaTensor everywhere, any code that uses it fails. I've traced it down to this import (inside MetaTensor.py):
`from monai.config.type_definitions import NdarrayTensor`
Importing this line alone also makes the code fail. Somehow it confuses the conv3d operation, and possibly other operations.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 16 (9 by maintainers)
Commits related to this issue
- Update PyTorch base docker to 22.09 (#5293): Fixes #5269 #5291. This PR updated the PyTorch base docker to 22.09. — committed to Project-MONAI/MONAI by Nic-Ma 2 years ago
I see, very good, thank you guys
It's already been addressed by https://github.com/Project-MONAI/MONAI/pull/5293 (by not importing `cv2`, https://github.com/Project-MONAI/MONAI/blob/dev/monai/__init__.py#L50), with a test case included. What Nic mentions is a possible alternative solution in case `cv2` is imported for some other purpose.
Hi @myron @wyli,
After more analysis, I found that this issue only occurs when you set:
`torch.backends.cudnn.benchmark = True`
To unblock your work, I think you can remove this line or set it to `False` for now. Thanks.
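A minimal sketch of that workaround, using the same flag as the repro script above:

```python
# Workaround suggested above: keep the cuDNN benchmark autotuner disabled so the
# conv3d forward does not go through the engine search that fails in this setup.
import torch

torch.backends.cudnn.benchmark = False  # instead of True in main_worker() above
```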
Hi @myron,
The MONAI import logic is different: we import everything even if you only import one component (https://github.com/Project-MONAI/MONAI/blob/dev/monai/__init__.py#L48), so it may hit an `import cv2` somewhere in the codebase, for example: https://github.com/Project-MONAI/MONAI/blob/dev/monai/data/video_dataset.py#L28
Thanks.
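A quick check of this eager-import behaviour (a minimal sketch, assuming opencv-python is installed in the container):

```python
# Importing a single MONAI component still executes monai/__init__.py, which
# loads the submodules eagerly; if opencv-python is installed, cv2 is imported
# as a side effect (e.g. via monai.data.video_dataset).
import sys

from monai.data import MetaTensor  # noqa: F401

print("cv2 imported as a side effect:", "cv2" in sys.modules)
```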
It seems it's triggered by `import cv2`, on driver 470.82.01 and `nvcr.io/nvidia/pytorch:22.09-py3` (the root cause is not really from MONAI… perhaps we should report this to the framework team instead). To reproduce, launch `nvcr.io/nvidia/pytorch:22.09-py3` and run `python test.py`, where `test.py` has the following content:
output:
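Neither the `test.py` contents nor the output were preserved in this text. As a hypothetical reconstruction (an assumption based on this thread: a bare `import cv2`, `cudnn.benchmark = True`, and a conv3d forward under DDP/autocast, as in the earlier repro), the script probably looked roughly like this:

```python
# Hypothetical reconstruction of test.py: the MONAI import from the original
# script is replaced by a bare `import cv2`, which the comment above identifies
# as the actual trigger inside nvcr.io/nvidia/pytorch:22.09-py3.
import cv2  # noqa: F401  # the import alone is the suspected trigger

import torch
import torch.distributed as dist
from torch.cuda.amp import autocast


def main_worker(rank, ngpus_per_node):
    dist.init_process_group(backend="nccl", init_method="tcp://127.0.0.1:23456",
                            world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True  # the failure only appears with this enabled

    model = torch.nn.Conv3d(1, 32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)  # raises "RuntimeError: FIND was unable to find an engine ..."
    print("Done.", out.shape)


if __name__ == "__main__":
    ngpus = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus, args=(ngpus,))
```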
Hi @myron, @wyli,
I tried to execute the test program on a V100-32G with the latest MONAI and the 22.09 docker, and got the output below:
Thanks.