MONAI: import cv2 + nvidia/pytorch:22.09-py3 + DistributedDataParallel. (FIND was unable to find an engine)

EDIT: the bug is reproducible in the newest nvidia/pytorch:22.09-py3 docker container, but is not reproducible in older containers (older PyTorch/cuDNN).

Something in MetaTensor makes DistributedDataParallel fail (this is in addition to the bug reported in https://github.com/Project-MONAI/MONAI/issues/5283).

For example, this code fails:

import torch.distributed as dist
import torch

from monai.data import MetaTensor  # never used below; the import alone triggers the failure
#from monai.config.type_definitions import NdarrayTensor

from torch.cuda.amp import autocast  
torch.autograd.set_detect_anomaly(True)

def main():

    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))

def main_worker(rank, ngpus_per_node):

    print(f"rank {rank}")

    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True

    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=False)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)

    print("Done.", out.shape)

if __name__ == "__main__":
    main()

with this error:

-- Process 6 terminated with the following error:                                                                                                               
Traceback (most recent call last):                                                                                                                              
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap                                                               
    fn(i, *args)                                                                                                                                                
  File "/mnt/amproj/Code/automl/tasks/hecktor22/autoconfig_segresnet/test_monai.py", line 29, in main_worker                                                    
    out = model(x)                                                                                                                                              
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl                                                            
    return forward_call(*input, **kwargs)                                                                                                                       
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward                                                         
    output = self._run_ddp_forward(*inputs, **kwargs)                                                                                                           
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward                                                 
    return module_to_run(*inputs[0], **kwargs[0])                                                                                                               
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl                                                            
    return forward_call(*input, **kwargs)                                                                                                                       
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 613, in forward                                                                  
    return self._conv_forward(input, self.weight, self.bias)                                                                                                    
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: FIND was unable to find an engine to execute this computation

The MetaTensor is actually never used or initialized here, but something in it (or its imports) makes the code fail. Since we import MetaTensor everywhere, any code that uses it fails. I've traced it down to this import (inside MetaTensor.py): from monai.config.type_definitions import NdarrayTensor

Importing this line alone also makes the code fail.

Somehow it confuses the conv3d operation, and possibly other operations as well.
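
For completeness, a minimal sketch of that variant (same container, driver, and multi-GPU setup assumed as above; only the MONAI import differs from the script at the top):

import torch.distributed as dist
import torch

# importing only this line (instead of MetaTensor) is enough to trigger the failure
from monai.config.type_definitions import NdarrayTensor

from torch.cuda.amp import autocast

def main():
    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))

def main_worker(rank, ngpus_per_node):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True

    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=False)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)  # fails here with "FIND was unable to find an engine" when the bug is present

    print("Done.", out.shape)

if __name__ == "__main__":
    main()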

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (9 by maintainers)

Most upvoted comments

I see, very good, thank you guys

It’s already been addressed by https://github.com/Project-MONAI/MONAI/pull/5293 (by not importing cv2, https://github.com/Project-MONAI/MONAI/blob/dev/monai/__init__.py#L50), with a test case included. What Nic mentions is a possible alternative solution in case cv2 is imported for some other purposes.
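
For illustration, a rough sketch of the pattern that fix follows (a local, deferred import so that importing monai alone never imports cv2); the helper name here is hypothetical, not MONAI's actual API:

def _load_video(filename: str):
    # Deferred import: cv2 is only imported when a video dataset is actually
    # used, so "import monai" by itself no longer pulls in OpenCV.
    try:
        import cv2
    except ImportError as e:
        raise ImportError("OpenCV (cv2) is required for video datasets.") from e
    return cv2.VideoCapture(filename)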

Hi @myron @wyli ,

After more analysis, I found that this issue only occurs when you set torch.backends.cudnn.benchmark = True. To unblock your work, I think you can remove this line or set it to False for now.
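
A minimal sketch of that workaround applied to the reproduction script above (only the benchmark line changes):

def main_worker(rank, ngpus_per_node):
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    # workaround: keep cuDNN benchmark mode off (or simply delete the line);
    # the failure was only observed with torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.benchmark = False
    # ... the rest of main_worker is unchanged ...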

Thanks.

Hi @myron ,

The MONAI import logic is different: we import everything even if you only import one component (https://github.com/Project-MONAI/MONAI/blob/dev/monai/__init__.py#L48), so it may import cv2 somewhere in the codebase, for example: https://github.com/Project-MONAI/MONAI/blob/dev/monai/data/video_dataset.py#L28
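
A quick way to confirm this on a given MONAI install (diagnostic sketch; it assumes the package-level import chain described above is still in place):

import sys

from monai.config.type_definitions import NdarrayTensor  # any single MONAI import

# True if importing one MONAI component was enough to pull in OpenCV
print("cv2" in sys.modules)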

Thanks.

It seems it’s triggered by import cv2, on driver 470.82.01 and nvcr.io/nvidia/pytorch:22.09-py3 (the root cause is not really in MONAI… perhaps we should report this to the framework team instead).

To reproduce, launch nvcr.io/nvidia/pytorch:22.09-py3, and run python test.py, where test.py has the following content:

import torch.distributed as dist
import torch

import cv2  # never used below; the import alone triggers the failure

from torch.cuda.amp import autocast
torch.autograd.set_detect_anomaly(True)

def main():

    ngpus_per_node = torch.cuda.device_count()
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))

def main_worker(rank, ngpus_per_node):

    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=ngpus_per_node, rank=rank)
    torch.backends.cudnn.benchmark = True

    model = torch.nn.Conv3d(in_channels=1, out_channels=32, kernel_size=3, bias=True).to(rank)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank, find_unused_parameters=False)

    x = torch.ones(1, 1, 192, 192, 192).to(rank)
    with autocast(enabled=True):
        out = model(x)

if __name__ == "__main__":
    main()

output:

root@3512928:/workspace# python test.py
/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py:9: UserWarning: is_namedtuple is deprecated, please use the python checks instead
  warnings.warn("is_namedtuple is deprecated, please use the python checks instead")
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    main()
  File "test.py", line 12, in main
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/test.py", line 24, in main_worker
    out = model(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: FIND was unable to find an engine to execute this computation

Hi @myron , @wyli ,

I tried to execute the test program on a V100-32G with the latest MONAI and the 22.09 docker container, and got the output below:

root@apt-sh-ai:/workspace/data/medical/MONAI# python test_ddp.py 
rank 0
rank 1
2022-10-08 10:13:58,995 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-10-08 10:13:59,005 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-10-08 10:13:59,005 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-10-08 10:13:59,005 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
is_namedtuple is deprecated, please use the python checks instead
is_namedtuple is deprecated, please use the python checks instead
Traceback (most recent call last):
  File "test_ddp.py", line 32, in <module>
    main()
  File "test_ddp.py", line 13, in main
    torch.multiprocessing.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node,))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/workspace/data/medical/MONAI/test_ddp.py", line 27, in main_worker
    out = model(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1015, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 976, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 613, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 608, in _conv_forward
    return F.conv3d(
RuntimeError: FIND was unable to find an engine to execute this computation

Thanks.