vision: ops.deform_conv2d causes CUDA illegal memory access

πŸ› Bug

I was trying to benchmark the speed of deformable conv2d, but I always run into a CUDA memory error.

To Reproduce

$ ipython
Python 3.8.5 (default, Jul 27 2020, 08:42:51) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch
   ...: import torchvision as tv
   ...: weight = torch.randn(9,9,3,3).cuda()
   ...: weight.requires_grad = True
   ...: img = torch.randn(8, 9, 1000, 110).cuda()
   ...: def test():
   ...:     offset = torch.randn(8,18,1000,110).cuda()
   ...:     out = tv.ops.deform_conv2d(img, offset, weight, padding=1)
   ...:     out.mean().backward()
   ...: 

In [2]: import os

In [3]: os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

In [4]: timeit test()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-a1086d7a4706> in <module>
----> 1 get_ipython().run_line_magic('timeit', 'test()')

/usr/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line, _stack_depth)
   2324                 kwargs['local_ns'] = self.get_local_scope(stack_depth)
   2325             with self.builtin_trap:
-> 2326                 result = fn(*args, **kwargs)
   2327             return result
   2328 

<decorator-gen-60> in timeit(self, line, cell, local_ns)

/usr/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

/usr/lib/python3.8/site-packages/IPython/core/magics/execution.py in timeit(self, line, cell, local_ns)
   1171                     break
   1172 
-> 1173         all_runs = timer.repeat(repeat, number)
   1174         best = min(all_runs) / number
   1175         worst = max(all_runs) / number

/usr/lib/python3.8/timeit.py in repeat(self, repeat, number)
    203         r = []
    204         for i in range(repeat):
--> 205             t = self.timeit(number)
    206             r.append(t)
    207         return r

/usr/lib/python3.8/site-packages/IPython/core/magics/execution.py in timeit(self, number)
    167         gc.disable()
    168         try:
--> 169             timing = self.inner(it, self.timer)
    170         finally:
    171             if gcold:

<magic-timeit> in inner(_it, _timer)

<ipython-input-1-a97200bb984a> in test()
      5 img = torch.randn(8, 9, 1000, 110).cuda()
      6 def test():
----> 7     offset = torch.randn(8,18,1000,110).cuda()
      8     out = tv.ops.deform_conv2d(img, offset, weight, padding=1)
      9     out.mean().backward()

RuntimeError: CUDA error: an illegal memory access was encountered

Environment

PyTorch version: 1.6.0
Is debug build: False
CUDA used to build PyTorch: 11.0

OS: Arch Linux (x86_64)
GCC version: (GCC) 10.1.0
Clang version: 10.0.1
CMake version: version 3.18.1

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.0.2
GPU models and configuration: GPU 0: GeForce GTX 1050 Ti
Nvidia driver version: 450.57
cuDNN version: Probably one of the following:
  /usr/lib/libcudnn.so.8.0.2
  /usr/lib/libcudnn_adv_infer.so.8.0.2
  /usr/lib/libcudnn_adv_train.so.8.0.2
  /usr/lib/libcudnn_cnn_infer.so.8.0.2
  /usr/lib/libcudnn_cnn_train.so.8.0.2
  /usr/lib/libcudnn_ops_infer.so.8.0.2
  /usr/lib/libcudnn_ops_train.so.8.0.2

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.6.0
[pip3] torch-cluster==1.4.5
[pip3] torch-geometric==1.3.2
[pip3] torch-scatter==1.4.0
[pip3] torch-sparse==0.4.3
[pip3] torchvision==0.7.0a0
[conda] Could not collect

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 40 (6 by maintainers)

Most upvoted comments

The following script produces an error, though not the exact error from above (it appears related):

torchvision_bug.py

from typing import Any

import torch
from torch.nn import Conv2d, LeakyReLU
from torch.nn.modules.utils import _pair
from torchvision.ops import DeformConv2d, deform_conv2d


class ModulatedDeformConvPack(DeformConv2d):
    def __init__(self, *args, offset_channels, deformable_groups: int = 1, post_activation: Any = None, **kwargs):
        super(ModulatedDeformConvPack, self).__init__(*args, **kwargs)

        self.deformable_groups = deformable_groups

        self.conv_offset = Conv2d(
            offset_channels,
            self.deformable_groups * 3 * self.kernel_size[0] * self.kernel_size[1],
            kernel_size=self.kernel_size,
            stride=_pair(self.stride),
            padding=_pair(self.padding),
            dilation=_pair(self.dilation),
            bias=True,
        )

        self.init_offset()

        # Instantiate only if an activation class was provided (the default is None).
        self.post_activation = post_activation() if post_activation is not None else None

    def init_offset(self):
        self.conv_offset.weight.data.zero_()
        self.conv_offset.bias.data.zero_()

    def forward(self, x, m):
        out = self.conv_offset(m)
        o1, o2, mask = torch.chunk(out, 3, dim=1)
        offset = torch.cat((o1, o2), dim=1)
        mask = torch.sigmoid(mask)
        res = deform_conv2d(input=x, offset=offset, weight=self.weight, stride=_pair(self.stride), padding=_pair(self.padding), dilation=_pair(self.dilation), mask=mask)

        if self.post_activation is not None:
            res = self.post_activation(res)

        return res


device = torch.device("cuda")

bs = 23

test_in = torch.randn(bs, 160, 256, 256, device=device)
test_offset = torch.randn(bs, 64, 256, 256, device=device)
dfc = ModulatedDeformConvPack(160, 64, 3, 1, 1, offset_channels=64, deformable_groups=8, post_activation=LeakyReLU).to(device)

print(f"Total memory used before DFC call: {(torch.cuda.max_memory_allocated() / torch.cuda.get_device_properties(device).total_memory) * 100}%")

test_out = dfc(test_in, test_offset)

print(f"Total memory used after DFC call: {(torch.cuda.max_memory_allocated() / torch.cuda.get_device_properties(device).total_memory) * 100}%")

print(test_out.cpu()[0, 0, 0, 0])

To reproduce, run:

CUDA_LAUNCH_BLOCKING=1 python torchvision_bug.py

The error message this time is:

error in deformable_im2col: an illegal memory access was encountered
Traceback (most recent call last):
  File "torchvision_bug.py", line 56, in <module>
    test_out = dfc(test_in, test_offset)
  File "/home/mehrlich/.cache/pypoetry/virtualenvs/fvr-_yTNPG6U-py3.8/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "torchvision_bug.py", line 38, in forward
    res = deform_conv2d(input=x, offset=offset, weight=self.weight, stride=_pair(self.stride), padding=_pair(self.padding), dilation=_pair(self.dilation), mask=mask)
  File "/home/mehrlich/.cache/pypoetry/virtualenvs/fvr-_yTNPG6U-py3.8/lib/python3.8/site-packages/torchvision/ops/deform_conv.py", line 89, in deform_conv2d
    return torch.ops.torchvision.deform_conv2d(
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Note the variable bs: on my test machine I can reproduce the error at bs=23 or higher, i.e., bs=22 works fine while bs=23 reproduces the error.

I am using a Tesla V100 for testing. This may be related to how much memory the card has, which is why I'm printing the max allocated memory as reported by PyTorch. Note that in the working case (bs=22) it's only ~38% on my card, so there should be plenty of headroom, although I don't know for sure how accurate that figure is.

PyTorch version: 1.9.0
torchvision version: 0.10.0
CUDA version (as reported by nvidia-smi; let me know if that's not right): 11.2
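
If the crash really does scale with the total work per launch (an assumption based on the bs=22 vs. bs=23 threshold, not something I have verified in the kernels), a possible stopgap is to run the op over smaller batch chunks and concatenate the results. A minimal sketch, where max_bs is a hypothetical knob you would keep below your machine's failing threshold:

import torch
from torchvision.ops import deform_conv2d

def deform_conv2d_chunked(x, offset, weight, mask=None, max_bs=16, **kwargs):
    # Workaround sketch: run deform_conv2d over batch chunks of at most
    # max_bs samples and concatenate. This assumes the illegal memory
    # access is triggered by overly large launches, which is unconfirmed.
    outs = []
    for i in range(0, x.shape[0], max_bs):
        chunk_mask = mask[i:i + max_bs] if mask is not None else None
        outs.append(deform_conv2d(x[i:i + max_bs], offset[i:i + max_bs],
                                  weight, mask=chunk_mask, **kwargs))
    return torch.cat(outs, dim=0)

torch.cat keeps the chunks in the autograd graph, so backward still works; the cost is extra kernel launches plus the concatenation copy.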

Why is kMaxGridDim set to 65K? The maximum gridDim.x is 2**31 - 1, and it can be queried from device properties rather than hardcoded:

// Cap the launch at the device's actual grid limit instead of a hard-coded constant.
int grid_x = std::min<int>(
    at::cuda::getCurrentDeviceProperties()->maxGridSize[0],
    cuda::ATenCeilDiv(safe_downcast<int, int64_t>(outputWidth), block_y * BLOCK_STRIDE));
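
For what it's worth (my reading of the CUDA execution model, not verified against torchvision's kernels): a grid capped below the total number of blocks is only correct if the kernel uses a grid-stride loop so each thread covers several elements; a kernel that maps threads 1:1 to outputs would silently skip work instead.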