apex: RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)

File "../ptx/fit_extension.py", line 386, in _train_epoch
  scaled_loss.backward()
File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__
  next(self.gen)
File "../../apex/apex/amp/handle.py", line 125, in scale_loss
  optimizer._post_amp_backward(loss_scaler)
File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights
  models_are_masters=False)
File "../../apex/apex/amp/scaler.py", line 113, in unscale
  1./scale)
File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
  *args)
RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so)
<omitting python frames>
frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)

When I run amp on a single card, it produces the error above. However, when I train on more than one card, it doesn’t produce any error.

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 32 (5 by maintainers)

Most upvoted comments

I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
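
A minimal sketch of the per-process approach described above. It assumes the launcher exports a LOCAL_RANK environment variable (torchrun does; other launchers may not), and the helper names local_rank and pin_process_to_gpu are hypothetical, not part of torch or apex. torch is imported lazily so the rank lookup can be used without a GPU.

```python
import os


def local_rank(env=None, default=0):
    """Return this process's GPU index.

    Assumption: the launcher sets LOCAL_RANK; fall back to `default` otherwise.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", default))


def pin_process_to_gpu(index):
    """Make `index` the current CUDA device for this process and return it."""
    import torch  # imported here so local_rank() works without torch installed

    # After this call, tensors created without an explicit device
    # (for example torch.cuda.IntTensor([0])) land on this GPU.
    torch.cuda.set_device(index)
    return torch.device("cuda", index)
```

In a training script this would run once, before the model is built and before amp.initialize is called, e.g. device = pin_process_to_gpu(local_rank()).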

Yep, same problem.

device = torch.device('cuda:0') works OK

device = torch.device('cuda:1') fails when calling scaled_loss.backward()

Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))

I’m guessing somewhere in your code, there are 2 references being kept to different devices.

Can also be fixed by running opt-level O0, so I guess that means it’s likely not my code.

I also encountered this error. I think it may be because I use multiple GPUs. One module of my model is placed on another GPU, and I transfer my data to that GPU manually with code like p = p.to('cuda:1'). When I remove the amp code, the problem goes away. It seems apex does not support this kind of setup well.

In scaler.py, there is a line self._overflow_buf = torch.cuda.IntTensor([0]), which initializes the buffer on the default CUDA device. If the model is on another device, we hit the error “CUDA error: an illegal memory access was encountered”.
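
If the explanation above is right, the failure needs some parameter to live on a device other than the current one. The sketch below is one way to check for that before calling amp.initialize. The helper stray_parameters is hypothetical (it is not an apex or torch API); it only relies on each object having a .device attribute, and it compares devices by their string form, so the expected device should carry an explicit index such as cuda:1.

```python
from types import SimpleNamespace


def stray_parameters(named_params, expected_device):
    """Return the names of parameters that are not on expected_device.

    named_params: iterable of (name, obj) pairs where obj has a .device
    attribute, e.g. model.named_parameters().
    """
    expected = str(expected_device)
    return [name for name, p in named_params if str(p.device) != expected]


# Stand-in objects so the sketch runs without a GPU. Real use would look like:
#   current = torch.device("cuda", torch.cuda.current_device())
#   stray_parameters(model.named_parameters(), current)
fake_params = [
    ("encoder.weight", SimpleNamespace(device="cuda:0")),
    ("decoder.weight", SimpleNamespace(device="cuda:1")),
]
mismatched = stray_parameters(fake_params, "cuda:0")
```

A non-empty result means at least one parameter sits on a different GPU from the one where default-device buffers would be allocated, which matches the setups reported in this thread.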

I haven’t used Apex/AMP before, so maybe there is some user error here. That said, I also seem to get an error when using a device other than the default device. The code at the end gives me:

RuntimeError: CUDA error: an illegal memory access was encountered

for opt_levels O1 and O2. In particular, I do not seem to get an error for opt_level O3.

Version information:

  • Apex commit: 8be5b6bedead620db636516d064db39f82052e01 (latest commit when I installed it)
  • torch.version.git_version = '20607a99a31ec5405ca6aa92bc7e7bf768b7bc43' (just installed latest stable using official instructions this morning)
  • Nvidia driver: 430.14
  • Running this in docker container based on: nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 (e25e57dde9ade23a377536df339be4d8410a7a7bcddb1e96b0e2db63ac088ed4)
import torch
import torchvision

from apex import amp

device = "cuda:1"
wantIllegalAccessException = True

if __name__ == '__main__':
  if not wantIllegalAccessException:
    torch.cuda.set_device(device)

  model = torchvision.models.resnet34().to(device)
  optimizer = torch.optim.Adam(model.parameters(), 1e-3)
  criterion = torch.nn.CrossEntropyLoss().to(device)

  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

  input = torch.randn(2, 3, 224, 224, device=device)
  target = torch.randint(0, 999, [input.shape[0]], device=device)

  output = model(input)
  loss = criterion(output, target)

  optimizer.zero_grad()
  with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
  optimizer.step()

@ll0iecas Sorry, I am in no way an expert on this, and I encountered this error outside this particular package. FYI, my problem was caused by a batch size that was too large.

@ReactiveCJ is probably right about the source of the error. However, in general, when using multiple GPUs or manually trying to use a GPU other than the default, it’s definitely best practice to call torch.cuda.set_device before you construct your model or call amp.initialize. Calling .to manually on your model is error-prone and might not catch everything (even if you aren’t using Amp).