iree: INTERNAL; CUDA driver error 'CUDA_ERROR_ILLEGAL_ADDRESS' (700): an illegal memory access was encountered;

What happened?

Hi, I’m getting a runtime error while running a model via IREE. The error is as follows:

RuntimeError: Error invoking function: SHARK-Runtime/runtime/src/iree/hal/drivers/cuda/stream_command_buffer.c:517: INTERNAL; CUDA driver error 'CUDA_ERROR_ILLEGAL_ADDRESS' (700): an illegal memory access was encountered; cuLaunchKernel; while invoking native function hal.device.queue.execute; while calling import; 
[ 1]   native hal.device.queue.execute:0 -
[ 0] bytecode module@0:63472 -
Killed

After throwing the above error, the process keeps allocating memory and eventually consumes everything available until it is killed. In my case, it used the full 84 GB of RAM plus 220 GB of swap before getting killed.

Steps to reproduce your issue

To reproduce the issue, download the compiled vmfb from: https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/iree_cuda_illegal_address.vmfb

Then run the following command:

iree-run-module --device=cuda --module=iree_cuda_illegal_address.vmfb --function=forward --input=1x100xi64 --input=1x100xi64

To generate a vmfb locally, download the Linalg IR from: https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/iree_cuda_illegal_address_linalg.mlir

For compilation, run the following command:

./build/tools/iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvm-embedded-linker-path=iree/build/compiler/bindings/python/iree/compiler/_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs iree_cuda_illegal_address_linalg.mlir -o iree_cuda_illegal_address.vmfb

What component(s) does this issue relate to?

MLIR, Runtime

Version information

Commit Hash: cc43680728795cb7aa87999ffecca8ec10bed682

Additional context

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

Hah! Yeah, was just about to post the same thing! So I guess we need to figure out whether this is something torch.aten.index.Tensor expects to work (negative indices in the list) that needs to be lowered differently into linalg (tensor.dim and an affine map or something), or whether the initial lowering was already supposed to handle that.

https://pytorch.org/cppdocs/notes/tensor_indexing.html doesn’t mention negatives, but it looks like they added some support at some point: pytorch/pytorch#229. This may come from a torch.flip: https://pytorch.org/docs/stable/generated/torch.flip.html#torch.flip (a quick illustration of the negative-index behavior is sketched below this comment).

(I’m not familiar with torch, but maybe someone else here knows - @rsuderman are you familiar with it?)
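For anyone following along, here is a minimal PyTorch sketch (mine, not from the report) of the behavior in question: advanced indexing wraps negative indices, which is the same result torch.flip produces explicitly, whereas a lowering that feeds those raw index values into a linalg gather would read out of bounds.

import torch

x = torch.arange(5)                 # tensor([0, 1, 2, 3, 4])
idx = torch.tensor([-1, -2])        # negative indices in the index list

# PyTorch advanced indexing wraps negatives: -1 -> 4, -2 -> 3.
print(x[idx])                       # tensor([4, 3])

# torch.flip expresses the same reversal explicitly, with non-negative indices.
print(torch.flip(x, dims=[0])[:2])  # tensor([4, 3])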

Thanks @benvanik for the help! I think this needs to be fixed in Torch-MLIR.

That dispatch should just need 1211521024 * 2 bytes to complete, so I’m not sure how it could consume memory forever. How are you measuring that?
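Just to restate that arithmetic (same numbers as above):

elements = 1211521024
expected_bytes = elements * 2       # 2423042048 bytes
print(expected_bytes / 2**30)       # ~2.26 GiB, nowhere near 84 GB RAM + 220 GB swap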

I measured the memory usage with htop on Linux. If you run the narrowed vmfb on your system, it will get killed within a few minutes because memory is exhausted.
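In case it helps narrow things down, here is a rough sketch (my own, assuming Linux and the reproduction command from the description) for capturing the peak resident memory of the run instead of watching htop:

import resource
import subprocess

# Reproduction command from the description; adjust the module path as needed.
cmd = [
    "iree-run-module", "--device=cuda",
    "--module=iree_cuda_illegal_address.vmfb",
    "--function=forward", "--input=1x100xi64", "--input=1x100xi64",
]
subprocess.run(cmd, check=False)

# Peak resident set size of all waited-for children (kilobytes on Linux).
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of the run: {peak_kb / (1024 * 1024):.2f} GiB")

This only captures resident memory, not swap, but it should be enough to see whether the runaway growth reproduces.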

Does it reproduce if you compile and run the benchmark MLIR you posted?

Yeah, you can reproduce the issue locally if you compile and run the dispatch IR (https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/module_forward_dispatch_4_cuda_nvptx_fb.mlir) or download and run the pre-compiled vmfb for that dispatch (https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/dispatch_4.vmfb).

The dispatch itself is interesting because it’s doing a gather, and maybe the indexing math is bad. If we can isolate it, we could try running it on other targets (CPU with ASAN, etc.) to see whether it’s a more general correctness issue or something CUDA-specific.

Just FYI, this runtime error occurs on the CPU backend as well. There it gives a segfault because of the illegal memory access and stops immediately; it does not exhaust memory on the CPU backend.