iree: INTERNAL; CUDA driver error 'CUDA_ERROR_ILLEGAL_ADDRESS' (700): an illegal memory access was encountered;
What happened?
Hi, I’m getting a runtime error while running a model via IREE. The error is as follows:
RuntimeError: Error invoking function: SHARK-Runtime/runtime/src/iree/hal/drivers/cuda/stream_command_buffer.c:517: INTERNAL; CUDA driver error 'CUDA_ERROR_ILLEGAL_ADDRESS' (700): an illegal memory access was encountered; cuLaunchKernel; while invoking native function hal.device.queue.execute; while calling import;
[ 1] native hal.device.queue.execute:0 -
[ 0] bytecode module@0:63472 -
Killed
After throwing the above error, the process keeps allocating memory until it is killed. In my case, it consumed all 84 GB of RAM and 220 GB of swap before being killed.
Steps to reproduce your issue
To reproduce the issue, download the compiled vmfb from: https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/iree_cuda_illegal_address.vmfb
Then run the following command:
iree-run-module --device=cuda --module=iree_cuda_illegal_address.vmfb --function=forward --input=1x100xi64 --input=1x100xi64
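Since the process keeps allocating after the error, it may be safer to run the reproducer under a small watchdog rather than letting it consume all RAM and swap. The following is a minimal sketch, not part of IREE: the helper names `rss_kib` and `run_with_rss_cap` are hypothetical, it assumes Linux (it reads `/proc/<pid>/status`), and it assumes the runaway allocation shows up in the process's resident set size.

```python
import subprocess
import time

# The reproducer command from this issue, as an argument list.
REPRO_CMD = [
    "iree-run-module",
    "--device=cuda",
    "--module=iree_cuda_illegal_address.vmfb",
    "--function=forward",
    "--input=1x100xi64",
    "--input=1x100xi64",
]


def rss_kib(pid):
    """Return the resident set size of `pid` in KiB, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # value is reported in kB
    except (FileNotFoundError, ProcessLookupError):
        pass
    return None  # process exited, is a zombie, or /proc is not available


def run_with_rss_cap(cmd, cap_kib, interval_s=0.5):
    """Run `cmd`, polling its RSS, and kill it once RSS exceeds `cap_kib`.

    Returns (returncode, peak_rss_kib, killed_by_watchdog).
    """
    proc = subprocess.Popen(cmd)
    peak = 0
    killed = False
    while proc.poll() is None:
        rss = rss_kib(proc.pid)
        if rss is not None:
            peak = max(peak, rss)
            if rss > cap_kib:
                proc.kill()  # sends SIGKILL on POSIX
                killed = True
                break
        time.sleep(interval_s)
    proc.wait()
    return proc.returncode, peak, killed
```

For example, `run_with_rss_cap(REPRO_CMD, cap_kib=8 * 1024 * 1024)` would stop the run at roughly 8 GiB of resident memory; the cap value is arbitrary and should be chosen for the machine at hand.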
To generate the vmfb locally, download the Linalg IR from: https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/iree_cuda_illegal_address_linalg.mlir
For compilation, run the following command:
./build/tools/iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvm-embedded-linker-path=iree/build/compiler/bindings/python/iree/compiler/_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs iree_cuda_illegal_address_linalg.mlir -o iree_cuda_illegal_address.vmfb
What component(s) does this issue relate to?
MLIR, Runtime
Version information
Commit Hash: cc43680728795cb7aa87999ffecca8ec10bed682
Additional context
No response
About this issue
- State: closed
- Created a year ago
- Comments: 16 (11 by maintainers)
Thanks @benvanik for the help! I think this needs to be fixed in Torch-MLIR.
I measured the memory usage using the htop command on Linux. If you just run the narrowed vmfb on your system, it will get killed within a few minutes because memory is exhausted.
Yeah, you can reproduce the issue locally if you compile and run the dispatch IR (https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/module_forward_dispatch_4_cuda_nvptx_fb.mlir) or download and run the pre-compiled vmfb for the specified dispatch (https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/dispatch_4.vmfb).
Just FYI, this runtime error occurs on the CPU backend as well. There, the illegal memory access causes a segfault and the process stops immediately; it doesn’t exhaust memory on the CPU backend.
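The two failure modes described in this thread (a segfault on CPU, "Killed" on CUDA) can be told apart from the exit status when the tool is launched from a script. A minimal sketch, assuming a POSIX system, where Python's `subprocess` reports death by signal as a negative return code; `describe_exit` is a hypothetical helper name:

```python
import signal


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable description."""
    if returncode >= 0:
        return f"exited with status {returncode}"
    try:
        # On POSIX, a return code of -N means the child was terminated by signal N.
        name = signal.Signals(-returncode).name
    except ValueError:
        name = f"signal {-returncode}"  # not a signal Python has a name for
    return f"terminated by {name}"
```

A segfault would be reported as `terminated by SIGSEGV`, while a process killed by the kernel's out-of-memory handling would typically be reported as `terminated by SIGKILL`.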