iree: INTERNAL; CUDA driver error 'CUDA_ERROR_ILLEGAL_ADDRESS' (700): an illegal memory access was encountered;
What happened?
Hi, I’m getting a runtime error while running a model via IREE. The error is as follows:
RuntimeError: Error invoking function: SHARK-Runtime/runtime/src/iree/hal/drivers/cuda/stream_command_buffer.c:517: INTERNAL; CUDA driver error 'CUDA_ERROR_ILLEGAL_ADDRESS' (700): an illegal memory access was encountered; cuLaunchKernel; while invoking native function hal.device.queue.execute; while calling import;
[ 1] native hal.device.queue.execute:0 -
[ 0] bytecode module@0:63472 -
Killed
After reporting the above error, the process does not exit. It keeps allocating memory until it gets killed. In my case, it consumed all 84 GB of RAM plus 220 GB of swap before being killed.
Steps to reproduce your issue
To reproduce the issue, download the compiled vmfb from: https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/iree_cuda_illegal_address.vmfb
Then run the following command:
iree-run-module --device=cuda --module=iree_cuda_illegal_address.vmfb --function=forward --input=1x100xi64 --input=1x100xi64
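Because the failing run keeps allocating host memory until the kernel kills it, it may be safer to reproduce under a memory watchdog. The following is a minimal sketch, not part of IREE: run_with_rss_cap is a hypothetical helper name, it assumes Linux (/proc) and bash, and it only watches the resident set size of the launched process itself, not its children or swapped-out pages.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an IREE tool): run a command in the background and
# kill it once its resident set size (VmRSS) exceeds a cap given in kB.
# Usage: run_with_rss_cap <max_rss_kb> <command> [args...]
run_with_rss_cap() {
  local max_kb=$1
  shift
  "$@" &
  local pid=$!
  local rss_kb
  # Poll once per second for as long as the process is alive.
  while kill -0 "$pid" 2>/dev/null; do
    # VmRSS is reported in kB in /proc/<pid>/status on Linux.
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    if [ -n "$rss_kb" ] && [ "$rss_kb" -gt "$max_kb" ]; then
      echo "RSS ${rss_kb} kB exceeded cap ${max_kb} kB; killing PID ${pid}" >&2
      kill -9 "$pid"
      wait "$pid" 2>/dev/null
      return 137
    fi
    sleep 1
  done
  # The process exited on its own; propagate its exit status.
  wait "$pid"
}
```

For example, with an arbitrarily chosen cap of 8 GiB (8388608 kB), the command above could be wrapped as: run_with_rss_cap 8388608 iree-run-module --device=cuda --module=iree_cuda_illegal_address.vmfb --function=forward --input=1x100xi64 --input=1x100xi64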
To generate a vmfb locally, download the Linalg IR from: https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/iree_cuda_illegal_address_linalg.mlir
For compilation, run the following command:
./build/tools/iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvm-embedded-linker-path=iree/build/compiler/bindings/python/iree/compiler/_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs iree_cuda_illegal_address_linalg.mlir -o iree_cuda_illegal_address.vmfb
What component(s) does this issue relate to?
MLIR, Runtime
Version information
Commit Hash: cc43680728795cb7aa87999ffecca8ec10bed682
Additional context
No response
About this issue
- State: closed
- Created a year ago
- Comments: 16 (11 by maintainers)
Thanks @benvanik for the help! I think this needs to be fixed in Torch-MLIR.
I measured the memory usage using the htop command on Linux. If you just run the narrowed vmfb on your system, it will get killed within a few minutes because memory is exhausted.
Yeah, you can reproduce the issue locally if you just compile and run the dispatch IR (https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/module_forward_dispatch_4_cuda_nvptx_fb.mlir) or download and run the pre-compiled vmfb for the specified dispatch (https://storage.googleapis.com/shark_tank/vivek/iree_cuda_illegal_address/dispatch_4.vmfb).
Just FYI, this runtime error occurs on the CPU backend as well. There, the illegal memory access causes a segfault and the program stops immediately. It doesn’t exhaust memory on the CPU backend.