iree: [Regression] `CUDA_ERROR_INVALID_VALUE` when benchmarking BertLargeTFBatch1024
What happened?
The twice-daily benchmarking workflow failed last night when running TF BertLarge Batch 1024:
work/runtime/src/iree/hal/drivers/cuda/stream_command_buffer.c:516: INTERNAL; CUDA driver error 'CUDA_ERROR_INVALID_VALUE' (1): invalid argument; cuLaunchKernel; while invoking native function hal.device.queue.execute; while calling import;
[ 1] native hal.device.queue.execute:0 -
[ 0] bytecode module.forward:44188 [
/work/build-e2e-test-artifacts/e2e_test_artifacts/iree_BertLargeTFBatch1024_094e620948dc85b2b15454ae696216a4e0a1128d4d3cfb2980688dc9e54e0dfe.mlir:4326:13
Github workflow run: https://github.com/openxla/iree/actions/runs/4752495526/jobs/8443436700
Steps to reproduce your issue
On a machine set up to benchmark on an A100:
- Download compiled module:
gsutil cp gs://iree-github-actions-postsubmit-artifacts/4752495526/1/e2e-test-artifacts/iree_BertLargeTFBatch1024_module_88f03be31249d371d00c63d44d44732ad4c348bf2354bd2b3b042465d86f7183/module.vmfb /tmp
- Run benchmark:
iree-benchmark-module --module=/tmp/module.vmfb --function=forward --input=1024x384xi32=0 --input=1024x384xi32=0 --device_allocator=caching --device=cuda://0 --benchmark_repetitions=10
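The two steps above can be chained into one script so the repro is easier to rerun. This is only a sketch: the artifact path and all flags are copied from the commands above, while the `REPRO_DIR` and `REPRO_RUN` variables and the helper function names are made up for illustration. By default it only prints the commands; it executes them when `REPRO_RUN=1`.

```shell
#!/bin/sh
# Repro sketch: download the compiled module, then benchmark it.
# REPRO_DIR and REPRO_RUN are hypothetical helper variables, not IREE options.
set -eu

REPRO_DIR="${REPRO_DIR:-/tmp}"
MODULE_URL="gs://iree-github-actions-postsubmit-artifacts/4752495526/1/e2e-test-artifacts/iree_BertLargeTFBatch1024_module_88f03be31249d371d00c63d44d44732ad4c348bf2354bd2b3b042465d86f7183/module.vmfb"

run() {
  # Always print the command; execute it only when REPRO_RUN=1.
  printf '+ %s\n' "$*"
  if [ "${REPRO_RUN:-0}" = "1" ]; then
    "$@"
  fi
}

download_module() {
  run gsutil cp "$MODULE_URL" "$REPRO_DIR"
}

benchmark_module() {
  # Same flags as the manual command in the report.
  run iree-benchmark-module \
    "--module=$REPRO_DIR/module.vmfb" \
    --function=forward \
    --input=1024x384xi32=0 \
    --input=1024x384xi32=0 \
    --device_allocator=caching \
    --device=cuda://0 \
    --benchmark_repetitions=10
}

download_module
benchmark_module
```

Saved under a name of your choice, it can be run as `REPRO_RUN=1 sh <script>` on the A100 machine. Because of `set -e`, a failed download stops the script before the benchmark step.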
What component(s) does this issue relate to?
Runtime
Version information
Commit a806149e35b04a251de8187ad3e8c11af70480f4
Additional context
This is blocking our ability to benchmark and track performance progress.
About this issue
- State: closed
- Created a year ago
- Comments: 54 (50 by maintainers)
Commits related to this issue
- Disable TF Bert-Large due to https://github.com/openxla/iree/issues/13211 — committed to mariecwhite/iree by mariecwhite a year ago
- Disable TF Bert-Large due to #13211 (#13212) — committed to iree-org/iree by mariecwhite a year ago
- Revert "Disable TF Bert-Large due to #13211 (#13212)" This reverts commit 47da9cf8584dffe31a6834f12b62bfff347bdef3. — committed to ThomasRaoux/iree by ThomasRaoux a year ago
- Disable TF Bert-Large due to #13211 (#13212) — committed to NatashaKnk/iree by mariecwhite a year ago
Thanks Okwan for the triage!
Assigning to @KoolJBlack to confirm if #13294 addresses this issue.
@okkwon could you prefetch #13308 and check if it helps…