iree: [Regression] `CUDA_ERROR_INVALID_VALUE` when benchmarking BertLargeTFBatch1024

What happened?

The twice-daily benchmarking workflow failed last night when running TF BertLarge Batch 1024:

work/runtime/src/iree/hal/drivers/cuda/stream_command_buffer.c:516: INTERNAL; CUDA driver error 'CUDA_ERROR_INVALID_VALUE' (1): invalid argument; cuLaunchKernel; while invoking native function hal.device.queue.execute; while calling import; 
[ 1]   native hal.device.queue.execute:0 -
[ 0] bytecode module.forward:44188 [
    /work/build-e2e-test-artifacts/e2e_test_artifacts/iree_BertLargeTFBatch1024_094e620948dc85b2b15454ae696216a4e0a1128d4d3cfb2980688dc9e54e0dfe.mlir:4326:13

Github workflow run: https://github.com/openxla/iree/actions/runs/4752495526/jobs/8443436700

Steps to reproduce your issue

On a machine set up to benchmark on an A100:

  1. Download compiled module:
gsutil cp gs://iree-github-actions-postsubmit-artifacts/4752495526/1/e2e-test-artifacts/iree_BertLargeTFBatch1024_module_88f03be31249d371d00c63d44d44732ad4c348bf2354bd2b3b042465d86f7183/module.vmfb /tmp
  2. Run benchmark:
iree-benchmark-module --module=/tmp/module.vmfb --function=forward --input=1024x384xi32=0 --input=1024x384xi32=0 --device_allocator=caching --device=cuda://0 --benchmark_repetitions=10
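
The two steps above can be combined into a single script, sketched below. This is a minimal sketch and not part of the original report: the log path, the function names `log_has_cuda_invalid_value` and `run_repro`, and the `RUN_REPRO` opt-in variable are hypothetical. The artifact URL and the benchmark flags are copied verbatim from the steps above.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps; URL and flags are copied from the report.
set -u

MODULE_URL="gs://iree-github-actions-postsubmit-artifacts/4752495526/1/e2e-test-artifacts/iree_BertLargeTFBatch1024_module_88f03be31249d371d00c63d44d44732ad4c348bf2354bd2b3b042465d86f7183/module.vmfb"
LOG_FILE="${LOG_FILE:-/tmp/bertlarge_repro.log}"  # hypothetical log path

# Returns 0 if the given log file contains the error named in this report.
log_has_cuda_invalid_value() {
  grep -q "CUDA_ERROR_INVALID_VALUE" "$1"
}

# Downloads the compiled module and runs the benchmark,
# capturing stdout and stderr in LOG_FILE.
run_repro() {
  gsutil cp "$MODULE_URL" /tmp || return 1
  iree-benchmark-module --module=/tmp/module.vmfb --function=forward \
    --input=1024x384xi32=0 --input=1024x384xi32=0 \
    --device_allocator=caching --device=cuda://0 \
    --benchmark_repetitions=10 >"$LOG_FILE" 2>&1
  # The benchmark's exit status is ignored; only the log is inspected.
  return 0
}

# Opt-in guard (an assumption of this sketch) so that sourcing the file
# never downloads or launches anything by accident.
if [ "${RUN_REPRO:-0}" = "1" ]; then
  run_repro && if log_has_cuda_invalid_value "$LOG_FILE"; then
    echo "reproduced: CUDA_ERROR_INVALID_VALUE"
  else
    echo "not reproduced"
  fi
fi
```

Running it with `RUN_REPRO=1` on the A100 machine performs both steps and reports whether the error string appears in the captured log. Without that variable the script only defines the helper functions.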

What component(s) does this issue relate to?

Runtime

Version information

Commit a806149e35b04a251de8187ad3e8c11af70480f4

Additional context

This is blocking our ability to benchmark and track performance progress.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 54 (50 by maintainers)

Most upvoted comments

Thanks Okwan for the triage!

Assigning to @KoolJBlack to confirm if #13294 addresses this issue.

@okkwon could you prefetch #13308 and check if it helps…