iree: Vulkan/CUDA runtime error: failed to wait on timepoint

What happened?

Unet failed on some AMD RDNA2/RDNA3 and NVIDIA A100 devices with the following error:

<vm>:0: OK; failed to wait on timepoint; 
[ 0] bytecode module.forward:131934 [
    <eval_with_key>.9:5042:14,
    <eval_with_key>.9:5039:15,
    <eval_with_key>.9:5038:15,

However, all the dumped dispatches ran without any problem.

Version information

  1. Download the latest Unet model
  2. Compile command for Vulkan:
     iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna2-7900-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o unet.vmfb
  3. Benchmark command for Vulkan:
     iree-benchmark-module --module=unet.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0

Additional context

It also failed on A100. The commands for the CUDA path are as follows:

iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvmcpu-target-cpu-features=host --iree-hal-cuda-disable-loop-nounroll-wa --iree-hal-cuda-llvm-target-arch=sm_80 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda_torch.mlir -o unet.vmfb

iree-benchmark-module --module=unet.vmfb --device=cuda --function=forward --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0
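
Since the Vulkan and CUDA commands share most of their flags, a small helper that assembles both can make it easier to rerun the repro on each backend. The following is only a sketch: the function and variable names (compile_cmd, benchmark_cmd, COMMON_FLAGS, BACKEND_FLAGS) are made up for illustration, every flag is copied verbatim from the commands in this report, and it assumes that the order of flags does not matter to the tools. It only prints the command lines and does not run them.

```python
import shlex

# Flags that appear in both compile commands quoted in this report.
COMMON_FLAGS = [
    "--iree-input-type=none",
    "--iree-stream-resource-index-bits=64",
    "--iree-vm-target-index-bits=64",
    "--iree-util-zero-fill-elided-attrs",
    "--iree-preprocessing-pass-pipeline=builtin.module(func.func("
    "iree-flow-detach-elementwise-from-named-ops,"
    "iree-flow-convert-1x1-filter-conv2d-to-matmul,"
    "iree-preprocessing-convert-conv2d-to-img2col,"
    "iree-preprocessing-pad-linalg-ops{pad-size=32}))",
]

# Backend-specific flags, copied as written in the report
# (including the single-dash Vulkan triple flag).
BACKEND_FLAGS = {
    "vulkan": [
        "--iree-hal-target-backends=vulkan",
        "-iree-vulkan-target-triple=rdna2-7900-linux",
    ],
    "cuda": [
        "--iree-hal-target-backends=cuda",
        "--iree-vm-bytecode-module-output-format=flatbuffer-binary",
        "--iree-llvmcpu-target-cpu-features=host",
        "--iree-hal-cuda-disable-loop-nounroll-wa",
        "--iree-hal-cuda-llvm-target-arch=sm_80",
    ],
}

# Input specs passed to iree-benchmark-module in the report.
INPUTS = ["1x4x64x64xf16", "1xf16", "2x64x1024xf16", "f32=1.0"]


def compile_cmd(backend, mlir_path, out="unet.vmfb"):
    """Return the iree-compile argv list for the given backend (hypothetical helper)."""
    return ["iree-compile", *COMMON_FLAGS, *BACKEND_FLAGS[backend], mlir_path, "-o", out]


def benchmark_cmd(backend, module="unet.vmfb"):
    """Return the iree-benchmark-module argv list (hypothetical helper)."""
    return [
        "iree-benchmark-module",
        f"--module={module}",
        "--function=forward",
        f"--device={backend}",
        *[f"--input={spec}" for spec in INPUTS],
    ]


if __name__ == "__main__":
    base = "unet_1_64_512_512_fp16_stable-diffusion-2-1-base"
    for backend in ("vulkan", "cuda"):
        mlir = f"{base}_{backend}/{base}_{backend}_torch.mlir"
        # shlex.join quotes the pass pipeline so the line can be pasted into a shell.
        print(shlex.join(compile_cmd(backend, mlir)))
        print(shlex.join(benchmark_cmd(backend)))
```

Passing the returned list to subprocess.run would execute a command directly, without going through shell quoting.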

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (12 by maintainers)

Most upvoted comments

@dan-garvey can you please take this over and verify whether Ben’s fix resolves the problem? Lei seemed to notice NaNs after this fix, but let’s verify on ToM IREE.

With the latest Python package (iree-compiler 20230328.472), I’m now getting a new error on Unet:

ValueError: Error invoking function: main_checkout/runtime/src/iree/vm/ref.h:181: INVALID_ARGUMENT; ref is null; while invoking native function hal.buffer.assert; while calling import; 
[ 1]   native hal.buffer.assert:0 -
[ 0] bytecode module.forward:8994 [unknown]
Aborted (core dumped)

I’m not sure whether it’s related to the old error, but it shows a different message.