iree: Vulkan/CUDA runtime error: failed to wait on timepoint
What happened?
The Unet model failed on some AMD RDNA2/RDNA3 and NVIDIA A100 devices with the error:
<vm>:0: OK; failed to wait on timepoint;
[ 0] bytecode module.forward:131934 [
<eval_with_key>.9:5042:14,
<eval_with_key>.9:5039:15,
<eval_with_key>.9:5038:15,
However, all the dumped dispatches ran without any problem.
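For context on the error message: IREE's HAL semaphores act like timeline semaphores, where a waiter blocks until a monotonically increasing payload reaches a target "timepoint", and "failed to wait on timepoint" means such a wait did not complete. The sketch below is a toy illustration of that idea only, not IREE's actual implementation:

```python
import threading

class TimelineSemaphore:
    """Toy timeline semaphore: a monotonically increasing payload value.

    Waiters block until the payload reaches their target "timepoint".
    Illustrative only -- not IREE's HAL semaphore implementation.
    """

    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def signal(self, value):
        with self._cond:
            # Timeline values only move forward.
            if value < self._value:
                raise ValueError("timeline may not decrease")
            self._value = value
            self._cond.notify_all()

    def wait(self, timepoint, timeout=None):
        """Block until payload >= timepoint; False on timeout (a failed wait)."""
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= timepoint,
                                       timeout=timeout)

sem = TimelineSemaphore()
threading.Timer(0.05, sem.signal, args=(3,)).start()
ok = sem.wait(3, timeout=1.0)    # the timepoint is reached: True
bad = sem.wait(10, timeout=0.1)  # never signaled: the wait fails, False
```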
Version information
- Download the latest Unet model
- Compile command for Vulkan
iree-compile --iree-input-type=none --iree-hal-target-backends=vulkan --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-vulkan-target-triple=rdna2-7900-linux --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_vulkan_torch.mlir -o unet.vmfb
- Benchmark command for Vulkan
iree-benchmark-module --module=unet.vmfb --function=forward --device=vulkan --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0
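For reference, the `pad-size=32` option in the preprocessing pipeline above pads linalg op dimensions up to the next multiple of 32. The rounding arithmetic itself is simply (a standalone sketch, not the pass implementation):

```python
def pad_to_multiple(n: int, multiple: int = 32) -> int:
    """Round n up to the next multiple (the arithmetic behind pad-size=32)."""
    return ((n + multiple - 1) // multiple) * multiple

# e.g. a dimension of 77 would be padded to 96, while 64 is already aligned.
```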
Additional context
Also failed on A100; the commands for the CUDA path are as follows:
iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda --iree-llvmcpu-target-cpu-features=host --iree-hal-cuda-disable-loop-nounroll-wa --iree-hal-cuda-llvm-target-arch=sm_80 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))' unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda/unet_1_64_512_512_fp16_stable-diffusion-2-1-base_cuda_torch.mlir -o unet.vmfb
iree-benchmark-module --module=unet.vmfb --device=cuda --function=forward --input=1x4x64x64xf16 --input=1xf16 --input=2x64x1024xf16 --input=f32=1.0
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 22 (12 by maintainers)
Commits related to this issue
- Refreshing local stack state after VM import calls. Prior to the work adding wait frames it was not possible for the native imports to grow the VM stack. The bytecode dispatch loop was exploiting this... — committed to iree-org/iree by benvanik a year ago
- Refreshing local stack state after VM import calls. (#12809) Prior to the work adding wait frames it was not possible for the native imports to grow the VM stack. The bytecode dispatch loop was expl... — committed to iree-org/iree by benvanik a year ago
- Refreshing local stack state after VM import calls. (#12809) Prior to the work adding wait frames it was not possible for the native imports to grow the VM stack. The bytecode dispatch loop was expl... — committed to NatashaKnk/iree by benvanik a year ago
@dan-garvey, can you please take this over and verify whether the problem is resolved after Ben's fix? Lei seemed to notice NaNs after this fix, but let's verify on ToM IREE.
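To screen a run for the NaNs mentioned above, the output can be checked with numpy. This is a generic sketch; `result` here is a hypothetical stand-in for the tensor an actual Unet invocation would return:

```python
import numpy as np

# Hypothetical output tensor standing in for a real Unet result; in practice
# this would be the array obtained from the runtime after calling forward.
result = np.array([0.1, float("nan"), 0.3], dtype=np.float16)

nan_mask = np.isnan(result)        # elementwise NaN check
has_nans = bool(nan_mask.any())    # any NaN at all?
num_nans = int(nan_mask.sum())     # how many elements are NaN
```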
With the latest Python package (iree-compiler 20230328.472), I'm now getting a new error on Unet.
Not sure whether it's related to the old error, but it shows up as a different message.