iree: [Vulkan] UNet Compilation failure - "error:no GPU subgroup mma compute ops generated"
What happened?
When attempting to compile UNet for Vulkan (Ubuntu, NVIDIA A100 40GB GPU), compilation fails on multiple dispatches with error: no GPU subgroup mma compute ops generated
followed by error: failed to run translation of source executable to target executable for backend...
dispatch22.mlir:15:13: error: no GPU subgroup mma compute ops generated
%26:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>, affine_map<(d0, d1, d2) -> (d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%25, %arg2, %arg3, %arg4 : tensor<2x320x4096xf16>, tensor<320xf16>, tensor<320xf16>, tensor<2x320xf16>) outs(%22, %21 : tensor<2x320x4096xf16>, tensor<2x320x4096xf32>) {
^
dispatch22.mlir:2:3: note: called from
func.func @forward_dispatch_22_generic_2x320x4096x2880_f16(%arg0: tensor<320x2880xf16>, %arg1: tensor<2x2880x4096xf16>, %arg2: tensor<320xf16> , %arg3: tensor<320xf16>, %arg4: tensor<2x320xf16>) -> (tensor<2x320x4096xf16>, tensor<2x320x4096xf32>) {
^
dispatch22.mlir:15:13: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixNV], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_NV_cooperative_matrix]>, api=Vulkan, NVIDIA:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 49152, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 64], min_subgroup_size = 32, max_subgroup_size = 32, cooperative_matrix_properties_nv = [#spirv.coop_matrix_props<m_size = 8, n_size = 8, k_size = 32, a_type = i8, b_type = i8, c_type = i32, result_type = i32, scope = <Subgroup>>, #spirv.coop_matrix_props<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, scope = <Subgroup>>, #spirv.coop_matrix_props<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, scope = <Subgroup>>]>>}>
%26:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>, affine_map<(d0, d1, d2) -> (d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%25, %arg2, %arg3, %arg4 : tensor<2x320x4096xf16>, tensor<320xf16>, tensor<320xf16>, tensor<2x320xf16>) outs(%22, %21 : tensor<2x320x4096xf16>, tensor<2x320x4096xf32>) {
^
dispatch22.mlir:2:3: note: called from
func.func @forward_dispatch_22_generic_2x320x4096x2880_f16(%arg0: tensor<320x2880xf16>, %arg1: tensor<2x2880x4096xf16>, %arg2: tensor<320xf16> , %arg3: tensor<320xf16>, %arg4: tensor<2x320xf16>) -> (tensor<2x320x4096xf16>, tensor<2x320x4096xf32>) {
^
dispatch22.mlir:15:13: error: failed to serialize executables
%26:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1)>, affine_map<(d0, d1, d2) -> (d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%25, %arg2, %arg3, %arg4 : tensor<2x320x4096xf16>, tensor<320xf16>, tensor<320xf16>, tensor<2x320xf16>) outs(%22, %21 : tensor<2x320x4096xf16>, tensor<2x320x4096xf32>) {
^
dispatch22.mlir:2:3: note: called from
func.func @forward_dispatch_22_generic_2x320x4096x2880_f16(%arg0: tensor<320x2880xf16>, %arg1: tensor<2x2880x4096xf16>, %arg2: tensor<320xf16> , %arg3: tensor<320xf16>, %arg4: tensor<2x320xf16>) -> (tensor<2x320x4096xf16>, tensor<2x320x4096xf32>) {
^
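Note that the failing linalg.generic mixes f16 inputs with an f32 output (tensor<2x320x4096xf16> alongside tensor<2x320x4096xf32>), so the dispatch body contains an f16-to-f32 widening. A reduced, hypothetical sketch of the kind of op the SPIR-V cooperative matrix lowering must handle here (not the actual dispatch):

```mlir
// Illustration only: the mixed-precision output implies an f16 -> f32
// widening (arith.extf) inside the matmul epilogue, which the SPIR-V
// cooperative matrix lowering pipeline needs to support.
func.func @widen(%x: f16) -> f32 {
  %0 = arith.extf %x : f16 to f32
  return %0 : f32
}
```

This matches the eventual fix noted in the related commits below, which added `arith.extf` support to the cooperative matrix lowering.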
Steps to reproduce your issue
Attached: dispatch22_bug.txt, dispatch22_OG.txt
Compile command:
iree-compile dispatch22_bug.txt --iree-input-type=tm_tensor \
--iree-vm-bytecode-module-output-format=flatbuffer-binary \
--iree-hal-target-backends=vulkan \
--mlir-print-debuginfo --mlir-print-op-on-diagnostic=false \
--iree-vulkan-target-triple=ampere-a100-linux \
--iree-llvmcpu-target-cpu-features=host \
--iree-stream-resource-index-bits=64 \
--iree-vm-target-index-bits=64 \
--iree-vm-bytecode-module-strip-source-map=true \
--iree-util-zero-fill-elided-attrs \
-o dispatch22.vmfb \
--iree-preprocessing-pass-pipeline="builtin.module(func.func(iree-flow-detach-elementwise-from-named-ops,iree-flow-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32}))"
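For reproducing on the full model, the per-dispatch source (e.g. dispatch22.mlir above) can be dumped during compilation. A sketch, assuming the dump flag is available in the iree-compile build in use:

```shell
# Assumption: --iree-hal-dump-executable-sources-to is supported by this
# iree-compile build. It writes one .mlir file per dispatch executable
# into the given directory, which can then be compiled standalone.
iree-compile unet.mlir \
  --iree-hal-target-backends=vulkan \
  --iree-hal-dump-executable-sources-to=/tmp/dispatches \
  -o /dev/null
```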
What component(s) does this issue relate to?
No response
Version information
No response
Additional context
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (6 by maintainers)
Commits related to this issue
- Cherry-pick llvm/llvm-project@a0119437 Fixes https://github.com/openxla/iree/issues/14527 — committed to antiagainst/iree by antiagainst a year ago
- Cherry-pick llvm/llvm-project@a0119437 (#14552) This commit adds support for `arith.extf` in SPIR-V cooperative matrix lowering pipelines. Fixes https://github.com/openxla/iree/issues/14527 — committed to iree-org/iree by antiagainst a year ago
- Cherry-pick llvm/llvm-project@a0119437 (#14552) This commit adds support for `arith.extf` in SPIR-V cooperative matrix lowering pipelines. Fixes https://github.com/openxla/iree/issues/14527 — committed to plaidml/iree by antiagainst a year ago
OK, so I’ve checked the commit history for the SHARK-Runtime version given above and noticed it doesn’t actually include the llvm cherry-pick despite the date shown. However, there is still an issue with UNet compilation, and this time it doesn’t give any diagnostics.
There was a SHARK-Runtime version (20230802.198) on which I can successfully compile, which has @antiagainst's llvm cherry-pick as the latest commit, but the compile failure without diagnostics happens in the next release. I am working on triaging this new failure.
It’s failing around winograd transforms for me with the above input and command-line options. I assume it needs some pieces from the SHARK side? Can you provide the specific dispatch too?