iree: Dynamic stack allocations surviving in inner loops to runtime, causing 💀
While trying to fix our stack size in #11867 I found that posenet and efficientnet in our integration tests were failing with anything but 512KB of stack. That's a lot of stack, and surprising to me since the models themselves barely have any tensors that large. Looking through the IR produced, it looks like instead of the stack allocation getting hoisted, it ends up in the innermost loop. Because LLVM's alloca has function lifetime by default, each loop iteration allocates more stack space, and whether it blows the stack depends on the stack size and the shapes used by the model.
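Reduced to its essentials, the problematic pattern looks like the sketch below (hypothetical IR, not taken from either model): the temporary buffer is sized by an `affine.min`, so it becomes a dynamically shaped `memref.alloca` inside the loop nest, and after lowering the corresponding `llvm.alloca` executes on every iteration while its memory is only reclaimed when the function returns.

```mlir
// Minimal sketch of the pattern (assumed shapes; not from the real dispatches).
func.func @dynamic_alloca_in_loop() {
  %c0 = arith.constant 0 : index
  %c32 = arith.constant 32 : index
  %c144 = arith.constant 144 : index
  %c0_i32 = arith.constant 0 : i32
  scf.for %j = %c0 to %c144 step %c32 {
    // 32 does not divide 144, so the last tile is narrower and the size is dynamic.
    %size = affine.min affine_map<(d0) -> (-d0 + 144, 32)>(%j)
    // Dynamic alloca inside the loop: it lowers to an llvm.alloca with function
    // lifetime, so the stack footprint grows with every iteration.
    %acc = memref.alloca(%size) {alignment = 64 : i64} : memref<8x?xi32>
    linalg.fill ins(%c0_i32 : i32) outs(%acc : memref<8x?xi32>)
  }
  return
}
```

The dumps below show the same shape in one of the failing dispatches.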
```mlir
// -----// IR Dump After EmptyTensorToAllocTensor (empty-tensor-to-alloc-tensor) //----- //
module {
func.func @main_dispatch_18_matmul_2304x144x24() {
...
%6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_y]
%7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_y]
%8 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_id_x]
%9 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_count_x]
scf.for %arg0 = %6 to %c2304 step %7 {
%10 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0], sizes = [256, 24], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<2304x24xi8>> -> tensor<256x24xi8>
scf.for %arg1 = %8 to %c144 step %9 {
%11 = flow.dispatch.tensor.load %5, offsets = [%arg0, %arg1], sizes = [256, 72], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<2304x144xi8>> -> tensor<256x72xi8>
%12 = flow.dispatch.tensor.load %2, offsets = [%arg1], sizes = [72], strides = [1] : !flow.dispatch.tensor<readonly:tensor<144xi32>> -> tensor<72xi32>
%13 = flow.dispatch.tensor.load %1, offsets = [0, %arg1], sizes = [24, 72], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<24x144xi8>> -> tensor<24x72xi8>
%14 = flow.dispatch.tensor.load %3, offsets = [%arg1], sizes = [72], strides = [1] : !flow.dispatch.tensor<readonly:tensor<144xi32>> -> tensor<72xi32>
%15 = flow.dispatch.tensor.load %4, offsets = [%arg1], sizes = [72], strides = [1] : !flow.dispatch.tensor<readonly:tensor<144xi32>> -> tensor<72xi32>
%extracted_slice = tensor.extract_slice %cst[%arg1] [72] [1] : tensor<144xi8> to tensor<72xi8>
%16 = scf.for %arg2 = %c0 to %c256 step %c8 iter_args(%arg3 = %11) -> (tensor<256x72xi8>) {
%extracted_slice_0 = tensor.extract_slice %10[%arg2, 0] [8, 24] [1, 1] : tensor<256x24xi8> to tensor<8x24xi8>
%17 = scf.for %arg4 = %c0 to %c72 step %c32 iter_args(%arg5 = %arg3) -> (tensor<256x72xi8>) {
%18 = affine.min affine_map<(d0) -> (-d0 + 72, 32)>(%arg4)
%extracted_slice_1 = tensor.extract_slice %12[%arg4] [%18] [1] : tensor<72xi32> to tensor<?xi32>
%extracted_slice_2 = tensor.extract_slice %13[0, %arg4] [24, %18] [1, 1] : tensor<24x72xi8> to tensor<24x?xi8>
%19 = bufferization.alloc_tensor(%18) : tensor<8x?xi32>
%20 = linalg.fill ins(%c0_i32 : i32) outs(%19 : tensor<8x?xi32>) -> tensor<8x?xi32>
...
```
```mlir
// -----// IR Dump After IREEComprehensiveBufferize (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
func.func @main_dispatch_18_matmul_2304x144x24() {
...
%13 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_y]
%14 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_y]
%15 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_id_x]
%16 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_count_x]
scf.for %arg0 = %13 to %c2304 step %14 {
%subview = memref.subview %1[%arg0, 0] [256, 24] [1, 1] : memref<2304x24xi8, #hal.descriptor_type<storage_buffer>> to memref<256x24xi8, strided<[24, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
scf.for %arg1 = %15 to %c144 step %16 {
%subview_0 = memref.subview %11[%arg0, %arg1] [256, 72] [1, 1] : memref<2304x144xi8, #hal.descriptor_type<storage_buffer>> to memref<256x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_1 = memref.subview %5[%arg1] [72] [1] : memref<144xi32, #hal.descriptor_type<storage_buffer>> to memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_2 = memref.subview %3[0, %arg1] [24, 72] [1, 1] : memref<24x144xi8, #hal.descriptor_type<storage_buffer>> to memref<24x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_3 = memref.subview %7[%arg1] [72] [1] : memref<144xi32, #hal.descriptor_type<storage_buffer>> to memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_4 = memref.subview %9[%arg1] [72] [1] : memref<144xi32, #hal.descriptor_type<storage_buffer>> to memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_5 = memref.subview %0[%arg1] [72] [1] : memref<144xi8> to memref<72xi8, strided<[1], offset: ?>>
%17 = scf.for %arg2 = %c0 to %c256 step %c8 iter_args(%arg3 = %subview_0) -> (memref<256x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
%subview_7 = memref.subview %subview[%arg2, 0] [8, 24] [1, 1] : memref<256x24xi8, strided<[24, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<8x24xi8, strided<[24, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%18 = scf.for %arg4 = %c0 to %c72 step %c32 iter_args(%arg5 = %arg3) -> (memref<256x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
%19 = affine.min affine_map<(d0) -> (-d0 + 72, 32)>(%arg4)
%subview_8 = memref.subview %subview_1[%arg4] [%19] [1] : memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<?xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%subview_9 = memref.subview %subview_2[0, %arg4] [24, %19] [1, 1] : memref<24x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<24x?xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
%alloca = memref.alloca(%19) {alignment = 64 : i64} : memref<8x?xi32>
...
```
After lowering to the LLVM dialect, the alloca ends up inside both loops (`^bb1` and `^bb3` are the loop headers):
```mlir
llvm.func @main_dispatch_18_matmul_2304x144x24(%arg0: !llvm.ptr<struct<"iree_hal_executable_environment_v0_t", (ptr<i32>, ptr<func<i32 (ptr<func<i32 (ptr<i8>, ptr<i8>, ptr<i8>)>>, ptr<i8>, ptr<i8>, ptr<i8>)>>, ptr<ptr<func<i32 (ptr<i8>, ptr<i8>, ptr<i8>)>>>, ptr<ptr<i8>>, struct<"iree_hal_processor_v0_t", (array<8 x i64>)>)>> {llvm.align = 16 : i64, llvm.noalias}, %arg1: !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr<i32>, ptr<ptr<i8>>, ptr<i64>)>> {llvm.align = 16 : i64, llvm.noalias}, %arg2: !llvm.ptr<struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr<ptr<i8>>, i32)>> {llvm.align = 16 : i64, llvm.noalias}) -> i32 {
...
%75 = llvm.mul %74, %5 : i64
%76 = llvm.mul %74, %4 : i64
%77 = llvm.mul %72, %14 : i64
%78 = llvm.add %76, %77 : i64
llvm.br ^bb1(%15 : i64)
^bb1(%79: i64): // 2 preds: ^bb0, ^bb25
%80 = llvm.icmp "slt" %79, %13 : i64
llvm.cond_br %80, ^bb2, ^bb26
^bb2: // pred: ^bb1
%81 = llvm.mul %79, %10 : i64
%82 = llvm.add %75, %81 : i64
llvm.br ^bb3(%15 : i64)
^bb3(%83: i64): // 2 preds: ^bb2, ^bb24
%84 = llvm.icmp "slt" %83, %14 : i64
llvm.cond_br %84, ^bb4, ^bb25
^bb4: // pred: ^bb3
%85 = llvm.mul %83, %3 : i64
%86 = llvm.add %85, %14 : i64
%87 = llvm.icmp "slt" %86, %12 : i64
%88 = llvm.select %87, %86, %12 : i1, i64
%89 = llvm.add %77, %83 : i64
%90 = llvm.mul %88, %11 : i64
%91 = llvm.mlir.null : !llvm.ptr<i32>
%92 = llvm.getelementptr %91[%90] : (!llvm.ptr<i32>, i64) -> !llvm.ptr<i32>
%93 = llvm.ptrtoint %92 : !llvm.ptr<i32> to i64
%94 = llvm.alloca %93 x i32 {alignment = 64 : i64} : (i64) -> !llvm.ptr<i32>
...
```
The two tflite models I noticed exhibiting this:
- https://storage.googleapis.com/iree-model-artifacts/tflite-integration-tests/posenet_i8.tflite
- https://storage.googleapis.com/iree-model-artifacts/efficientnet_lite0_int8_2.tflite

I passed those through a pre-release iree-import-tflite and then iree-compile with https://reviews.llvm.org/D141981 applied (as otherwise they can't compile at head).
About this issue
- State: closed
- Created a year ago
- Comments: 17 (14 by maintainers)
Commits related to this issue
- Make sure the selected workgroup size divides the problem size. (#11907) Without either pre-padding or masking not having the tile sizes divide the problem size for all tiling levels will result in u... — committed to iree-org/iree by MaheshRavishankar a year ago
- Make stack allocations checks in generated code on LLVMCPU more stringent (#11938) Earlier checks were ensuring that the stack allocation sizes are bounded, but without enforcing that these are hoist... — committed to iree-org/iree by MaheshRavishankar a year ago
- [LLVMCPU] Turn x86 quantized matmul back to padding approach. (#12227) This is fixed by https://github.com/llvm/llvm-project/commit/061201ec3d6d78ca5d5a583eb9141623ea3f66e7 Fixes https://github.co... — committed to iree-org/iree by hanhanW a year ago
Adding @hanhanW as well. It seems like the matmul pad + hoist is not kicking in for i8 types. This is resulting in the alloc not being static.
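As a point of reference, here is a hedged sketch (placeholder names and a zero pad value, not the actual codegen output) of what the pad approach does at the tensor level: the possibly-partial tile is padded up to the full static tile size, so temporaries derived from it keep a static shape that can be hoisted.

```mlir
// Hypothetical helper: pad an 8x<size> tile up to a full static 8x32 tile.
// %size is assumed to be the dynamic extent of %slice's second dimension.
func.func @pad_partial_tile(%slice: tensor<8x?xi8>, %size: index) -> tensor<8x32xi8> {
  %c0_i8 = arith.constant 0 : i8   // placeholder pad value (e.g. the zero point)
  %c32 = arith.constant 32 : index
  %high = arith.subi %c32, %size : index
  %padded = tensor.pad %slice low[0, 0] high[0, %high] {
  ^bb0(%i: index, %j: index):
    tensor.yield %c0_i8 : i8
  } : tensor<8x?xi8> to tensor<8x32xi8>
  return %padded : tensor<8x32xi8>
}
```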
I am planning to strengthen the compile-time check to only allow buffers of static shape in the entry block. That's the only way to ensure statically bounded allocations.
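A minimal sketch (hypothetical dispatch, assumed 8x16 tile) of the only pattern such a check would accept: a statically shaped alloca in the entry block, allocated once and reused by the loops.

```mlir
func.func @static_alloca_hoisted() {
  %c0 = arith.constant 0 : index
  %c16 = arith.constant 16 : index
  %c144 = arith.constant 144 : index
  %c0_i32 = arith.constant 0 : i32
  // Static shape, allocated once in the entry block: stack usage is bounded
  // regardless of how many loop iterations run.
  %acc = memref.alloca() {alignment = 64 : i64} : memref<8x16xi32>
  scf.for %j = %c0 to %c144 step %c16 {
    linalg.fill ins(%c0_i32 : i32) outs(%acc : memref<8x16xi32>)
  }
  return
}
```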
I don't think we need padding/peeling. I think this used to work before the masking changes… I think the vector size chosen is 8x32 for the i and j loops. The issue is that 32 doesn't divide 144. Without masking/peeling/padding/etc. we would need to ensure that the tile size divides the problem size. That seems to have changed (I believe unintentionally) when the masking change landed (let me know if that is wrong). So I would try to get back to the original state (I think the tile size for vectorization was 8x16, which divides the problem size). With that the problem should go away.
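Concretely, 144 = 4 * 32 + 16, so with a 32-wide tile the last j iteration only covers 16 columns and the tile size stays dynamic, whereas 144 / 16 = 9 exactly, so a 16-wide tile is always full and the buffer shape stays static. A small illustrative snippet (not from the actual pipeline):

```mlir
func.func @tail_tile_sizes(%iv: index) -> (index, index) {
  // Step 32 over [0, 144): the last iteration (%iv = 128) yields min(144 - 128, 32) = 16,
  // so the tile size (and any buffer sized from it) is dynamic.
  %t32 = affine.min affine_map<(d0) -> (-d0 + 144, 32)>(%iv)
  // Step 16 over [0, 144): 16 divides 144, so min(144 - %iv, 16) is 16 for every
  // in-range %iv and the size can fold to a constant.
  %t16 = affine.min affine_map<(d0) -> (-d0 + 144, 16)>(%iv)
  return %t32, %t16 : index, index
}
```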
@dcaballe what you are suggesting is probably true long term, but this is a regression for now till the complete solution can take its place.