iree: Dynamic stack allocations surviving in inner loops to runtime, causing 💀

While trying to fix our stack size in #11867 I found that posenet and efficientnet in our integration tests were failing with anything but 512KB of stack. That's a lot of stack, and surprising to me, as the models themselves barely have any tensors that large. Looking through the IR produced, it looks like instead of the stack allocation getting hoisted it ends up in the innermost loop. Because LLVM's alloca by default has function lifetime, each loop iteration allocates more stack space, and whether it blows the stack depends on the stack size and the shapes used by the model.
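To make the failure mode concrete, here is a minimal standalone sketch of the pattern (illustrative only, not IR taken from the models): a dynamically sized `memref.alloca` left inside an `scf.for` lowers to an `llvm.alloca` in the loop body, and since that stack space is only reclaimed when the function returns, every iteration reserves a fresh buffer.

```mlir
// Minimal reduction of the problem, not IR from the affected dispatches.
func.func @alloca_in_loop(%n: index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c1024 = arith.constant 1024 : index
  %c0_i32 = arith.constant 0 : i32
  scf.for %i = %c0 to %c1024 step %c1 {
    // Dynamic size: this can't become a static stack slot, and after lowering
    // it is an llvm.alloca executed on every iteration. The stack only shrinks
    // again at function return, so 1024 iterations reserve 1024 buffers of
    // %n i32s each.
    %buf = memref.alloca(%n) : memref<?xi32>
    memref.store %c0_i32, %buf[%c0] : memref<?xi32>
  }
  return
}
```

The dumps below show the same pattern surviving through the pipeline in one of the affected dispatches.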

```mlir
// -----// IR Dump After EmptyTensorToAllocTensor (empty-tensor-to-alloc-tensor) //----- //
module {
  func.func @main_dispatch_18_matmul_2304x144x24() {
    ...
    %6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_y]
    %7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_y]
    %8 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_id_x]
    %9 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_count_x]
    scf.for %arg0 = %6 to %c2304 step %7 {
      %10 = flow.dispatch.tensor.load %0, offsets = [%arg0, 0], sizes = [256, 24], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<2304x24xi8>> -> tensor<256x24xi8>
      scf.for %arg1 = %8 to %c144 step %9 {
        %11 = flow.dispatch.tensor.load %5, offsets = [%arg0, %arg1], sizes = [256, 72], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<2304x144xi8>> -> tensor<256x72xi8>
        %12 = flow.dispatch.tensor.load %2, offsets = [%arg1], sizes = [72], strides = [1] : !flow.dispatch.tensor<readonly:tensor<144xi32>> -> tensor<72xi32>
        %13 = flow.dispatch.tensor.load %1, offsets = [0, %arg1], sizes = [24, 72], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<24x144xi8>> -> tensor<24x72xi8>
        %14 = flow.dispatch.tensor.load %3, offsets = [%arg1], sizes = [72], strides = [1] : !flow.dispatch.tensor<readonly:tensor<144xi32>> -> tensor<72xi32>
        %15 = flow.dispatch.tensor.load %4, offsets = [%arg1], sizes = [72], strides = [1] : !flow.dispatch.tensor<readonly:tensor<144xi32>> -> tensor<72xi32>
        %extracted_slice = tensor.extract_slice %cst[%arg1] [72] [1] : tensor<144xi8> to tensor<72xi8>
        %16 = scf.for %arg2 = %c0 to %c256 step %c8 iter_args(%arg3 = %11) -> (tensor<256x72xi8>) {
          %extracted_slice_0 = tensor.extract_slice %10[%arg2, 0] [8, 24] [1, 1] : tensor<256x24xi8> to tensor<8x24xi8>
          %17 = scf.for %arg4 = %c0 to %c72 step %c32 iter_args(%arg5 = %arg3) -> (tensor<256x72xi8>) {
            %18 = affine.min affine_map<(d0) -> (-d0 + 72, 32)>(%arg4)
            %extracted_slice_1 = tensor.extract_slice %12[%arg4] [%18] [1] : tensor<72xi32> to tensor<?xi32>
            %extracted_slice_2 = tensor.extract_slice %13[0, %arg4] [24, %18] [1, 1] : tensor<24x72xi8> to tensor<24x?xi8>
            %19 = bufferization.alloc_tensor(%18) : tensor<8x?xi32>
            %20 = linalg.fill ins(%c0_i32 : i32) outs(%19 : tensor<8x?xi32>) -> tensor<8x?xi32>
```

After bufferization the `bufferization.alloc_tensor` becomes a `memref.alloca`, still inside the innermost loop:

```mlir
// -----// IR Dump After IREEComprehensiveBufferize (iree-codegen-iree-comprehensive-bufferize) //----- //
module {
  func.func @main_dispatch_18_matmul_2304x144x24() {
    ...
    %13 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_y]
    %14 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_y]
    %15 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_id_x]
    %16 = affine.apply affine_map<()[s0] -> (s0 * 72)>()[%workgroup_count_x]
    scf.for %arg0 = %13 to %c2304 step %14 {
      %subview = memref.subview %1[%arg0, 0] [256, 24] [1, 1] : memref<2304x24xi8, #hal.descriptor_type<storage_buffer>> to memref<256x24xi8, strided<[24, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
      scf.for %arg1 = %15 to %c144 step %16 {
        %subview_0 = memref.subview %11[%arg0, %arg1] [256, 72] [1, 1] : memref<2304x144xi8, #hal.descriptor_type<storage_buffer>> to memref<256x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
        %subview_1 = memref.subview %5[%arg1] [72] [1] : memref<144xi32, #hal.descriptor_type<storage_buffer>> to memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
        %subview_2 = memref.subview %3[0, %arg1] [24, 72] [1, 1] : memref<24x144xi8, #hal.descriptor_type<storage_buffer>> to memref<24x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
        %subview_3 = memref.subview %7[%arg1] [72] [1] : memref<144xi32, #hal.descriptor_type<storage_buffer>> to memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
        %subview_4 = memref.subview %9[%arg1] [72] [1] : memref<144xi32, #hal.descriptor_type<storage_buffer>> to memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
        %subview_5 = memref.subview %0[%arg1] [72] [1] : memref<144xi8> to memref<72xi8, strided<[1], offset: ?>>
        %17 = scf.for %arg2 = %c0 to %c256 step %c8 iter_args(%arg3 = %subview_0) -> (memref<256x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
          %subview_7 = memref.subview %subview[%arg2, 0] [8, 24] [1, 1] : memref<256x24xi8, strided<[24, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<8x24xi8, strided<[24, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
          %18 = scf.for %arg4 = %c0 to %c72 step %c32 iter_args(%arg5 = %arg3) -> (memref<256x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
            %19 = affine.min affine_map<(d0) -> (-d0 + 72, 32)>(%arg4)
            %subview_8 = memref.subview %subview_1[%arg4] [%19] [1] : memref<72xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<?xi32, strided<[1], offset: ?>, #hal.descriptor_type<storage_buffer>>
            %subview_9 = memref.subview %subview_2[0, %arg4] [24, %19] [1, 1] : memref<24x72xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>> to memref<24x?xi8, strided<[144, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
            %alloca = memref.alloca(%19) {alignment = 64 : i64} : memref<8x?xi32>
```

After conversion to the LLVM dialect the allocation is still inside the innermost loop, now as a runtime-sized `llvm.alloca`:

```mlir
        llvm.func @main_dispatch_18_matmul_2304x144x24(%arg0: !llvm.ptr<struct<"iree_hal_executable_environment_v0_t", (ptr<i32>, ptr<func<i32 (ptr<func<i32 (ptr<i8>, ptr<i8>, ptr<i8>)>>, ptr<i8>, ptr<i8>, ptr<i8>)>>, ptr<ptr<func<i32 (ptr<i8>, ptr<i8>, ptr<i8>)>>>, ptr<ptr<i8>>, struct<"iree_hal_processor_v0_t", (array<8 x i64>)>)>> {llvm.align = 16 : i64, llvm.noalias}, %arg1: !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (i32, i32, i16, i16, i32, i32, i16, i8, i8, ptr<i32>, ptr<ptr<i8>>, ptr<i64>)>> {llvm.align = 16 : i64, llvm.noalias}, %arg2: !llvm.ptr<struct<"iree_hal_executable_workgroup_state_v0_t", (i32, i32, i16, i16, i32, ptr<ptr<i8>>, i32)>> {llvm.align = 16 : i64, llvm.noalias}) -> i32 {
          ...
          %75 = llvm.mul %74, %5  : i64
          %76 = llvm.mul %74, %4  : i64
          %77 = llvm.mul %72, %14  : i64
          %78 = llvm.add %76, %77  : i64
          llvm.br ^bb1(%15 : i64)
        ^bb1(%79: i64):  // 2 preds: ^bb0, ^bb25
          %80 = llvm.icmp "slt" %79, %13 : i64
          llvm.cond_br %80, ^bb2, ^bb26
        ^bb2:  // pred: ^bb1
          %81 = llvm.mul %79, %10  : i64
          %82 = llvm.add %75, %81  : i64
          llvm.br ^bb3(%15 : i64)
        ^bb3(%83: i64):  // 2 preds: ^bb2, ^bb24
          %84 = llvm.icmp "slt" %83, %14 : i64
          llvm.cond_br %84, ^bb4, ^bb25
        ^bb4:  // pred: ^bb3
          %85 = llvm.mul %83, %3  : i64
          %86 = llvm.add %85, %14  : i64
          %87 = llvm.icmp "slt" %86, %12 : i64
          %88 = llvm.select %87, %86, %12 : i1, i64
          %89 = llvm.add %77, %83  : i64
          %90 = llvm.mul %88, %11  : i64
          %91 = llvm.mlir.null : !llvm.ptr<i32>
          %92 = llvm.getelementptr %91[%90] : (!llvm.ptr<i32>, i64) -> !llvm.ptr<i32>
          %93 = llvm.ptrtoint %92 : !llvm.ptr<i32> to i64
          %94 = llvm.alloca %93 x i32 {alignment = 64 : i64} : (i64) -> !llvm.ptr<i32>
```

The two TFLite models I noticed exhibiting this:

  • https://storage.googleapis.com/iree-model-artifacts/tflite-integration-tests/posenet_i8.tflite
  • https://storage.googleapis.com/iree-model-artifacts/efficientnet_lite0_int8_2.tflite

I passed those through a pre-release iree-import-tflite and then iree-compile with https://reviews.llvm.org/D141981 applied (as otherwise they can't compile at head).

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (14 by maintainers)

Most upvoted comments

Adding @hanhanW as well. It seems like the matmul pad + hoist is not kicking in for i8 types. This is resulting in the alloc not being static.

I am planning to strengthen the compile-time check to only allow buffers of static shape in the entry block. That's the only way to ensure statically bounded allocations.
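For illustration, a sketch of the kind of IR such a check would accept versus flag (hypothetical examples, not the actual verifier logic):

```mlir
// Accepted: static shape, allocated once in the entry block, so the stack
// usage of the dispatch is bounded and known at compile time.
func.func @accepted() {
  %buf = memref.alloca() {alignment = 64 : i64} : memref<8x32xi32>
  return
}

// Flagged: dynamic shape inside a loop. The size is unknown at compile time
// and the allocation repeats every iteration, exactly the pattern above.
func.func @flagged(%n: index) {
  %c0 = arith.constant 0 : index
  %c8 = arith.constant 8 : index
  %c256 = arith.constant 256 : index
  scf.for %i = %c0 to %c256 step %c8 {
    %buf = memref.alloca(%n) {alignment = 64 : i64} : memref<8x?xi32>
  }
  return
}
```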

I don't think we need padding/peeling. I think this used to work before the masking changes… I think the vector size chosen is 8x32 for the i and j loops. The issue is that 32 doesn't divide 144. Without masking/peeling/padding/etc. we would need to ensure that the tile size divides the problem size. That seems to have changed (I believe unintentionally) when the masking change landed (let me know if that is wrong). So I would try to get back to the original state (I think the tile size for vectorization was 8x16, which divides the problem size). With that the problem should go away.
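To spell out the divisibility point with the numbers from the dump above: the inner vector loop covers an extent of 72 (the per-workgroup slice of the 144 dimension) with step 32, so `affine.min affine_map<(d0) -> (-d0 + 72, 32)>` yields 32, 32, and then 8 on the last iteration. Because the value varies across iterations it cannot fold to a constant, the `memref.alloca` stays dynamically sized, and it survives as the runtime `llvm.alloca` shown above. A tile size that divides the extent evenly would let the `affine.min` fold to a constant and the allocation become static.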

@dcaballe what you are suggesting is probably true long term, but this is a regression for now until the complete solution can take its place.