iree: Unknown upper bound allocation due to LinalgStrategyPeelPass

We found that LinalgStrategyPeelPass folds away an affine.min that feeds a tensor.empty, which prevents a later check from computing an upper bound for the corresponding stack allocation.

This was originally found while debugging a failing test, iree_tf_tests/math/llvmcpu__dynamic_dim_softmax.run, with a WIP change (#10770).

Here is a crafted example to reproduce the issue:

func.func @peel_dyn(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %dim0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
  %dim1 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
  // alloc_buf is left uninitialized so it does not get optimized away.
  %alloc_buf = tensor.empty(%dim0) : tensor<?xf32>
  %out = tensor.empty(%dim0, %dim1) : tensor<?x?xf32>
  %res = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]
  } ins(%arg0, %arg1, %alloc_buf : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%out : tensor<?x?xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32, %z: f32):
    %w = arith.maxf %a, %b : f32
    %x = arith.minf %w, %c : f32
    linalg.yield %x : f32
  } -> tensor<?x?xf32>
  return %res : tensor<?x?xf32>
}
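For reference, the elementwise computation performed by the linalg.generic above is equivalent to the following Python sketch (the function name is illustrative, not part of the reproducer):

```python
def peel_dyn_reference(arg0, arg1, alloc_buf):
    """Elementwise: out[i][j] = min(max(arg0[i][j], arg1[i][j]), alloc_buf[i])."""
    rows, cols = len(arg0), len(arg0[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            w = max(arg0[i][j], arg1[i][j])   # arith.maxf
            out[i][j] = min(w, alloc_buf[i])  # arith.minf
    return out
```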

Compiling the example with iree-compile --iree-hal-target-backends=llvm-cpu produces the following error:

error: 'memref.alloca' op expected no stack allocations without upper bound shapes

What Happened

Before LinalgStrategyPeelPass, the example is tiled into:

// -----// IR Dump Before LinalgStrategyPeelPass (iree-linalg-strategy-peel-pass) //----- //
func.func @peel_dyn_dispatch_0_generic_DxD() {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.constant.load[0] : i32
  %1 = hal.interface.constant.load[1] : i32
  %2 = hal.interface.constant.load[2] : i32
  %3 = hal.interface.constant.load[3] : i32
  %4 = arith.index_castui %0 : i32 to index
  %5 = arith.index_castui %1 : i32 to index
  %6 = arith.index_castui %2 : i32 to index
  %7 = arith.index_castui %3 : i32 to index
  %8 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7}
  %9 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5}
  %10 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %11 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_y]
  %12 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_y]
  %13 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
  %14 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
  scf.for %arg0 = %11 to %6 step %12 {
    %15 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg0)[%6]
    scf.for %arg1 = %13 to %7 step %14 {
      %16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg1)[%7]
      %17 = flow.dispatch.tensor.load %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %18 = flow.dispatch.tensor.load %8, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %19 = flow.dispatch.tensor.load %9, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5} -> tensor<?x?xf32>
      %dim = tensor.dim %18, %c0 : tensor<?x?xf32>
      %dim_0 = tensor.dim %18, %c1 : tensor<?x?xf32>
      %20 = scf.for %arg2 = %c0 to %dim step %c4 iter_args(%arg3 = %17) -> (tensor<?x?xf32>) {
        %21 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg2)[%dim]
        %22 = tensor.empty(%21) : tensor<?xf32>
        %23 = scf.for %arg4 = %c0 to %dim_0 step %c4 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
          %24 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg4)[%dim_0]
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %25 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %22 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs =  {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %26 = arith.maxf %in, %in_3 : f32
            %27 = arith.minf %26, %in_4 : f32
            linalg.yield %27 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %25 into %arg5[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %23 : tensor<?x?xf32>
      }
      flow.dispatch.tensor.store %20, %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
    }
  }
  return
}

After LinalgStrategyPeelPass, the output is:

// -----// IR Dump After LinalgStrategyPeelPass (iree-linalg-strategy-peel-pass) //----- //
func.func @peel_dyn_dispatch_0_generic_DxD() {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.constant.load[0] : i32
  %1 = hal.interface.constant.load[1] : i32
  %2 = hal.interface.constant.load[2] : i32
  %3 = hal.interface.constant.load[3] : i32
  %4 = arith.index_castui %0 : i32 to index
  %5 = arith.index_castui %1 : i32 to index
  %6 = arith.index_castui %2 : i32 to index
  %7 = arith.index_castui %3 : i32 to index
  %8 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7}
  %9 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5}
  %10 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %11 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_y]
  %12 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_y]
  %13 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
  %14 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
  scf.for %arg0 = %11 to %6 step %12 {
    %15 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg0)[%6]
    scf.for %arg1 = %13 to %7 step %14 {
      %16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg1)[%7]
      %17 = flow.dispatch.tensor.load %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %18 = flow.dispatch.tensor.load %8, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %19 = flow.dispatch.tensor.load %9, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5} -> tensor<?x?xf32>
      %dim = tensor.dim %18, %c0 : tensor<?x?xf32>
      %dim_0 = tensor.dim %18, %c1 : tensor<?x?xf32>
      %20 = affine.apply affine_map<()[s0, s1, s2] -> (s1 - (s1 - s0) mod s2)>()[%c0, %dim, %c4]
      %21 = scf.for %arg2 = %c0 to %20 step %c4 iter_args(%arg3 = %17) -> (tensor<?x?xf32>) {
        %23 = tensor.empty(%c4) : tensor<?xf32>
        %24 = affine.apply affine_map<()[s0, s1, s2] -> (s1 - (s1 - s0) mod s2)>()[%c0, %dim_0, %c4]
        %25 = scf.for %arg4 = %c0 to %24 step %c4 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %27 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %23 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs =  {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %28 = arith.maxf %in, %in_3 : f32
            %29 = arith.minf %28, %in_4 : f32
            linalg.yield %29 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %27 into %arg5[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        %26 = scf.for %arg4 = %24 to %dim_0 step %c4 iter_args(%arg5 = %25) -> (tensor<?x?xf32>) {
          %27 = affine.apply affine_map<(d0, d1) -> (-d0 + d1)>(%arg4, %dim_0)
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %28 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %23 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs =  {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %29 = arith.maxf %in, %in_3 : f32
            %30 = arith.minf %29, %in_4 : f32
            linalg.yield %30 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %28 into %arg5[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %26 : tensor<?x?xf32>
      }
      %22 = scf.for %arg2 = %20 to %dim step %c4 iter_args(%arg3 = %21) -> (tensor<?x?xf32>) {
        %23 = affine.apply affine_map<(d0, d1) -> (-d0 + d1)>(%arg2, %dim)
        %24 = tensor.empty(%23) : tensor<?xf32>
        %25 = scf.for %arg4 = %c0 to %dim_0 step %c4 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
          %26 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg4)[%dim_0]
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %27 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %24 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs =  {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %28 = arith.maxf %in, %in_3 : f32
            %29 = arith.minf %28, %in_4 : f32
            linalg.yield %29 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %27 into %arg5[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %25 : tensor<?x?xf32>
      }
      flow.dispatch.tensor.store %22, %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
    }
  }
  return
}

Notice that after peeling, the size of the tensor allocation in the remainder loop is computed by a plain affine.apply due to the optimization [1]:

%23 = affine.apply affine_map<(d0, d1) -> (-d0 + d1)>(%arg2, %dim)
%24 = tensor.empty(%23) : tensor<?xf32>

whereas before peeling it was:

%21 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg2)[%dim]
%22 = tensor.empty(%21) : tensor<?xf32>

This causes the later LLVMCPUCheckIRBeforeLLVMConversion pass to fail to derive an upper bound for the tensor allocation [2].
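To see why the bound disappears, consider the peeling arithmetic. The pass splits the iteration range at split = ub - (ub - lb) % step (the formula in the affine.apply above), so the main loop only ever sees full tiles of the constant size step, while each remainder tile's size is emitted as a bare subtraction ub - i. A Python sketch of this arithmetic (illustrative, not the actual pass):

```python
def peel_loop(lb, ub, step):
    """Split [lb, ub) into full step-sized tiles plus a remainder, as (start, size) pairs."""
    split = ub - (ub - lb) % step  # same formula as the affine.apply emitted after peeling
    main_tiles = [(i, step) for i in range(lb, split, step)]   # size is the constant `step`
    rem_tiles = [(i, ub - i) for i in range(split, ub, step)]  # size is `ub - i`, a bare subtraction
    # Mathematically, every remainder tile is still smaller than `step`.
    assert all(size < step for _, size in rem_tiles)
    return main_tiles, rem_tiles
```

Before peeling, the tile size was written as min(ub - i, step), which carries the constant step as a syntactic upper bound; after peeling, the remainder size is just ub - i, and the check has no way to recover the size < step invariant from that expression alone.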

About this issue

  • State: open
  • Created 2 years ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

I agree with the approach and principles described above. However, from a production perspective, I wonder if we are not prioritizing performance over support/coverage. We may want the compiler to generate “slower” code that can run instead of bailing out of compilation. Is the use of shared memory always mandatory? Are we able to fall back to global memory for cases where we exceed a specific shared memory threshold? Have we considered heap allocation or the use of smart allocators for some targets? (Sorry, lots of questions 😃) I think these points may become more relevant as we invest more in datacenter support.

All of this should actually be factored into the tile size selection, using this limit while deciding which tile sizes / configurations to choose… Ben can probably explain the tradeoffs for the other questions.
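One way to read the suggestion above: fold the stack-allocation limit into tile-size selection, so the chosen tiles never produce an over-budget scratch buffer. A minimal sketch of such a heuristic, where the function name, candidate list, and byte budget are all hypothetical:

```python
def pick_tile_size(candidate_sizes, elem_bytes, stack_budget_bytes):
    """Pick the largest candidate tile size whose scratch buffer fits the stack budget."""
    for size in sorted(candidate_sizes, reverse=True):
        if size * elem_bytes <= stack_budget_bytes:
            return size
    # No candidate fits; a caller could fall back to heap/global memory here
    # instead of failing compilation, per the comment above.
    return None
```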