iree: Unknown upper bound allocation due to LinalgStrategyPeelPass
We found that `LinalgStrategyPeelPass` optimizes away some `affine.min` ops that feed `tensor.empty`, which prevents a later check from computing an upper bound for the memory allocation.

This was originally found while debugging a failing test, `iree_tf_tests/math/llvmcpu__dynamic_dim_softmax.run`, with a WIP change (#10770). Here is a crafted example that reproduces the issue:
```mlir
func.func @peel_dyn(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %dim0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
  %dim1 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
  // alloc_buf is left uninitialized so it is not optimized away.
  %alloc_buf = tensor.empty(%dim0) : tensor<?xf32>
  %out = tensor.empty(%dim0, %dim1) : tensor<?x?xf32>
  %res = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]
  } ins(%arg0, %arg1, %alloc_buf : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%out : tensor<?x?xf32>) {
  ^bb0(%a: f32, %b: f32, %c: f32, %z: f32):
    %w = arith.maxf %a, %b : f32
    %x = arith.minf %w, %c : f32
    linalg.yield %x : f32
  } -> tensor<?x?xf32>
  return %res : tensor<?x?xf32>
}
```
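For readers less familiar with `linalg.generic`, the kernel above computes, elementwise, `out[i][j] = min(max(arg0[i][j], arg1[i][j]), alloc_buf[i])`; the 1-D input is broadcast along the second dimension via its `(d0, d1) -> (d0)` indexing map. A minimal Python sketch of the same semantics (plain nested lists, no IREE involved):

```python
def peel_dyn_ref(a, b, buf):
    """Reference semantics of the linalg.generic above:
    out[i][j] = min(max(a[i][j], b[i][j]), buf[i]),
    where buf is broadcast along the column dimension."""
    rows, cols = len(a), len(a[0])
    return [[min(max(a[i][j], b[i][j]), buf[i]) for j in range(cols)]
            for i in range(rows)]
```

For example, `peel_dyn_ref([[1.0, 5.0]], [[2.0, 3.0]], [4.0])` returns `[[2.0, 4.0]]`.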
Compiling the example with `iree-compile --iree-hal-target-backends=llvm-cpu` produces an error:

```
error: 'memref.alloca' op expected no stack allocations without upper bound shapes
```
What Happened
Before LinalgStrategyPeelPass, the example is tiled into:
```mlir
// -----// IR Dump Before LinalgStrategyPeelPass (iree-linalg-strategy-peel-pass) //----- //
func.func @peel_dyn_dispatch_0_generic_DxD() {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.constant.load[0] : i32
  %1 = hal.interface.constant.load[1] : i32
  %2 = hal.interface.constant.load[2] : i32
  %3 = hal.interface.constant.load[3] : i32
  %4 = arith.index_castui %0 : i32 to index
  %5 = arith.index_castui %1 : i32 to index
  %6 = arith.index_castui %2 : i32 to index
  %7 = arith.index_castui %3 : i32 to index
  %8 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7}
  %9 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5}
  %10 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %11 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_y]
  %12 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_y]
  %13 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
  %14 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
  scf.for %arg0 = %11 to %6 step %12 {
    %15 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg0)[%6]
    scf.for %arg1 = %13 to %7 step %14 {
      %16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg1)[%7]
      %17 = flow.dispatch.tensor.load %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %18 = flow.dispatch.tensor.load %8, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %19 = flow.dispatch.tensor.load %9, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5} -> tensor<?x?xf32>
      %dim = tensor.dim %18, %c0 : tensor<?x?xf32>
      %dim_0 = tensor.dim %18, %c1 : tensor<?x?xf32>
      %20 = scf.for %arg2 = %c0 to %dim step %c4 iter_args(%arg3 = %17) -> (tensor<?x?xf32>) {
        %21 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg2)[%dim]
        %22 = tensor.empty(%21) : tensor<?xf32>
        %23 = scf.for %arg4 = %c0 to %dim_0 step %c4 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
          %24 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg4)[%dim_0]
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %25 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %22 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %26 = arith.maxf %in, %in_3 : f32
            %27 = arith.minf %26, %in_4 : f32
            linalg.yield %27 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %25 into %arg5[%arg2, %arg4] [%21, %24] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %23 : tensor<?x?xf32>
      }
      flow.dispatch.tensor.store %20, %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
    }
  }
  return
}
```
After LinalgStrategyPeelPass, the output is:
```mlir
// -----// IR Dump After LinalgStrategyPeelPass (iree-linalg-strategy-peel-pass) //----- //
func.func @peel_dyn_dispatch_0_generic_DxD() {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.constant.load[0] : i32
  %1 = hal.interface.constant.load[1] : i32
  %2 = hal.interface.constant.load[2] : i32
  %3 = hal.interface.constant.load[3] : i32
  %4 = arith.index_castui %0 : i32 to index
  %5 = arith.index_castui %1 : i32 to index
  %6 = arith.index_castui %2 : i32 to index
  %7 = arith.index_castui %3 : i32 to index
  %8 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7}
  %9 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5}
  %10 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) offset(%c0) alignment(64) : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %11 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_y]
  %12 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_y]
  %13 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_id_x]
  %14 = affine.apply affine_map<()[s0] -> (s0 * 64)>()[%workgroup_count_x]
  scf.for %arg0 = %11 to %6 step %12 {
    %15 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg0)[%6]
    scf.for %arg1 = %13 to %7 step %14 {
      %16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 64)>(%arg1)[%7]
      %17 = flow.dispatch.tensor.load %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %18 = flow.dispatch.tensor.load %8, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%6, %7} -> tensor<?x?xf32>
      %19 = flow.dispatch.tensor.load %9, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%4, %5} -> tensor<?x?xf32>
      %dim = tensor.dim %18, %c0 : tensor<?x?xf32>
      %dim_0 = tensor.dim %18, %c1 : tensor<?x?xf32>
      %20 = affine.apply affine_map<()[s0, s1, s2] -> (s1 - (s1 - s0) mod s2)>()[%c0, %dim, %c4]
      %21 = scf.for %arg2 = %c0 to %20 step %c4 iter_args(%arg3 = %17) -> (tensor<?x?xf32>) {
        %23 = tensor.empty(%c4) : tensor<?xf32>
        %24 = affine.apply affine_map<()[s0, s1, s2] -> (s1 - (s1 - s0) mod s2)>()[%c0, %dim_0, %c4]
        %25 = scf.for %arg4 = %c0 to %24 step %c4 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %27 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %23 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs = {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %28 = arith.maxf %in, %in_3 : f32
            %29 = arith.minf %28, %in_4 : f32
            linalg.yield %29 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %27 into %arg5[%arg2, %arg4] [%c4, %c4] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        %26 = scf.for %arg4 = %24 to %dim_0 step %c4 iter_args(%arg5 = %25) -> (tensor<?x?xf32>) {
          %27 = affine.apply affine_map<(d0, d1) -> (-d0 + d1)>(%arg4, %dim_0)
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %28 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %23 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs = {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %29 = arith.maxf %in, %in_3 : f32
            %30 = arith.minf %29, %in_4 : f32
            linalg.yield %30 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %28 into %arg5[%arg2, %arg4] [%c4, %27] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %26 : tensor<?x?xf32>
      }
      %22 = scf.for %arg2 = %20 to %dim step %c4 iter_args(%arg3 = %21) -> (tensor<?x?xf32>) {
        %23 = affine.apply affine_map<(d0, d1) -> (-d0 + d1)>(%arg2, %dim)
        %24 = tensor.empty(%23) : tensor<?xf32>
        %25 = scf.for %arg4 = %c0 to %dim_0 step %c4 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
          %26 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg4)[%dim_0]
          %extracted_slice = tensor.extract_slice %18[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_1 = tensor.extract_slice %19[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %extracted_slice_2 = tensor.extract_slice %arg5[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
          %27 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice, %extracted_slice_1, %24 : tensor<?x?xf32>, tensor<?x?xf32>, tensor<?xf32>) outs(%extracted_slice_2 : tensor<?x?xf32>) attrs = {__internal_linalg_transform__ = "1", lowering_config = #iree_codegen.lowering_config<tile_sizes = [[64, 64], [4, 4], [0, 0]]>} {
          ^bb0(%in: f32, %in_3: f32, %in_4: f32, %out: f32):
            %28 = arith.maxf %in, %in_3 : f32
            %29 = arith.minf %28, %in_4 : f32
            linalg.yield %29 : f32
          } -> tensor<?x?xf32>
          %inserted_slice = tensor.insert_slice %27 into %arg5[%arg2, %arg4] [%23, %26] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
          scf.yield %inserted_slice : tensor<?x?xf32>
        }
        scf.yield %25 : tensor<?x?xf32>
      }
      flow.dispatch.tensor.store %22, %10, offsets = [%arg0, %arg1], sizes = [%15, %16], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32>>{%6, %7}
    }
  }
  return
}
```
Notice that, due to the optimization [1], the size of the tensor allocation in the peeled remainder loop becomes a plain `affine.apply`:

```mlir
%23 = affine.apply affine_map<(d0, d1) -> (-d0 + d1)>(%arg2, %dim)
%24 = tensor.empty(%23) : tensor<?xf32>
```

while originally it was an `affine.min`, whose result is syntactically bounded by the tile size:

```mlir
%21 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 4)>(%arg2)[%dim]
%22 = tensor.empty(%21) : tensor<?xf32>
```

As a result, the later `LLVMCPUCheckIRBeforeLLVMConversion` pass fails to derive an upper bound for the tensor allocation [2].
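The bound-analysis gap can be illustrated with the loop arithmetic involved. Before peeling, every iteration computes its tile size as `min(dim - i, 4)`, which is bounded by 4 just from its syntax. Peeling splits the loop into a main loop whose upper bound is `dim - (dim - lb) mod step` (the affine map `()[s0, s1, s2] -> (s1 - (s1 - s0) mod s2)` above) plus a remainder loop whose tile size is just `dim - i`; the ≤ 4 bound still holds, but now only follows from the loop bounds, not from any `affine.min`. A Python sketch of this transformation, assuming the same lower bound 0 and step 4 as in the IR:

```python
def tile_sizes_before_peel(dim, step=4):
    # One affine.min per iteration: size = min(dim - i, step).
    # An analysis can bound this by `step` from the expression alone.
    return [min(dim - i, step) for i in range(0, dim, step)]

def tile_sizes_after_peel(dim, lb=0, step=4):
    # Peeled main-loop upper bound, mirroring the affine map
    # ()[s0, s1, s2] -> (s1 - (s1 - s0) mod s2).
    main_ub = dim - (dim - lb) % step
    main = [step for _ in range(lb, main_ub, step)]           # constant size
    remainder = [dim - i for i in range(main_ub, dim, step)]  # plain `dim - i`, no min
    return main, remainder
```

With `dim = 10`, `tile_sizes_before_peel(10)` gives `[4, 4, 2]`, and `tile_sizes_after_peel(10)` gives `([4, 4], [2])`: the same sizes, but the remainder size `2` is now expressed as `dim - i` with no syntactic cap for the checker to find.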
About this issue
- Original URL
- State: open
- Created 2 years ago
- Comments: 17 (16 by maintainers)
Commits related to this issue
- Cherry-pick llvm-project commits Also fixes some test cases due to #11196. * e7328a9eb22307d80f86f668a75c2b082ee8636e * be630f07de0c17bade67e6ab6a297db41003775d * 1742882a3496f5a96eb67f665d9c622633f... — committed to matthias-springer/iree by matthias-springer 2 years ago
- Cherry-pick llvm-project commits (#11550) Also fixes some test cases due to #11196. * e7328a9eb22307d80f86f668a75c2b082ee8636e * be630f07de0c17bade67e6ab6a297db41003775d * 1742882a3496f5a96eb67f... — committed to iree-org/iree by matthias-springer 2 years ago
@hanhanW These two changes should remove the alloc + copy: https://reviews.llvm.org/D140007 https://reviews.llvm.org/D140008
All this should actually be factored into the tile size selection, using this limit when deciding which tile sizes / configurations to choose… Ben can probably explain the tradeoffs for the other questions