iree: [CPU] i4 pack op fails to compile

It looks like some memref.subview ops are not optimized away for this i4 pack op, and narrow-type emulation is then applied to them:

#config = #iree_codegen.lowering_config<tile_sizes = [[20000, 16000], [1, 1]]>
#executable_target_system_elf_arm_64_ = #hal.executable.target<"llvm-cpu", "system-elf-arm_64", {cpu = "", cpu_features = "+neon", data_layout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128", link_embedded = false, native_vector_size = 16 : index, target_triple = "aarch64-none-linux-android34", ukernels = "none"}>
#pipeline_layout = #hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer>]>]>
#translation = #iree_codegen.translation_info<CPUDataTiling>
module {
  hal.executable public @pack_i4 {
    hal.executable.variant public @system_elf_arm_64 target(#executable_target_system_elf_arm_64_) {
      hal.executable.export public @pack_i4 ordinal(0) layout(#pipeline_layout) attributes {translation_info = #translation} {
      ^bb0(%arg0: !hal.device):
        %x, %y, %z = flow.dispatch.workgroup_count_from_slice
        hal.return %x, %y, %z : index, index, index
      }
      builtin.module {
        func.func @pack_i4() {
          %c0_i4 = arith.constant 0 : i4
          %c0 = arith.constant 0 : index
          %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<16000x32000xi4>>
          %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<200000x16000x64x1xi4>>
          %2 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [16000, 32000], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<16000x32000xi4>> -> tensor<16000x32000xi4>
          %3 = tensor.empty() : tensor<200000x16000x64x1xi4>
          %pack = tensor.pack %2 padding_value(%c0_i4 : i4) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [64, 1] into %3 {lowering_config = #config} : tensor<16000x32000xi4> -> tensor<200000x16000x64x1xi4>
          flow.dispatch.tensor.store %pack, %1, offsets = [0, 0, 0, 0], sizes = [200000, 16000, 64, 1], strides = [1, 1, 1, 1] : tensor<200000x16000x64x1xi4> -> !flow.dispatch.tensor<writeonly:tensor<200000x16000x64x1xi4>>
          return
        }
      }
    }
  }
}

Error:

iree-compile --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features="+neon" --iree-llvmcpu-target-triple=aarch64-none-linux-android34 --iree-opt-data-tiling=true --iree-llvmcpu-enable-ukernels=none --compile-from=executable-sources repro.mlir

repro.mlir:21:19: error: failed to legalize operation 'memref.subview' that was explicitly marked illegal
          %pack = tensor.pack %2 padding_value(%c0_i4 : i4) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [64, 1] into %3 {lowering_config = #config} : tensor<16000x32000xi4> -> tensor<200000x16000x64x1xi4>
                  ^
repro.mlir:21:19: note: see current operation: %29 = "memref.subview"(%7, %27, %28, %24) <{operandSegmentSizes = array<i32: 1, 2, 1, 0>, static_offsets = array<i64: -9223372036854775808, -9223372036854775808>, static_sizes = array<i64: 1, -9223372036854775808>, static_strides = array<i64: 1, 1>}> : (memref<16000x32000xi4>, index, index, index) -> memref<1x?xi4, strided<[32000, 1], offset: ?>>
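For context on why a standalone i4 subview is problematic: narrow-type emulation rewrites sub-byte memrefs as i8-backed buffers (two i4 values per byte), so every access offset must map to a whole byte. The following Python sketch (a hypothetical helper, not IREE code) models the index mapping such an emulation would need for this subview's element offsets:

```python
# Hypothetical sketch of the i4 -> i8 offset mapping that narrow-type
# emulation relies on. i4 values are stored two per i8 byte, so an
# element offset only maps cleanly to a byte offset when it is even.

def i4_subview_byte_offset(row, col, row_stride=32000):
    """Map a (row, col) element offset in an i4 memref with the given
    row stride (32000 matches the repro's inner dimension) to a byte
    offset in the emulated i8 buffer."""
    elem_offset = row * row_stride + col
    if elem_offset % 2 != 0:
        # The subview would start in the middle of a byte; a plain i8
        # subview cannot represent this starting point.
        raise ValueError(f"element offset {elem_offset} is not byte-aligned")
    return elem_offset // 2

# Offsets produced by this pack op are multiples of the 64-wide inner
# tile, so they are byte-aligned in practice; the compile failure is
# that the standalone memref.subview is marked illegal, not that the
# offsets are actually odd.
print(i4_subview_byte_offset(3, 128))  # 3 * 32000 + 128 = 96128 elements -> byte 48064
```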

IR before the compilation error:

// -----// IR Dump After FoldMemRefAliasOps (fold-memref-alias-ops) //----- //
module {
  func.func @pack_i4() {
    %c1 = arith.constant 1 : index
    %c20000 = arith.constant 20000 : index
    %c16000 = arith.constant 16000 : index
    %c200000 = arith.constant 200000 : index
    %c0_i4 = arith.constant 0 : i4
    %c0 = arith.constant 0 : index
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16000x32000xi4>
    memref.assume_alignment %0, 64 : memref<16000x32000xi4>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : memref<200000x16000x64x1xi4>
    memref.assume_alignment %1, 64 : memref<200000x16000x64x1xi4>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    %workgroup_id_y = hal.interface.workgroup.id[1] : index
    %workgroup_count_y = hal.interface.workgroup.count[1] : index
    %2 = affine.apply affine_map<()[s0] -> (s0 * 20000)>()[%workgroup_id_y]
    %3 = affine.apply affine_map<()[s0] -> (s0 * 20000)>()[%workgroup_count_y]
    %4 = affine.apply affine_map<()[s0] -> (s0 * 16000)>()[%workgroup_id_x]
    %5 = affine.apply affine_map<()[s0] -> (s0 * 16000)>()[%workgroup_count_x]
    cf.br ^bb1(%2 : index)
  ^bb1(%6: index):  // 2 preds: ^bb0, ^bb11
    %7 = arith.cmpi slt, %6, %c200000 : index
    cf.cond_br %7, ^bb2, ^bb12
  ^bb2:  // pred: ^bb1
    cf.br ^bb3(%4 : index)
  ^bb3(%8: index):  // 2 preds: ^bb2, ^bb10
    %9 = arith.cmpi slt, %8, %c16000 : index
    cf.cond_br %9, ^bb4, ^bb11
  ^bb4:  // pred: ^bb3
    cf.br ^bb5(%c0 : index)
  ^bb5(%10: index):  // 2 preds: ^bb4, ^bb9
    %11 = arith.cmpi slt, %10, %c20000 : index
    cf.cond_br %11, ^bb6, ^bb10
  ^bb6:  // pred: ^bb5
    %12 = affine.min affine_map<()[s0, s1] -> (s0 * -64 - s1 * 64 + 32000, 64)>()[%10, %6]
    cf.br ^bb7(%c0 : index)
  ^bb7(%13: index):  // 2 preds: ^bb6, ^bb8
    %14 = arith.cmpi slt, %13, %c16000 : index
    cf.cond_br %14, ^bb8, ^bb9
  ^bb8:  // pred: ^bb7
    %15 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%8, %13]
    %16 = affine.apply affine_map<()[s0, s1] -> (s0 * 64 + s1 * 64)>()[%6, %10]
    %subview = memref.subview %0[%15, %16] [1, %12] [1, 1] : memref<16000x32000xi4> to memref<1x?xi4, strided<[32000, 1], offset: ?>>
    %17 = vector.transfer_read %subview[%c0, %c0], %c0_i4 : memref<1x?xi4, strided<[32000, 1], offset: ?>>, vector<64xi4>
    %18 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%6, %10]
    %19 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%8, %13]
    vector.store %17, %1[%18, %19, %c0, %c0] : memref<200000x16000x64x1xi4>, vector<64xi4>
    %20 = arith.addi %13, %c1 : index
    cf.br ^bb7(%20 : index)
  ^bb9:  // pred: ^bb7
    %21 = arith.addi %10, %c1 : index
    cf.br ^bb5(%21 : index)
  ^bb10:  // pred: ^bb5
    %22 = arith.addi %8, %5 : index
    cf.br ^bb3(%22 : index)
  ^bb11:  // pred: ^bb3
    %23 = arith.addi %6, %3 : index
    cf.br ^bb1(%23 : index)
  ^bb12:  // pred: ^bb1
    return
  }
}

About this issue

  • State: open
  • Created 5 months ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

Thanks for the quick response, appreciate it!

@dcaballe why do we have memref.collapse_shape? Did you enable vector flatten in your local build?

Yes, I’m experimenting with this here: #16456. You may have also seen some fixes/extensions to that pass. More to come!

Regarding the emulation discussion, I’m not sure I totally understand. Why does everything have to be folded into the transfer read/write ops? Memref subview and collapse ops are only index computations, so it should be easy to add emulation for them directly. As we discussed in llvm/llvm-project#80517, it’s unlikely we can fold all the memref subviews into their transfer read/write op consumers, so we may need direct emulation support for them. Multi-dimensional support should be OK as long as the number of elements in the contiguous dimension multiplied by their bitwidth is a multiple of 1 byte. Am I missing something?

That is a very restricted case and can only support static shapes, so I don’t count it as something that can be “supported”. It will force you to walk a very tight path. Moreover, I think it is unnecessary: if you cannot fold a memref.subview with the load/store, then there are going to be other issues anyway. So adding type propagation for subviews is treating a symptom, not the cause (and is unnecessarily complicated). https://github.com/llvm/llvm-project/pull/80517 is really pointing at a pitfall in vector.transfer_read semantics, so again, that’s a symptom, not the root cause.

Thanks! I gave that a try: https://github.com/llvm/llvm-project/pull/80517. It fixes this issue.
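For reference, the alignment condition mentioned in the emulation discussion above (the number of elements in the contiguous dimension times their bitwidth must be a multiple of one byte) can be sketched as a small check; the function name and signature are illustrative, not an actual API:

```python
# Sketch of the byte-multiple condition for direct narrow-type
# emulation of a multi-dimensional memref, as stated in the thread.

def emulation_supported(contiguous_dim_size, bitwidth, byte_bits=8):
    """True if the contiguous dimension's total bit count is a whole
    number of bytes, so rows never straddle a byte boundary."""
    return (contiguous_dim_size * bitwidth) % byte_bits == 0

print(emulation_supported(64, 4))  # the 64-wide i4 inner tile: 256 bits -> True
print(emulation_supported(3, 4))   # 12 bits, not a whole byte count -> False
```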