iree: Segmentation fault running OPT-1.3b f32 with data tiling enabled

What happened?

I am continuing the investigation from https://github.com/nod-ai/SHARK/issues/1589 into the segmentation faults that occur when running OPT-1.3b at f32 precision with --iree-flow-enable-data-tiling.

I am having some trouble making a dispatch-level reproducer but will share the smallest reproduction as well as relevant IR.

I compiled with --iree-flow-break-dispatch=@forward:24 and the data-tiling flag, and the resulting .vmfb runs successfully through iree-benchmark-module. With --iree-flow-break-dispatch=@forward:25, the .vmfb segfaults in iree-benchmark-module.

Full reproduction steps are given below for this case, and here is a download link to the full IR dump after iree-flow-outline-dispatch-regions.

From the above IR dump I’ve isolated the func.func from dispatch 25:

builtin.module {
  func.func @forward_dispatch_25_generic_128x2048_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<128x2048xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<2048xf32>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 2.048000e+03 : f32
    %cst_1 = arith.constant 9.99999974E-6 : f32
    %0 = flow.dispatch.tensor.load %arg0, offsets = [0, 0], sizes = [128, 2048], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<128x2048xf32>> -> tensor<128x2048xf32>
    %1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [2048], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2048xf32>> -> tensor<2048xf32>
    %2 = tensor.empty() : tensor<128x2048xf32>
    %3 = tensor.empty() : tensor<128xf32>
    %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<128xf32>) -> tensor<128xf32>
    %5 = linalg.generic {indexing_maps = [#map4, #map5], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<128x2048xf32>) outs(%4 : tensor<128xf32>) {
    ^bb0(%in: f32, %out: f32):
      %8 = arith.mulf %in, %in : f32
      %9 = arith.addf %8, %out : f32
      linalg.yield %9 : f32
    } -> tensor<128xf32>
    %6 = linalg.generic {indexing_maps = [#map4, #map5, #map6, #map4], iterator_types = ["parallel", "parallel"]} ins(%0, %5, %1 : tensor<128x2048xf32>, tensor<128xf32>, tensor<2048xf32>) outs(%2 : tensor<128x2048xf32>) {
    ^bb0(%in: f32, %in_2: f32, %in_3: f32, %out: f32):
      %8 = arith.divf %in_2, %cst_0 : f32
      %9 = arith.addf %8, %cst_1 : f32
      %10 = math.rsqrt %9 : f32
      %11 = arith.mulf %in, %10 : f32
      %12 = arith.addf %11, %in_3 : f32
      linalg.yield %12 : f32
    } -> tensor<128x2048xf32>
    %7 = iree_linalg_ext.set_encoding %6 : tensor<128x2048xf32> -> tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>>
    flow.dispatch.tensor.store %7, %arg2, offsets = [0, 0], sizes = [128, 2048], strides = [1, 1] : tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>> -> !flow.dispatch.tensor<writeonly:tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>>>
    return
  }
}
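For reference, here is a minimal NumPy sketch of what this dispatch computes, read directly off the IR above (an RMS-style normalization with a bias add). The function name and argument names are mine, not part of IREE; the divisor 2048 and the epsilon come from %cst_0 and %cst_1:

```python
import numpy as np

# Sketch of the body of forward_dispatch_25, read off the IR above.
# %5: row-wise sum of squares; %8..%10: divide by 2048, add eps, rsqrt;
# %11, %12: scale each row and add the broadcast bias.
def dispatch_25_reference(x, bias, eps=9.99999974e-6):
    # x: (128, 2048) f32, bias: (2048,) f32
    ssq = np.sum(x * x, axis=1)                # %5: reduction over dim 1
    scale = 1.0 / np.sqrt(ssq / 2048.0 + eps)  # %8..%10
    return x * scale[:, None] + bias           # %11, %12
```

The math itself is ordinary; what ties this dispatch to data tiling is the iree_linalg_ext.set_encoding on the result (%7), which tags the output tensor as a matmul LHS operand.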

Steps to reproduce your issue

  1. Download opt-1_3b_causallm_128_torch.mlir
  2. Run :
iree-compile ./opt-1_3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-flow-break-dispatch=@forward:25 --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-stack-allocation-limit=131072 --iree-llvmcpu-enable-microkernels -o opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb
  3. Run :
iree-benchmark-module --module=opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=16 --device=local-task

What component(s) does this issue relate to?

No response

Version information

6e499159b393846441cc28b30a6791dc46221ec4

Additional context

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (12 by maintainers)

Most upvoted comments

Good news time - this is fixed by https://github.com/openxla/iree/pull/14349.

This is actually a compiler bug.

The iree-compile command line triggers an assertion failure. Because assertions are disabled in release builds, the compiler silently continued with broken logic and produced a faulty bytecode module, which is what crashed at runtime in the iree-benchmark-module command line. The root cause is the compiler bug, and it shows up as this assertion failure in an iree-compile build with assertions enabled:

iree-compile: iree/compiler/src/iree/compiler/Dialect/HAL/IR/HALTypes.cpp:139: std::optional<int32_t> mlir::iree_compiler::IREE::HAL::getEncodingTypeValue(mlir::Attribute): Assertion `!attr && "encoding types other than default not yet supported"' failed.
Please report issues to https://github.com/openxla/iree/issues and include the crash backtrace.
Stack dump:
0.      Program arguments: /usr/local/google/home/benoitjacob/iree-build-linux/tools/iree-compile ./opt-1_3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-flow-break-dispatch=@forward:25 --iree-llvmcpu-stack-allocation-limit=131072 --iree-llvmcpu-enable-microkernels -o opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb
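The failure mode here (an invariant checked only by an assert, which release builds compile out) can be illustrated with a hypothetical stand-in. This is not the actual IREE code; only the assertion message is taken from the crash above:

```python
# Hypothetical stand-in for the failing check: with assertions enabled,
# a non-default encoding aborts immediately; with assertions stripped
# (think `python -O`, or an NDEBUG C++ release build), the function
# silently returns a default that is wrong for non-default input.
def get_encoding_type_value(attr):
    assert attr is None, "encoding types other than default not yet supported"
    return 0  # default encoding type; incorrect if `attr` was non-default
```

Run with assertions stripped, the bad call returns 0 instead of aborting, which mirrors how the release iree-compile silently produced a faulty .vmfb.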

Since the assertion failure is about exactly the kind of thing that https://github.com/openxla/iree/pull/14349 refactors, I gave it a try, and it resolves the problem: the iree-compile command succeeds and the resulting bytecode module runs fine in iree-benchmark-module.

Note - for running locally on my machine I had to drop the --iree-llvmcpu-target-cpu=cascadelake flag.

(EDIT - moving that performance discussion back to https://github.com/nod-ai/SHARK/issues/1589#issuecomment-1640677804 )

@monorimet - Current benchmark results give a flavor of the performance to expect. Note - testing on an Intel Skylake Xeon CPU with AVX-512, compiling with --iree-llvmcpu-target-cpu=skylake-avx512. Command lines as in the original issue description above.

  • Without data-tiling and ukernels: 515 ms
  • With data-tiling but not ukernels: 72 ms
  • With data-tiling and ukernels: 3100 ms

So, data-tiling alone is a ~7x speedup. Ukernels alone are not yet good, but I'll get to that now, and they will end up at least as fast as non-ukernels and in some cases faster. What's almost certainly happening here is that this particular model is f32, and f32 matmuls on ISAs like AVX-512 are exactly what default codegen is good at. As soon as we depart from that, e.g. to f16, things get more challenging for default codegen and the ukernels become more of a win.
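As a quick check, the ratios implied by the timings quoted above:

```python
# Benchmark latencies quoted above, in milliseconds.
baseline_ms = 515      # no data-tiling, no ukernels
data_tiling_ms = 72    # data-tiling only
ukernels_ms = 3100     # data-tiling + ukernels (not yet tuned)

print(round(baseline_ms / data_tiling_ms, 1))  # data-tiling speedup: 7.2
print(round(ukernels_ms / data_tiling_ms))     # ukernel slowdown factor: 43
```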

Filed https://github.com/openxla/iree/issues/14406 with a minimized testcase. It does look related to fusions: it only triggers when sufficiently many of these linalg.generic ops are chained, which prevented further minimization of the testcase.

Confirmed that the updated #14349 avoids the issue from https://github.com/openxla/iree/issues/14398#issuecomment-1635196805 (independently of @hanhanW 's fix to the underlying problem). Still debugging some apparent compile-time regression before I merge, but this should be unblocked.

OK. I will wait on the perf results. I happened upon an issue with sequence length 8: with the latest flags and #14349, compilation fails with ./opt-1_3b_causallm_8_torch.mlir:865:12: error: 'memref.alloca' op all stack allocations need to be hoisted to the entry block of the function.

I’m sure the M < 16 here is causing it to take a different path. We can prioritize the functional path and getting AVX-512 working first, but it seems #14349 was intended (in part) to address the narrow-matmul cases.

To reproduce (with build on changes from #14349):

  1. Download opt-1_3b_causallm_8_torch.mlir
  2. Run:
iree-compile ./opt-1_3b_causallm_8_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=+avx,+avx2,+fma --iree-llvmcpu-target-cpu=haswell --iree-llvmcpu-stack-allocation-limit=140000 --iree-flow-enable-data-tiling --iree-llvmcpu-enable-microkernels -o opt_1-3b_causallm_8_torch_cpu.vmfb

I’m here to help with debugging if you need an extra pair of eyes or hands. I’ll be testing cases to see if I can help narrow down either of these issues unless you need my efforts pointed elsewhere.

edit: we can file a separate issue for the M < 16 case and address it once AVX-512 and data-tiling are playing nice.

Yes, this is just a debugging step. No need to look into perf results now; we don’t want to forgo AVX-512 if the target is AVX-512-capable. So now we know that there are two separate issues here:

  1. A compiler issue, which I was able to reproduce, and which is fixed by https://github.com/openxla/iree/pull/14349.
  2. A runtime issue, which is specific to AVX-512 (the difference between passing target-cpu=cascadelake and not passing it, for an f32 model). I also have an AVX-512 machine, so I’ll try reproducing and debugging that there.