iree: Segmentation fault running OPT1.3b f32 with data tiling enabled
What happened?
I am continuing the investigation from https://github.com/nod-ai/SHARK/issues/1589 into the segmentation faults that occur when trying to run OPT-1.3b at f32 precision with `--iree-flow-enable-data-tiling`.
I am having some trouble making a dispatch-level reproducer, but I will share the smallest reproduction as well as the relevant IR. When I compile with `--iree-flow-break-dispatch=@forward:24` and the data tiling flag, the resulting `.vmfb` runs successfully through `iree-benchmark-module`. With `--iree-flow-break-dispatch=@forward:25`, the `.vmfb` segfaults in `iree-benchmark-module`. Full reproduction steps for this case are given below, and here is a download link to the full IR dump after `iree-flow-outline-dispatch-regions`.
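For context, a minimal sketch of how the working/crashing boundary between dispatch 24 and 25 can be swept. The helper and driver names are my own illustration, not part of the original report; it only builds the `iree-compile` argv for a given `--iree-flow-break-dispatch` index, using flags that appear in the reproduction steps below.

```python
def break_dispatch_cmd(index: int,
                       mlir_path: str = "./opt-1_3b_causallm_128_torch.mlir") -> list[str]:
    """Build an iree-compile command that truncates the program after the
    given dispatch. Paths and the surrounding sweep loop are assumptions."""
    return [
        "iree-compile",
        mlir_path,
        "--iree-hal-target-backends=llvm-cpu",
        "--iree-flow-enable-data-tiling",
        f"--iree-flow-break-dispatch=@forward:{index}",
        "-o", f"break_{index}.vmfb",
    ]

# The last passing / first failing pair from the text above:
good = break_dispatch_cmd(24)
bad = break_dispatch_cmd(25)
```

Each resulting `.vmfb` would then be run through `iree-benchmark-module` to find the first dispatch whose inclusion triggers the segfault.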
From the above IR dump I’ve isolated the func.func from dispatch 25:
```mlir
builtin.module {
  func.func @forward_dispatch_25_generic_128x2048_f32(%arg0: !flow.dispatch.tensor<readonly:tensor<128x2048xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<2048xf32>>, %arg2: !flow.dispatch.tensor<writeonly:tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %cst_0 = arith.constant 2.048000e+03 : f32
    %cst_1 = arith.constant 9.99999974E-6 : f32
    %0 = flow.dispatch.tensor.load %arg0, offsets = [0, 0], sizes = [128, 2048], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<128x2048xf32>> -> tensor<128x2048xf32>
    %1 = flow.dispatch.tensor.load %arg1, offsets = [0], sizes = [2048], strides = [1] : !flow.dispatch.tensor<readonly:tensor<2048xf32>> -> tensor<2048xf32>
    %2 = tensor.empty() : tensor<128x2048xf32>
    %3 = tensor.empty() : tensor<128xf32>
    %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<128xf32>) -> tensor<128xf32>
    %5 = linalg.generic {indexing_maps = [#map4, #map5], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<128x2048xf32>) outs(%4 : tensor<128xf32>) {
    ^bb0(%in: f32, %out: f32):
      %8 = arith.mulf %in, %in : f32
      %9 = arith.addf %8, %out : f32
      linalg.yield %9 : f32
    } -> tensor<128xf32>
    %6 = linalg.generic {indexing_maps = [#map4, #map5, #map6, #map4], iterator_types = ["parallel", "parallel"]} ins(%0, %5, %1 : tensor<128x2048xf32>, tensor<128xf32>, tensor<2048xf32>) outs(%2 : tensor<128x2048xf32>) {
    ^bb0(%in: f32, %in_2: f32, %in_3: f32, %out: f32):
      %8 = arith.divf %in_2, %cst_0 : f32
      %9 = arith.addf %8, %cst_1 : f32
      %10 = math.rsqrt %9 : f32
      %11 = arith.mulf %in, %10 : f32
      %12 = arith.addf %11, %in_3 : f32
      linalg.yield %12 : f32
    } -> tensor<128x2048xf32>
    %7 = iree_linalg_ext.set_encoding %6 : tensor<128x2048xf32> -> tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>>
    flow.dispatch.tensor.store %7, %arg2, offsets = [0, 0], sizes = [128, 2048], strides = [1, 1] : tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>> -> !flow.dispatch.tensor<writeonly:tensor<128x2048xf32, #iree_linalg_ext.encoding<MATMUL_F32F32F32_LHS>>>
    return
  }
}
```
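For readability, here is a NumPy sketch of what the two `linalg.generic`s above compute before the `set_encoding`: a per-row RMS normalization with a broadcast bias. The function and variable names are mine, not from the issue; `eps` matches the `%cst_1` constant (`9.99999974E-6`, the f32 rounding of 1e-5) and the `2048.0` divisor matches `%cst_0`.

```python
import numpy as np

def dispatch_25_reference(x: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """NumPy model of forward_dispatch_25 (up to, but not including,
    iree_linalg_ext.set_encoding). x: (128, 2048) f32, bias: (2048,) f32."""
    eps = np.float32(9.99999974e-6)                 # %cst_1
    # First linalg.generic (%5): per-row sum of squares over the 2048 dim.
    sumsq = np.sum(x * x, axis=1, keepdims=True)
    # Second linalg.generic (%6): divide by 2048.0 (%cst_0), add eps,
    # rsqrt, scale each element, then add the broadcast bias.
    scale = 1.0 / np.sqrt(sumsq / np.float32(2048.0) + eps)
    return (x * scale + bias).astype(np.float32)

x = np.random.default_rng(0).standard_normal((128, 2048)).astype(np.float32)
b = np.zeros(2048, dtype=np.float32)
y = dispatch_25_reference(x, b)
# With a zero bias, each row of y has RMS close to 1.
```

This makes it easier to see that the dispatch itself is a plain normalization; the encoding on the stored result is what ties it to the data-tiling path.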
Steps to reproduce your issue
- Download `opt-1_3b_causallm_128_torch.mlir`
- Run:
```shell
iree-compile ./opt-1_3b_causallm_128_torch.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  --iree-flow-enable-data-tiling \
  --iree-flow-break-dispatch=@forward:25 \
  --iree-llvmcpu-target-cpu=cascadelake \
  --iree-llvmcpu-stack-allocation-limit=131072 \
  --iree-llvmcpu-enable-microkernels \
  -o opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb
```
- Run:
```shell
iree-benchmark-module \
  --module=opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb \
  --function="forward" \
  --input=1x128xi64 \
  --input=1x128xi64 \
  --benchmark_repetitions=10 \
  --task_topology_max_group_count=16 \
  --device=local-task
```
What component(s) does this issue relate to?
No response
Version information
6e499159b393846441cc28b30a6791dc46221ec4
Additional context
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 17 (12 by maintainers)
Commits related to this issue
- data-tiling: introduce `upper_bound_tile_size` op to defer padding-size choice to MaterializeEncoding. (#14349) This fixes #11632, by introducing a materializable `upper_bound_tile_size` instead of... — committed to iree-org/iree by bjacob a year ago
- data-tiling: introduce `upper_bound_tile_size` op to defer padding-size choice to MaterializeEncoding. (#14349) This fixes #11632, by introducing a materializable `upper_bound_tile_size` instead of... — committed to plaidml/iree by bjacob a year ago
Good news time - this is fixed by https://github.com/openxla/iree/pull/14349.
This is actually a compiler bug.
The `iree-compile` command line causes an assertion failure. Because assertions are disabled in release builds, the compiler was continuing silently with broken logic, producing a faulty bytecode module and causing that runtime crash in the `iree-benchmark-module` command line. But the root cause is the compiler bug, and it shows up as this assertion failure in an `iree-compile` build with assertions enabled:
Since the assertion failure is about exactly the kind of thing that https://github.com/openxla/iree/pull/14349 is refactoring, I gave it a try, and it does succeed: the `iree-compile` command succeeds and the resulting bytecode module runs fine in `iree-benchmark-module`.
Note: for running locally on my machine I had to drop the `--iree-llvmcpu-target-cpu=cascadelake` flag.
(EDIT: moving that performance discussion back to https://github.com/nod-ai/SHARK/issues/1589#issuecomment-1640677804)
@monorimet - Current benchmark results give a flavor of the performance to expect. Note: testing on an Intel Skylake Xeon CPU with AVX-512, compiling with `--iree-llvmcpu-target-cpu=skylake-avx512`, with command lines as in the original description above.
So, data-tiling alone is a ~8x speedup. Ukernels alone are not yet good, but I'll get to that now, and they will be at least as fast as non-ukernels and in some cases faster. What's almost certainly happening here is that this particular model is `f32`, and `f32` matmuls on ISAs like AVX-512 are what default codegen is good at. As soon as we depart from that, e.g. `f16`, things are more challenging for default codegen and the ukernels become more of a win.
Filed https://github.com/openxla/iree/issues/14406 with a minimized testcase. It does look related to fusions, as it only triggers when sufficiently many of these `linalg.generic`s are chained, which prevents further minimization of the testcase.
Confirmed that the updated #14349 avoids the issue from https://github.com/openxla/iree/issues/14398#issuecomment-1635196805 (independently of @hanhanW's fix to the underlying problem). Still debugging some apparent compile-time regression before I merge, but this should be unblocked.
OK, I will wait on the perf results. I happened upon an issue with sequence length 8, which, with the latest flags and #14349, gives a compile-time error:
```
./opt-1_3b_causallm_8_torch.mlir:865:12: error: 'memref.alloca' op all stack allocations need to be hoisted to the entry block of the function
```
I'm sure the `M < 16` here is causing it to take a different path. We can prioritize the functional path and getting avx512 working first, but it seems #14349 was intended (in part) to address the narrow matmul cases.
To reproduce (with a build on the changes from #14349):
I’m here to help with debugging if you need an extra pair of eyes or hands. I’ll be testing cases to see if I can help narrow down either of these issues unless you need my efforts pointed elsewhere.
edit: the `M < 16` case we can file a separate issue for and address once avx512 and data-tiling are playing nicely.
Yes, this is just a debugging step. No need to look into perf results now; we don't want to forego AVX512 if the target is AVX512-capable. So now we know that there are two separate issues there (with `target-cpu=cascadelake` and without it, for a `f32` model). I also have an AVX512 machine so I'll try reproducing and debugging that there.