iree: [CUDA] Reduction extremely slow
What happened?
The reduction below takes ~120ms on my A100 to reduce an 833x833xf32 tensor, which seems off by more than 1000x (an 800x800x800 matmul runs in microseconds on this chip). I tried size 832 to see whether alignment was the problem, but that did not significantly change the runtime, so I suspect something more fundamental is going on.
Repro with:
iree-compile --iree-hal-target-backends=cuda --iree-input-type=mhlo --iree-hal-cuda-llvm-target-arch=sm_80 trace_inner.mlir -o trace_inner.vmfb
iree-benchmark-module --function=main --module=trace_inner.vmfb --device=cuda --input=833xi32=0 --input=833x833xf32=0 --input=f32=0
#map = affine_map<(d0, d1) -> (d0)>
#map1 = affine_map<(d0, d1) -> (d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> ()>
module {
  func.func @main(%arg0: tensor<833xi32>, %arg1: tensor<833x833xf32>, %arg2: tensor<f32>) -> tensor<f32> {
    %cst = arith.constant 5.66893432E-4 : f32
    %0 = tensor.empty() : tensor<f32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<f32>) -> tensor<f32>
    %2 = linalg.generic {indexing_maps = [#map, #map1, #map2, #map3, #map3], iterator_types = ["reduction", "reduction"]} ins(%arg0, %arg0, %arg1, %arg2 : tensor<833xi32>, tensor<833xi32>, tensor<833x833xf32>, tensor<f32>) outs(%1 : tensor<f32>) {
    ^bb0(%in: i32, %in_0: i32, %in_1: f32, %in_2: f32, %out: f32):
      %3 = arith.divf %in_1, %in_2 : f32
      %4 = arith.cmpi eq, %in, %in_0 : i32
      %5 = arith.select %4, %3, %cst : f32
      %6 = arith.addf %out, %5 : f32
      linalg.yield %6 : f32
    } -> tensor<f32>
    return %2 : tensor<f32>
  }
}
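For readers not fluent in linalg, here is a NumPy model of what the generic computes; this is a sketch I wrote for this report, not IREE code. Both reduction iterators sweep (d0, d1), the two i32 operands read %arg0 at d0 and d1 respectively, and the accumulator starts from the filled constant:

```python
import numpy as np

CST = np.float32(5.66893432e-4)  # the arith.constant from the IR

def reference_reduction(a0, a1, a2):
    """NumPy model of the double reduction in the linalg.generic above."""
    # eq[i, j] is true where a0[i] == a0[j] (the arith.cmpi on the two
    # projections of %arg0 via #map and #map1).
    eq = a0[:, None] == a0[None, :]
    # arith.select between the division and the constant.
    vals = np.where(eq, a1 / a2, CST)
    # The accumulator starts at CST because of the linalg.fill.
    return CST + vals.sum(dtype=np.float32)
```

Note that the benchmark invocation above feeds a zero divisor, so the measured kernel is numerically degenerate; the performance problem is independent of the values.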
Steps to reproduce your issue
See above
What component(s) does this issue relate to?
Compiler
Version information
iree @ ab37989652aed11f7f46498c09b9ac515c83eaa3
Additional context
No response
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 35 (25 by maintainers)
Commits related to this issue
- Move collapse of reduction dimension earlier in the Flow pipeline. Collapsing multiple reduction dimensions into a single reduction dimension early in the pipeline will make sure we dont fold into th... — committed to MaheshRavishankar/iree by deleted user a year ago
- Move collapse of reduction dimension earlier in the Flow pipeline. (#13730) Collapsing multiple reduction dimensions into a single reduction dimension early in the pipeline will make sure we dont fol... — committed to iree-org/iree by MaheshRavishankar a year ago
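The commits above collapse multiple reduction dimensions into a single one early in the Flow pipeline, so the backend sees one large 1-D reduction instead of two nested ones. A minimal NumPy sketch of the equivalence being exploited (function names are mine, not IREE's):

```python
import numpy as np

def reduce_2d(x):
    # Two nested reduction dimensions, as the original dispatch sees them.
    return x.sum(axis=(0, 1))

def reduce_collapsed(x):
    # Collapsing (d0, d1) into one reduction dimension of size d0*d1
    # yields the same result, but as a single flat reduction that the
    # GPU tiling and warp-reduction strategies handle well.
    return x.reshape(-1).sum(axis=0)
```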
Ok, the intuition actually held: this reduces to #13115 once we avoid excessive fusion + reshape. Thanks for investigating!
Can we actually close this bug completely (@silvasean)?
We either need a new transform as I described, or alternatively we could make the lowering from xxHLO friendlier by lowering to the form I laid out.
This will need a new owner, as this is not covered by the same transform that I was thinking about originally.
More analysis:
So, we have a mixed situation. I would approach it this way.
@julianwa please prioritize the issue and assign it to someone. Supporting unaligned reductions may need a substantial code change.
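One standard way to handle an unaligned reduction size like 833 is to pad the reduction dimension to a multiple of the vector width with the identity element, so the vectorized loop needs no scalar epilogue. This is a hypothetical NumPy illustration of the idea, not the transform IREE implements:

```python
import numpy as np

def padded_sum(x, vec_width=4):
    """Pad a 1-D additive reduction up to a multiple of vec_width with
    zeros (the identity of addition), then reduce in full vectors."""
    pad = (-x.shape[0]) % vec_width
    xp = np.concatenate([x, np.zeros(pad, dtype=x.dtype)])
    # Every row now has exactly vec_width elements; no remainder loop.
    return xp.reshape(-1, vec_width).sum()
```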