iree: [CUDA] Reduction extremely slow

What happened?

The reduction below takes ~120ms on my A100 to reduce a tensor of size 833x833xf32, which seems off by over 1000x (an 800x800x800 matmul is measured in microseconds on this chip). I tried making the size 832 to see whether alignment was the problem, but that did not significantly change the runtime, so I think something more fundamental is going on.
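
As a rough sanity check (my own estimate, assuming ~1.5 TB/s of effective HBM bandwidth on the A100): the reduction reads 833 x 833 x 4 B ≈ 2.8 MB, so a bandwidth-bound kernel should finish in roughly 2 µs; 120 ms is about four to five orders of magnitude above that.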

Repro with:

iree-compile --iree-hal-target-backends=cuda --iree-input-type=mhlo --iree-hal-cuda-llvm-target-arch=sm_80 trace_inner.mlir -o trace_inner.vmfb
iree-benchmark-module --function=main --module=trace_inner.vmfb --device=cuda --input=833xi32=0 --input=833x833xf32=0 --input=f32=0
#map = affine_map<(d0, d1) -> (d0)>
#map1 = affine_map<(d0, d1) -> (d1)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> ()>
module {
  func.func @main(%arg0: tensor<833xi32>, %arg1: tensor<833x833xf32>, %arg2: tensor<f32>) -> tensor<f32> {
    %cst = arith.constant 5.66893432E-4 : f32
    %0 = tensor.empty() : tensor<f32>
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<f32>) -> tensor<f32>
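    // Accumulates arg1[i, j] / arg2 when arg0[i] == arg0[j] (the constant otherwise) into a single f32.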
    %2 = linalg.generic {indexing_maps = [#map, #map1, #map2, #map3, #map3], iterator_types = ["reduction", "reduction"]} ins(%arg0, %arg0, %arg1, %arg2 : tensor<833xi32>, tensor<833xi32>, tensor<833x833xf32>, tensor<f32>) outs(%1 : tensor<f32>) {
    ^bb0(%in: i32, %in_0: i32, %in_1: f32, %in_2: f32, %out: f32):
      %3 = arith.divf %in_1, %in_2 : f32
      %4 = arith.cmpi eq, %in, %in_0 : i32
      %5 = arith.select %4, %3, %cst : f32
      %6 = arith.addf %out, %5 : f32
      linalg.yield %6 : f32
    } -> tensor<f32>
    return %2 : tensor<f32>
  }
}

Steps to reproduce your issue

See above

What component(s) does this issue relate to?

Compiler

Version information

iree @ ab37989652aed11f7f46498c09b9ac515c83eaa3

Additional context

No response

About this issue

  • State: closed
  • Created a year ago
  • Comments: 35 (25 by maintainers)

Most upvoted comments

With #13730, this addresses the comment #13285 (comment). Now the issue comes down to a single large reduction dimension, which makes it the same as #13115.

I am moving this back to Nicolas, since the blocker preventing this from being addressed through #13115 is now removed.

OK, the intuition held: we can reduce this to #13115 by avoiding too much fusion + reshape. Thanks for investigating!

Can we actually close this bug completely (@silvasean)?

We either need a new transform, as I described; alternatively, we could make the lowering from xxHLO more friendly by lowering to the form I laid out.

This will need a new owner, as this is not covered by the same transform that I was thinking about originally.

More analysis:

  1. Making the tensor aligned (832x832) does not help generate parallel code.
  2. Removing the elementwise (EW) ops and keeping only a single reduction (833x833) does not help generate parallel code.
  3. A single op with an aligned reduction does generate parallel code.

So we have a mixed situation. I would approach it this way:

  1. Support the unaligned reduction case.
  2. Change the fusion so that an EW producer is not fused with a reduction consumer (#13308 may help), or make the reduction codegen path more robust to support this case; see the unfused sketch below this list.
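
For reference, here is a minimal sketch of what such an unfused form could look like: the elementwise select/div stays in its own parallel linalg.generic that materializes an 833x833 intermediate, and a plain sum reduction consumes it. This is my own illustrative rewrite of the repro above (the @main_split name and map aliases are made up), not necessarily the exact form discussed earlier in the thread:

#row = affine_map<(d0, d1) -> (d0)>
#col = affine_map<(d0, d1) -> (d1)>
#id = affine_map<(d0, d1) -> (d0, d1)>
#scalar = affine_map<(d0, d1) -> ()>

func.func @main_split(%arg0: tensor<833xi32>, %arg1: tensor<833x833xf32>, %arg2: tensor<f32>) -> tensor<f32> {
  %cst = arith.constant 5.66893432E-4 : f32
  // Elementwise producer: arg1[i, j] / arg2 where arg0[i] == arg0[j], else the constant.
  %empty2d = tensor.empty() : tensor<833x833xf32>
  %ew = linalg.generic {indexing_maps = [#row, #col, #id, #scalar, #id], iterator_types = ["parallel", "parallel"]} ins(%arg0, %arg0, %arg1, %arg2 : tensor<833xi32>, tensor<833xi32>, tensor<833x833xf32>, tensor<f32>) outs(%empty2d : tensor<833x833xf32>) {
  ^bb0(%in: i32, %in_0: i32, %in_1: f32, %in_2: f32, %out: f32):
    %div = arith.divf %in_1, %in_2 : f32
    %eq = arith.cmpi eq, %in, %in_0 : i32
    %sel = arith.select %eq, %div, %cst : f32
    linalg.yield %sel : f32
  } -> tensor<833x833xf32>
  // Plain 2-D sum reduction, initialized with the same constant as the fused version.
  %empty0d = tensor.empty() : tensor<f32>
  %init = linalg.fill ins(%cst : f32) outs(%empty0d : tensor<f32>) -> tensor<f32>
  %sum = linalg.generic {indexing_maps = [#id, #scalar], iterator_types = ["reduction", "reduction"]} ins(%ew : tensor<833x833xf32>) outs(%init : tensor<f32>) {
  ^bb0(%in: f32, %out: f32):
    %acc = arith.addf %out, %in : f32
    linalg.yield %acc : f32
  } -> tensor<f32>
  return %sum : tensor<f32>
}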

@julianwa, please prioritize this issue and assign it to someone. Supporting unaligned reductions may require a substantial code change.