taichi: IR optimization should not eliminate stmts across the offloaded task boundaries?

Describe the bug

I noticed that the latest IR passes can eliminate constants across offloaded task boundaries. This unfortunately broke the Metal backend. (Note that this is not broken at HEAD; it only broke my local work, where I'm replacing Metal's own passes with compile_to_offloads().)

On Metal, each offloaded task is generated as a separate kernel function, and a ConstStmt is simply translated to const auto {stmt->name()} = {stmt->val.stringify()};.

I wonder if this optimization is only done for constants? If so, maybe I can predefine all the ConstStmts as global constants in the generated source code. But if it is also applied to non-const statements, then it's probably a real bug. Also, it doesn't break the LLVM backends. Is this because LLVM inlines the constants directly before producing the machine code?
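
To make the failure mode concrete, here is roughly what the generated Metal source looks like for the broken IR below. This is only an illustrative sketch; the function and buffer names are made up and the real codegen differs in details:

// First offloaded task: the ConstStmt $2 becomes a function-local constant.
kernel void task_0(device char *global_tmps [[buffer(0)]]) {
  const auto tmp2 = 0;  // $2 = const [0]
  *reinterpret_cast<device int *>(global_tmps + 0) = tmp2;  // $3 / $5
}

// Third offloaded task: after the optimization the IR still refers to $2,
// but tmp2 only exists inside task_0, so the generated source fails to
// compile ("use of undeclared identifier").
kernel void task_2(device char *global_tmps [[buffer(0)]]) {
  // ... ::lookup(root, tmp2) ...  <-- tmp2 is not visible here
}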

Log/Screenshots

Detected from test_local_atomics.py::test_implicit_local_atomic_or

Good

kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]  <--
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <i32 x1> $16 = const [0]  <-- OK
    <gen*x1> $17 = [S0root][root]::lookup($15, $16) activate = false
    <gen*x1> $18 = get child [S0root->S1dense] $17
    <gen*x1> $19 = [S1dense][dense]::lookup($18, $16) activate = false
    <i32*x1> $20 = get child [S1dense->S2place_i32] $19
    <i32*x1> $21 : global store [$20 <- $14]
  }
}

Broken

  • Note that $2 in the first offloaded task is referenced in the third one.
kernel {
  $0 = offloaded  {
    <i32*x1> $1 = global tmp var (offset = 0 B)
    <i32 x1> $2 = const [0]  <--
    <i32*x1> $3 : global store [$1 <- $2]
    <i32*x1> $4 = global tmp var (offset = 0 B)
    <i32*x1> $5 : global store [$4 <- $2]
  }
  $6 = offloaded range_for(0, 10) block_dim=adaptive {
    <i32 x1> $7 = const [2]
    <i32 x1> $8 = loop index 0
    <i32 x1> $9 = pow $7 $8
    <i32*x1> $10 = global tmp var (offset = 0 B)
    <i32 x1> $11 = atomic bit_or($10, $9)
  }
  $12 = offloaded  {
    <i32*x1> $13 = global tmp var (offset = 0 B)
    <i32 x1> $14 = global load $13
    <gen*x1> $15 = get root
    <gen*x1> $16 = [S0root][root]::lookup($15, $2) activate = false  <-- refers to $2!
    <gen*x1> $17 = get child [S0root->S1dense] $16
    <gen*x1> $18 = [S1dense][dense]::lookup($17, $2) activate = false  <-- refers to $2!
    <i32*x1> $19 = get child [S1dense->S2place_i32] $18
    <i32*x1> $20 : global store [$19 <- $14]
  }
}

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 21 (13 by maintainers)

Most upvoted comments

Looks like a bug in offload. I’ll fix that tonight.

Re https://github.com/taichi-dev/taichi/issues/729#issuecomment-610991469

Right, after disabling it I got the same IR, and Metal is fine.

Ah, I see where my thinking went wrong. I just noticed that in the example you gave, both $1 and $4 point to the same byte offset… Sorry about that.

However, we should also change all global $4 to global $1 when this optimization is done.

Sorry for my confusing terms. By kernel I was referring to an offloaded task within the same Taichi kernel… I agree with your claim, but it seems like there would be lots of edge cases… E.g. some offloaded task reads global ($4 + 0)… (I don't think this is actually a possible output in Taichi, but it's not forbidden in theory…)

“change all global $4 to global $1” could be done with irpass::replace_all_usages_with(). It automatically handles these cases.
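
Conceptually that replacement is just a walk over every statement's operands. A toy standalone sketch of the idea (not Taichi's actual implementation, which also recurses into nested blocks):

#include <vector>

// Toy IR: each statement holds raw pointers to its operand statements.
struct Stmt {
  std::vector<Stmt *> operands;
};

// Replace every use of old_stmt with new_stmt in a flat statement list.
void replace_all_usages_with(std::vector<Stmt *> &stmts, Stmt *old_stmt,
                             Stmt *new_stmt) {
  for (Stmt *s : stmts)
    for (Stmt *&op : s->operands)
      if (op == old_stmt)
        op = new_stmt;
}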

With the fix in #730 I get

One concern I have is that this turns a compile-time constant into a runtime one, and it now needs to read global memory.

Oh yes… It's indeed non-trivial to optimize them back into compile-time constants. We don't know whether other kernels modify the global memory between the offloads in this kernel.

(Sorry, I don't have time for real coding now.)

With the fix in #730 I get

One concern I have is that this turns a compile-time constant into a runtime one, and it now needs to read global memory.

FYI, yesterday I worked around this problem by predefining all the constants at the global level of the generated source. Thanks to SSA + unique stmt names, this works fine across different offloaded tasks. I think this can be done for GLSL as well? However, it puts restrictions on how backends can be implemented. Maybe it would be better to split ConstStmt into a ConstDefStmt and a ConstRefStmt.
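
In the generated Metal source the workaround looks roughly like this (names are made up for illustration):

// Emitted once at file scope; constant address-space globals are visible to
// every kernel function in the generated source.
constant int tmp2 = 0;  // $2 = const [0]

kernel void task_0(device char *global_tmps [[buffer(0)]]) {
  *reinterpret_cast<device int *>(global_tmps + 0) = tmp2;
}

kernel void task_2(device char *global_tmps [[buffer(0)]]) {
  // ... ::lookup(root, tmp2) ...  // compiles now, even across tasks
}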

That said, I think we should land the patch in #730, then create a separate issue to track the resulting performance impact?

(Some random thoughts for further optimization: are the following two IRs the same?)

I don’t think so? What if another kernel doesn’t know about global $1 and reads global $4?

Maybe it would be better to rename it to OffloadedTask to explicitly show that it's not a plain statement.

+N. I think this has come up a few times now while we work on the IR…

If we remove all advanced optimizations except lowering Linearized, the bug still exists. It's introduced in this simplify pass (in Simplified II):

before simplify
==========
kernel {
  <i32 x1> $0 = alloca
  <i32 x1> $1 = const [0]
  <i32 x1> $2 : local store [$0 <- $1]
  <i32 x1> $3 = alloca
  <i32 x1> $4 = const [10]
  for $3 in range($1, $4, step 1) {
    <i32 x1> $6 = const [2]
    <i32 x1> $7 = local load [ [$3[0]]]
    <i32 x1> $8 = pow $6 $7
    <i32 x1> $9 = atomic bit_or($0, $8)
  }
  <i32 x1> $10 = local load [ [$0[0]]]
  <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
  <gen*x1> $12 = get root
  <i32 x1> $13 = linearized(ind {}, stride {})
  <gen*x1> $14 = [S0root][root]::lookup($12, $13) activate = false
  <gen*x1> $15 = get child [S0root->S1dense] $14
  <i32 x1> $16 = linearized(ind {}, stride {})
  <gen*x1> $17 = [S1dense][dense]::lookup($15, $16) activate = false
  <i32*x1> $18 = get child [S1dense->S2place_i32] $17
  <i32 x1> $19 = shuffle $18[0]
  <i32 x1> $20 : global store [$19 <- $10]
}
==========
after simplify
==========
kernel {
  <i32 x1> $0 = alloca
  <i32 x1> $1 = const [0]
  <i32 x1> $2 : local store [$0 <- $1]
  <i32 x1> $3 = alloca
  <i32 x1> $4 = const [10]
  for $3 in range($1, $4, step 1) {
    <i32 x1> $6 = const [2]
    <i32 x1> $7 = local load [ [$3[0]]]
    <i32 x1> $8 = pow $6 $7
    <i32 x1> $9 = atomic bit_or($0, $8)
  }
  <i32 x1> $10 = local load [ [$0[0]]]
  <i32*x1> $11 = global ptr [S2place_i32], index [] activate=true
  <gen*x1> $12 = get root
  <gen*x1> $14 = [S0root][root]::lookup($12, $1) activate = false
  <gen*x1> $15 = get child [S0root->S1dense] $14
  <gen*x1> $17 = [S1dense][dense]::lookup($15, $1) activate = false
  <i32*x1> $18 = get child [S1dense->S2place_i32] $17
  <i32 x1> $20 : global store [$18 <- $10]
}
==========

The two const [0]s are in the same block, so it is natural to eliminate the duplicate. However, the block is later divided into several offloaded tasks.

@yuanming-hu, do you have any idea how to deal with this systematically?
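
One direction, in case it helps the discussion: after splitting the top-level block into offloaded tasks, re-materialize any constant whose uses ended up in a different task and point those uses at the local clone. A toy C++ sketch of the idea (not the actual patch in #730, which goes through global temporaries instead):

#include <memory>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy model: only constants are handled; real statements carry more state.
struct Stmt {
  bool is_const = false;
  int const_val = 0;
  std::vector<Stmt *> operands;
};

struct OffloadedTask {
  std::vector<std::unique_ptr<Stmt>> body;
};

// Make every task self-contained w.r.t. constants: if a statement uses a
// constant defined in another task, clone the constant into this task and
// rewrite the use to point at the clone.
void localize_constants(std::vector<OffloadedTask> &tasks) {
  for (auto &task : tasks) {
    std::unordered_set<Stmt *> local;  // statements defined in this task
    for (auto &s : task.body)
      local.insert(s.get());
    std::unordered_map<Stmt *, Stmt *> clones;  // foreign const -> local clone
    std::vector<std::unique_ptr<Stmt>> new_consts;
    for (auto &s : task.body) {
      for (Stmt *&op : s->operands) {
        if (op->is_const && !local.count(op)) {
          Stmt *&clone = clones[op];
          if (!clone) {
            auto c = std::make_unique<Stmt>();
            c->is_const = true;
            c->const_val = op->const_val;
            clone = c.get();
            new_consts.push_back(std::move(c));
          }
          op = clone;  // rewrite the cross-task use
        }
      }
    }
    // Prepend the clones so they appear before any of their uses.
    for (auto &c : new_consts)
      task.body.insert(task.body.begin(), std::move(c));
  }
}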

Thanks for pointing this out; I wasn't aware of optimization across offloaded task boundaries. Maybe it would be better to rename it to OffloadedTask to explicitly show that it's not a plain statement.

However, I didn’t write an optimization to eliminate constants, so I still need to find where the bug is introduced.

Is compile_to_offloads() called before each of the backends in the latest IR?

This problem will never occur if we treat OffloadedStmt as a block like IfStmt.

I disagree. We have to launch each offloaded task as a separate kernel so that the GPU backends can properly establish the memory access dependencies between these kernels. Otherwise, if you have two offloaded tasks in the same kernel/shader and they write/read the same element, it becomes very difficult to keep the data synchronized across GPU threads.

Or would this also break with IfStmt?

Probably not.

Also, this should break the OpenGL tests, but it didn't…

As I mentioned, it’s not broken at HEAD, only in my local repo.