taichi: IR optimization should not eliminate stmts across the offloaded task boundaries?
Describe the bug
I noticed that the latest IR passes could eliminate constants across offloaded task boundaries. This unfortunately broke the Metal backend. (Note that this is not broken at HEAD; it just broke my local work, as I’m replacing Metal’s passes with compile_to_offloads().)
On Metal, each offloaded task is created as a separate kernel function. And a ConstStmt is just translated to const auto {stmt->name()} = {stmt->val.stringify()};.
I wonder if this optimization is only done for constants? If so, maybe I can predefine all the ConstStmts as global constants in the generated source code. But if it is also applied to non-const statements, then it’s probably gonna be a real bug. Also, it doesn’t break the LLVM backends. Is this because LLVM would inline the constants directly before producing the machine code?
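To make the failure mode concrete, here is a rough sketch of the kind of MSL the Metal backend would generate for the broken IR below. This is only an illustration: the function names (offloaded_task_0, offloaded_task_2), variable names (tmp2, global_tmps), and signatures are made up, not the actual emitted identifiers.

```cpp
#include <metal_stdlib>
using namespace metal;

// First offloaded task ($0): the constant is a function-local variable.
kernel void offloaded_task_0(device int32_t *global_tmps [[buffer(0)]]) {
  const auto tmp2 = 0;      // $2 = const [0]
  global_tmps[0] = tmp2;    // $3: global store [$1 <- $2]
  global_tmps[0] = tmp2;    // $5: global store [$4 <- $2] (same 0 B offset)
}

// Third offloaded task ($12): in the broken IR, $16/$18 use $2 as the lookup
// index, but tmp2 exists only inside offloaded_task_0, so this function would
// reference an undeclared identifier and the generated source fails to compile.
kernel void offloaded_task_2(device int32_t *global_tmps [[buffer(0)]]) {
  const int32_t idx = tmp2; // <-- error: use of undeclared identifier 'tmp2'
  // ... [S0root]::lookup(..., idx), get child, global store ...
}
```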
Log/Screenshots
Detected from test_local_atomics.py::test_implicit_local_atomic_or
Good
kernel {
$0 = offloaded {
<i32*x1> $1 = global tmp var (offset = 0 B)
<i32 x1> $2 = const [0] <--
<i32*x1> $3 : global store [$1 <- $2]
<i32*x1> $4 = global tmp var (offset = 0 B)
<i32*x1> $5 : global store [$4 <- $2]
}
$6 = offloaded range_for(0, 10) block_dim=adaptive {
<i32 x1> $7 = const [2]
<i32 x1> $8 = loop index 0
<i32 x1> $9 = pow $7 $8
<i32*x1> $10 = global tmp var (offset = 0 B)
<i32 x1> $11 = atomic bit_or($10, $9)
}
$12 = offloaded {
<i32*x1> $13 = global tmp var (offset = 0 B)
<i32 x1> $14 = global load $13
<gen*x1> $15 = get root
<i32 x1> $16 = const [0] <-- OK
<gen*x1> $17 = [S0root][root]::lookup($15, $16) activate = false
<gen*x1> $18 = get child [S0root->S1dense] $17
<gen*x1> $19 = [S1dense][dense]::lookup($18, $16) activate = false
<i32*x1> $20 = get child [S1dense->S2place_i32] $19
<i32*x1> $21 : global store [$20 <- $14]
}
}
Broken
- Note that $2 in the first offloaded task is referenced in the third one.
kernel {
$0 = offloaded {
<i32*x1> $1 = global tmp var (offset = 0 B)
<i32 x1> $2 = const [0] <--
<i32*x1> $3 : global store [$1 <- $2]
<i32*x1> $4 = global tmp var (offset = 0 B)
<i32*x1> $5 : global store [$4 <- $2]
}
$6 = offloaded range_for(0, 10) block_dim=adaptive {
<i32 x1> $7 = const [2]
<i32 x1> $8 = loop index 0
<i32 x1> $9 = pow $7 $8
<i32*x1> $10 = global tmp var (offset = 0 B)
<i32 x1> $11 = atomic bit_or($10, $9)
}
$12 = offloaded {
<i32*x1> $13 = global tmp var (offset = 0 B)
<i32 x1> $14 = global load $13
<gen*x1> $15 = get root
<gen*x1> $16 = [S0root][root]::lookup($15, $2) activate = false <-- refers to $2!
<gen*x1> $17 = get child [S0root->S1dense] $16
<gen*x1> $18 = [S1dense][dense]::lookup($17, $2) activate = false <-- refers to $2!
<i32*x1> $19 = get child [S1dense->S2place_i32] $18
<i32*x1> $20 : global store [$19 <- $14]
}
}
About this issue
- State: closed
- Created 4 years ago
- Reactions: 2
- Comments: 21 (13 by maintainers)
Looks like a bug in offload. I’ll fix that tonight.

Re https://github.com/taichi-dev/taichi/issues/729#issuecomment-610991469: Right, after disabling it I got the same IR, and Metal is fine.
Ah i know where i was thinking wrong. I just noticed that in the example you gave, both $1 and $4 pointed to the same byte offset… Sorry about that.

“change all global $4 to global $1” could be done with irpass::replace_all_usages_with(). It automatically handles these cases.

Oh yes… It’s indeed non-trivial to optimize them again to compile-time constants. We don’t know if other kernels modify the global memory between the offloads in this kernel.
(sorry i don’t have time for real coding now)

One concern i have is that, this turns a compile-time constant into a runtime one, and it now needs to read global memory.
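To illustrate the concern concretely: if the offload pass resolves cross-task references by spilling the value to a global temporary (as discussed above), the later task re-loads it at runtime instead of seeing a foldable constant. A minimal sketch of the generated Metal source, assuming a hypothetical spill slot at offset 4 B (identifiers and offsets are illustrative, not the actual output of the patch):

```cpp
#include <metal_stdlib>
using namespace metal;

// Task $0: spill the former compile-time constant into a global temporary.
kernel void offloaded_task_0(device int32_t *global_tmps [[buffer(0)]]) {
  const auto tmp2 = 0;       // $2 = const [0]
  global_tmps[1] = tmp2;     // hypothetical spill slot at offset 4 B
  // ... original global stores to offset 0 B ...
}

// Task $12: the lookup index is now a runtime read from global memory,
// no longer a compile-time constant the backend compiler can fold.
kernel void offloaded_task_2(device int32_t *global_tmps [[buffer(0)]]) {
  const int32_t idx = global_tmps[1];  // <-- extra global-memory load
  // ... [S0root]::lookup(..., idx) ...
}
```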
FYI, yesterday i got away with this problem by predefining all the constants at the global level. Thanks to SSA + unique stmt names, this worked fine across different offloaded tasks. I think this can be done for GLSL as well? However, that puts restrictions on how backends can be implemented. Maybe it would be better to split into ConstDefStmt and ConstRefStmt.
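For reference, the generated source with this workaround looks roughly like the sketch below (same caveats as before: identifiers are illustrative). Every ConstStmt is emitted once at file scope in the constant address space, so any kernel function can reference it, and SSA plus unique statement names avoid clashes:

```cpp
#include <metal_stdlib>
using namespace metal;

// All ConstStmts hoisted to file scope under their unique statement names.
constant int32_t tmp2 = 0;   // $2 = const [0]
constant int32_t tmp7 = 2;   // $7 = const [2]

kernel void offloaded_task_0(device int32_t *global_tmps [[buffer(0)]]) {
  global_tmps[0] = tmp2;
}

kernel void offloaded_task_2(device int32_t *global_tmps [[buffer(0)]]) {
  // tmp2 is visible here even though its ConstStmt lives in the first task.
  // ... [S0root]::lookup(..., tmp2) ...
}
```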
That said, i think we should patch in #730, then create a separate issue to track the performance impact because of this?

I don’t think so? What if another kernel doesn’t know about global $1 and reads global $4?

+N. I think this came up a few times now when we work on IR…
If we remove all advanced optimizations but lowering (Linearized), the bug still exists. It’s introduced in the simplify pass (in Simplified II).

They are in the same block, so it would be natural to eliminate duplicate const [0]s. However, the block is divided into several offloaded tasks in the end.

@yuanming-hu do you have any idea to systematically deal with this?
Thanks for pointing this out – I wasn’t aware of optimization across offloaded task boundaries. Maybe it would be better to rename it to OffLoadedTask to explicitly show that it’s not a plain statement.

However, I didn’t write an optimization to eliminate constants, so I still need to find where the bug is introduced.

Is compile_to_offloads() called before each of the backends in the latest IR?

I disagree. We have to launch each offloaded task as a separate kernel, so that the GPU backends can properly establish the memory access dependencies between these kernels. Otherwise, if two offloaded tasks in the same kernel/shader write/read the same element, it becomes very difficult to keep the data synchronized across GPU threads.
Probably not.
As I mentioned, it’s not broken at HEAD, only in my local repo.