iree: Slow Softmax at top of MobileBert/int8 profile

Profiling MobileBert/int8/experimental(mmt4d)/dotprod, where the matmuls themselves are relatively fast, makes the rest of the workload show up more prominently in profiles.

According to Tracy, ~50% of the time is being spent in a dispatch that appears to be a Softmax. At least it plausibly looks like one: it performs some table lookups, a sum reduction, then evaluates a degree-15 polynomial approximation of a math function and multiplies the results together, as a Softmax would. And we know that MobileBert contains lots of Softmax.

TOSA “source” code of the slow loop

My own hand-deciphering of that TOSA code into pseudo-C (which is how I came to understand that it's evaluating a degree-15 polynomial, etc.).
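For illustration only (hypothetical coefficients and names, not the actual TOSA lowering), the inner computation is roughly a Horner-style fixed-point polynomial evaluation built on multiplications of the form (int64(a) * int64(b)) >> 31:

#include <stdint.h>

/* Q0.31 fixed-point multiply: (a * b) >> 31, ignoring rounding/saturation. */
static inline int32_t q31_mul(int32_t a, int32_t b) {
  return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
}

/* Horner evaluation of a degree-15 polynomial with hypothetical Q0.31
   coefficients coeffs[0..15] (highest degree first) at the Q0.31 value x. */
int32_t poly15_q31(const int32_t coeffs[16], int32_t x) {
  int32_t acc = coeffs[0];
  for (int i = 1; i < 16; ++i)
    acc = q31_mul(acc, x) + coeffs[i];
  return acc;
}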

disassembly from Tracy (image):

We can see that it’s scalar code, not SIMD. The x and w registers are scalar registers (64-bit and 32-bit, respectively).

Getting this code to vectorize properly is likely to require some measure of explicit vectorization, using at least ARM NEON intrinsics, because the efficient lowering of these fixed-point multiplications depends on fine details of the target ISA. With ARM NEON as the target, fixed-point multiplications of the form

(int64(a) * int64(b)) >> 31

should be explicitly vectorized as

vqdmulhq_s32(a, b)
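As a minimal sketch (not the actual IREE lowering), the scalar form and its NEON equivalent look like this; SQDMULH computes (2*a*b) >> 32, i.e. (a*b) >> 31, saturating the a == b == INT32_MIN corner case:

#include <arm_neon.h>
#include <stdint.h>

/* Scalar reference form of the fixed-point multiply. */
static inline int32_t fixed_mul_scalar(int32_t a, int32_t b) {
  return (int32_t)(((int64_t)a * (int64_t)b) >> 31);
}

/* NEON form: four lanes at a time via SQDMULH. */
static inline int32x4_t fixed_mul_neon(int32x4_t a, int32x4_t b) {
  return vqdmulhq_s32(a, b);
}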

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 26 (22 by maintainers)


Most upvoted comments

Dequantization of softmax has landed (#9337) and the expected performance improvement has materialized on the dashboard (https://github.com/google/iree/pull/9337#issuecomment-1151558041).

@sjarus thanks for the insights; I’ll follow up with you separately on these ramifications. With #9337 we have merely “unblocked” ourselves, in that these topics no longer block IREE from getting good e2e performance, but they’re still important topics in their own right, and eventually they’ll unlock getting softmax to perform even better than a decent dequantized impl.

We should defer the decision about arithmetic smarts to later stages. We don’t want to do it at the TOSA -> Linalg stage. If we really want it in IREE, it can be done before fusion.

To add a concrete example to this, there are fusion opportunities for Softmax/Logistic and the ops that process their output, potentially bypassing most of the dequantizing or exp calculation.

In vision models, Softmax is usually followed by some thresholding, e.g. discarding results where the value (probability) is < 0.5. This logic can be fused with the Softmax so that only values > 0.5 (for quantized models, the quantized equivalent of 0.5) are included in the exp calculation. In practice, this eliminates the need to run exp on >80% of the values.

The logic would change from one of:

  • select(softmax(N) > quantized(0.5))
  • select(dequantize(softmax(N)) > 0.5)
  • select(softmax(dequantize(N)) > 0.5)

to:

softmax(select(N > quantized(0.5)))

where the output of select() is much smaller than N.
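Here is a purely illustrative float sketch (made-up names, not quantized, not IREE code) of that fused shape: the threshold is applied before exp, so only the surviving entries pay for the exp evaluation, and everything else gets probability 0:

#include <math.h>
#include <stddef.h>

/* Fused shape: keep only inputs above `threshold`, then softmax over the
   survivors. Rejected entries never reach expf(). */
void softmax_of_selected(const float *in, float *out, size_t n, float threshold) {
  float max_val = -INFINITY;
  for (size_t i = 0; i < n; ++i)
    if (in[i] >= threshold && in[i] > max_val) max_val = in[i];

  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    if (in[i] >= threshold) {
      out[i] = expf(in[i] - max_val);   /* exp only on the selected subset */
      sum += out[i];
    } else {
      out[i] = 0.0f;
    }
  }
  if (sum > 0.0f)
    for (size_t i = 0; i < n; ++i) out[i] /= sum;
}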

Small world - this bit references gemmlowp code I wrote years ago. I had deciphered the constants 180/255 and 120/255 above and hadn’t bothered to simplify the fractions, but 255 = 15*17, and simplifying away the factors of 15, then rescaling by 4 (a change of fixed-point format), leaves the traditional 48/17 and 32/17 coefficients of Newton-Raphson division. It all makes sense now 😃
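For reference, a small sketch of where those coefficients come from: 180/255 = 12/17 and 120/255 = 8/17, and rescaling by 4 gives the classic Newton-Raphson reciprocal initial estimate x0 = 48/17 - (32/17)*d for d normalized to [0.5, 1):

#include <stdio.h>

/* Newton-Raphson reciprocal of d in [0.5, 1), seeded with the 48/17, 32/17
   initial estimate that the 180/255, 120/255 constants reduce to. */
static float recip_newton(float d) {
  float x = 48.0f / 17.0f - (32.0f / 17.0f) * d;  /* initial estimate */
  for (int i = 0; i < 3; ++i)
    x = x * (2.0f - d * x);                       /* refinement steps */
  return x;
}

int main(void) {
  printf("1/0.75 ~= %f\n", recip_newton(0.75f));  /* prints ~1.333333 */
  return 0;
}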