go: runtime: add mul intrinsic for overflow checks

This is neither a language change nor introducing or changing a public api but discussing a runtime implementation detail.

Motivation

Profiling over production programs reveals that growslice and makemap are hot functions (even outside the malloc part). I am working on to improve their performance and correctness for go1.10.

One part of these functions is at least one check for len > _MaxMem/elemSize which contains a costly division and needs a previous check for elemSize != 0.

(Update: the check can also have variants like len*elemSize > _MaxMem-smallAmount or roundup(len*elemSize) > _MaxMem with an overflow check like uintptr(size) > maxSliceCap(elemSize) before that in the general case involves a division )

We have resorted to switches with constant divisions in growslice to improve performance for common cases. This introduces branch mis predicts and a larger function size which is bad for icaches.

Elsewhere we use maxSliceCap with precomputed values (_MaxMem/elemSize) which introduces cache misses, uses additional memory and only works for small element sizes.

An alternative to the above optimizations that avoids data cache lookups and a division is to check len*elemSize > _MaxMem. Care needs to be taken that len*elemSize does not overflow.

To be able to check overflow quickly and safely i propose the following:

Proposal

Add a new unexported runtime function runtime.mul:

func mul(a, b uint) (uint, bool) {
	return a * b, b != 0 && a > ^uint(0)/b
}

The function can be used like:

if n, overflow := mul(len, elemSize); overflow || n > _MaxMem {
   panic("Does not fit into Memory")
}

Let the compiler recognize runtime.mul as an intrinsic and replace it with optimized inline instructions where possible.

On architectures that provide overflow/carryflags this can be a mul and setting overflow according to the flag. Otherwise use the provided generic implementation. (Note that a*b is always returned also in the generic implementation to match optimizations in machine code.)

Thereby we can eliminate maxSliceCap and shrink the switch cases without loss of (much) performance but better branch prediction and cache use and shrinking the functions away from special cases again.

Possible future directions and ideas for later extra proposals

  • add a similar function for add (the name is already taken in the runtime namespace) or just optimize overflow checks for adds better if this is not already the case
  • if runtime.mul provides nice code and performance improvements after evaluation of the implementation and use cases we could add such a function to math(.Bits) (and just let the compiler do the existing optimization to that name). Note that math.Bits are already intrinsics.
  • add detection of generic overflow checking in the ssa backend such that any overflow checking code is optimized

If the introduction of the above function and optimization looks ok i would like to work on implementing it and then replace existing runtime codepaths that would benefit from the new function.

/cc @khr @josharian @ianlancetaylor @stemar94

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Reactions: 1
  • Comments: 15 (8 by maintainers)

Commits related to this issue

Most upvoted comments

Seems unobjectionable to add such a runtime function, although it should probably go in runtime/internal/sys, with the other runtime intrinsics.

If we decide to add something like it to math/bits as well, we might decide to give the runtime flavor looser semantics for performance (like we have done with other runtime intrinsics that also appear in math/bits). FWIW, I recently wanted math/bits to have something like Mul128(x, y int64) (lo, hi int64) for use with math/rand.

Longer term, though, it seems perhaps better to fix this either via:

  • adding [u]int128 to the language (happy to write more about this if you want)
  • adding arbitrary precision integers to the language (#19623)

Note that with either of these, we should be able to get equivalent performance, with enough compiler help. (1) In the arbitrary precision case, the compiler could in theory recognize that the integer is in use only in one expression, which bounds its range, and replace it under the hood with a 128 bit version. (2) Working with a 128 bit multiply, the compiler can recognize that the top half of a multiply is being checked against 0, so it can substitute a pure overflow check there.