wasmtime: idea: "Multi-return" functions for faster Rust `Result` and/or simpler exceptions
From the paper “Multi-return Function Call” (http://www.ccs.neu.edu/home/shivers/papers/mrlc-jfp.pdf). The basic idea from the perspective of compiled code is to include multiple return pointers in a stack frame so functions can return to different places.
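For a rough feel of the idea, the semantics (though not the proposed calling convention) can be sketched in today's Rust by passing explicit continuations instead of returning a tagged value; the function and parameter names here are made up for illustration:

```rust
// A sketch only: modelling "two return points" with explicit continuation
// arguments. The proposal would instead compile this as one function with
// two return addresses in its stack frame, so no closures are materialized.
fn parse_u32(
    input: &str,
    on_ok: impl FnOnce(u32) -> i64,
    on_err: impl FnOnce(&str) -> i64,
) -> i64 {
    match input.parse::<u32>() {
        Ok(n) => on_ok(n),       // "return" through the success slot
        Err(_) => on_err(input), // "return" through the failure slot
    }
}

fn main() {
    let ok = parse_u32("42", |n| n as i64, |_| -1);
    let err = parse_u32("nope", |n| n as i64, |_| -1);
    println!("{ok} {err}"); // 42 -1
}
```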
Compared to Result<T,E>
This is denotationally the same as returning a value of a Rust enum with one variant for each of the return pointer slots, with fields for the returned data associated with that slot (registers, spilled stack slots, etc.). But with the naive enum calling convention of adding a tag field, the caller needs to branch on the tag, even if the enum value was created immediately before the return, so nothing is in principle unpredictable. In the common case of a function “rethrowing” an `Err` (an `Err(e0) => ... return Err(e1) ...` match arm), the naive way results in O(n) branches (one per stack frame, each on the returned `Err` tag), while this way allows the error return pointer to point to disjoint control flow for the failure case, catching and rethrowing without additional branches, so the only branch is the original failure condition.
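To make the “rethrowing” case concrete, here is a small made-up call chain using today's `Result`; each `?` compiles to a branch on the tag even in the frames that only propagate the error:

```rust
use std::num::ParseIntError;

// Every frame below branches on the Ok/Err tag after the call returns,
// even though `middle` and `outer` only propagate the error unchanged.
fn leaf(s: &str) -> Result<u32, ParseIntError> {
    s.parse::<u32>() // the only "real" failure condition
}

fn middle(s: &str) -> Result<u32, ParseIntError> {
    let n = leaf(s)?; // `?` desugars to a match: one tag branch in this frame
    Ok(n + 1)
}

fn outer(s: &str) -> Result<u32, ParseIntError> {
    let n = middle(s)?; // another tag branch in this frame
    Ok(n * 2)
}

fn main() {
    println!("{:?}", outer("20")); // Ok(42)
    println!("{:?}", outer("x"));  // Err(ParseIntError { .. })
}
```

With the multi-return scheme, `middle` and `outer` would pass `leaf` an error return address pointing straight at their error paths, so the single branch inside `leaf` decides the whole chain.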
Compared to unwinding
Success and failure control flow are implemented identically, sparing compiler writers the significant effort of maintaining a completely separate set of implementation concepts, while optimizations can work with both and don’t get stuck on the success/failure boundary. At run time, the lack of any DWARF-like interpreter reduces dependencies and simplifies things too.
In short, we have the asymptotic efficiency of unwinding with the implementation (compiler and run-time) efficiency of enum return.
I talked to @eddyb about this once and he said to talk to someone on #cranelift, but alas I am on IRC less these days and I forgot their nick. Opening this to make sure the idea isn’t lost completely due to my negligence. Cheers.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 9
- Comments: 28 (9 by maintainers)
This paper does some measurements on a bunch of different call stack implementations, including a small section on the impact of (not) using native call/ret instructions: https://kavon.farvard.in/papers/pldi20-stacks.pdf
Recent discussion in other contexts has reminded me of this issue, and I’d like to note that the “dynamically select between multiple return addresses to jump to” strategy has a serious performance issue that (as far as I can tell) wasn’t mentioned in this thread or in the paper linked by @Ericson2314 at the start.
Modern CPUs use a specialized kind of branch predictor called the return address stack (RAS), which exploits the usually strict pairing of call and return instructions to cheaply and very accurately predict return addresses. As far as I know, this feature is ubiquitous not just among x86 processors (where it’s been present since the Pentium 1) but more generally among processors with any sort of dynamic branch prediction.
On any CPU with a RAS, returning to a different address than the one right after the corresponding `call` will always cause a misprediction (with all the costs that entails), assuming the “unusual return” is recognized as a return instruction by the CPU. If it’s not recognized as such, that’s even worse, because now there’s a superfluous return address on the RAS (pushed by the `call` and then not popped by the non-standard return). This can cause multiple subsequent returns to be mispredicted.

So while this strategy is elegant at the language and ISA level, I don’t think microarchitectures will enjoy it very much. It seems likely to me that these mispredictions will usually dwarf the cost of checking a tag and branching accordingly after each ordinary return, even taking into account that this check is replicated at each level of the call stack instead of jumping straight to the code that actually handles it (which I believe is rarely very far up the stack in practice, among other reasons due to inlining and drop glue that has to run).
That cost might be reduced by replacing the corresponding `call` with a `push; jmp` - no effect on the RAS; instead just a normal indirect branch in the callee. The return+branch may be faster still, but it’s not black and white.

Note that if a foreign `extern "C" { }` (or `extern { }`) function declaration unwinds, the behavior is undefined. Also, an uncaught panic in an `extern "C" fn` aborts the process. That is, an implementation can assume that C foreign functions and `extern "C"` functions do not unwind.

Being able to interoperate with C++ ABIs that unwind would require `extern "c++"`, and handling those ABIs will probably require inserting shims that translate Rust panics to/from C++ exceptions for the particular target (unless the `extern "c++" fn` or the `extern "c++" { }` declaration is explicitly specified to never unwind).