wasmtime: idea: "Multi-return" functions for faster Rust `Result` and/or simpler exceptions
From the paper “Multi-return Function Call” (http://www.ccs.neu.edu/home/shivers/papers/mrlc-jfp.pdf). The basic idea from the perspective of compiled code is to include multiple return pointers in a stack frame so functions can return to different places.
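For a rough feel of the idea, the semantics (though not the proposed calling convention) can be sketched in today's Rust by passing explicit continuations instead of returning a tagged value; the function and parameter names here are made up for illustration:

```rust
// A sketch only: modelling "two return points" with explicit continuation
// arguments. The proposal would instead compile this as one function with
// two return addresses in its stack frame, so no closures are materialized.
fn parse_u32(
    input: &str,
    on_ok: impl FnOnce(u32) -> i64,
    on_err: impl FnOnce(&str) -> i64,
) -> i64 {
    match input.parse::<u32>() {
        Ok(n) => on_ok(n),       // "return" through the success slot
        Err(_) => on_err(input), // "return" through the failure slot
    }
}

fn main() {
    let ok = parse_u32("42", |n| n as i64, |_| -1);
    let err = parse_u32("nope", |n| n as i64, |_| -1);
    println!("{ok} {err}"); // 42 -1
}
```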
Compared to Result<T,E>
This is denotationally the same as returning a value of a Rust enum with one variant for each of the return pointer slots, with fields for the returned data associated with that slot (registers, spilled stack slots, etc.). But with the naive enum calling convention of adding a tag field, the caller needs to branch on the tag, even if the enum value was created immediately before the return, so nothing is in principle unpredictable. In the common case of a function “rethrowing” an `Err` (an `Err(e0) => ... return Err(e1) ...` match arm), the naive way results in O(n) branches (one per stack frame, each on the returned `Err` tag), while this way allows the error return pointer to point to disjoint control flow for the failure case, catching and rethrowing without additional branches, so the only branch is the original failure condition.
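To make the “rethrowing” case concrete, here is a small made-up call chain using today's `Result`; each `?` compiles to a branch on the tag even in the frames that only propagate the error:

```rust
use std::num::ParseIntError;

// Every frame below branches on the Ok/Err tag after the call returns,
// even though `middle` and `outer` only propagate the error unchanged.
fn leaf(s: &str) -> Result<u32, ParseIntError> {
    s.parse::<u32>() // the only "real" failure condition
}

fn middle(s: &str) -> Result<u32, ParseIntError> {
    let n = leaf(s)?; // `?` desugars to a match: one tag branch in this frame
    Ok(n + 1)
}

fn outer(s: &str) -> Result<u32, ParseIntError> {
    let n = middle(s)?; // another tag branch in this frame
    Ok(n * 2)
}

fn main() {
    println!("{:?}", outer("20")); // Ok(42)
    println!("{:?}", outer("x"));  // Err(ParseIntError { .. })
}
```

With the multi-return scheme, `middle` and `outer` would pass `leaf` an error return address pointing straight at their error paths, so the single branch inside `leaf` decides the whole chain.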
Compared to unwinding
Success and failure control flow are implemented identically, sparing compiler writers the significant effort of maintaining a completely separate set of implementation concepts, while optimizations can work with both and don’t get stuck on the success/failure boundary. At run time, the lack of any DWARF-like interpreter reduces dependencies and simplifies things too.
In short, we have the asymptotic efficiency of unwinding with the implementation (compiler and run-time) efficiency of enum return.
I talked to @eddyb about this once and he said to talk to someone on #cranelift, but alas I am on IRC less these days and I forgot their nick. Opening this to make sure the idea isn’t lost completely due to my negligence. Cheers.
About this issue
- Original URL
- State: open
- Created 6 years ago
- Reactions: 9
- Comments: 28 (9 by maintainers)
This paper does some measurements on a bunch of different call stack implementations, including a small section on the impact of (not) using native call/ret instructions: https://kavon.farvard.in/papers/pldi20-stacks.pdf
Recent discussion in other contexts has reminded me of this issue, and I’d like to note that the “dynamically select between multiple return addresses to jump to” strategy has a serious performance issue that (as far as I can tell) wasn’t mentioned in this thread or in the paper linked by @Ericson2314 at the start.
Modern CPUs use a specialized kind of branch predictor called the return address stack (RAS), which exploits the usually strict pairing of call and return instructions to cheaply and very accurately predict return addresses. As far as I know, this feature is ubiquitous not just among x86 processors (where it’s been present since the Pentium 1) but more generally among processors with any sort of dynamic branch prediction.
On any CPU with a RAS, returning to a different address than the one right after the corresponding `call` will always cause a misprediction (with all the costs that entails), assuming the “unusual return” is recognized as a return instruction by the CPU. If it’s not recognized as such, that’s even worse, because now there’s a superfluous return address on the RAS (pushed by the `call` and then not popped by the non-standard return). This can cause multiple subsequent returns to be mispredicted.

So while this strategy is elegant at the language and ISA level, I don’t think microarchitectures will enjoy it very much. It seems likely to me that these mispredictions will usually dwarf the cost of checking a tag and branching accordingly after each ordinary return, even taking into account that this check is replicated at each level of the call stack instead of jumping straight to the code that actually handles it (which I believe is rarely very far up the stack in practice, among other reasons due to inlining and drop glue that has to run).
That cost might be reduced by replacing the corresponding `call` with a `push; jmp` - no effect on the RAS; instead just a normal indirect branch in the callee. The return+branch may be faster still, but it’s not black and white.

Note that if a foreign `extern "C" { }` (or `extern { }`) function declaration unwinds, the behavior is undefined. Also, an uncaught panic in an `extern "C" fn` aborts the process. That is, an implementation can assume that C foreign functions and `extern "C"` functions do not unwind.

Being able to interoperate with C++ ABIs that unwind would require `extern "c++"`, and handling those ABIs will probably require inserting shims that translate Rust panics to/from C++ exceptions for the particular target (unless the `extern "c++" fn` or the `extern "c++" { }` declaration is explicitly specified to never unwind).