pprof-rs: Panic on aarch64

tikv/tikv#10658

TiKV on a HUAWEI Kunpeng 920 failed to profile and got a panic.

#0  0x0000fffd7b6aceb4 in ?? () from /lib64/libgcc_s.so.1
#1  0x0000fffd7b6ae534 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x0000aaac01eedb58 in backtrace::backtrace::libunwind::trace (cb=...) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.37/src/backtrace/libunwind.rs:88
#3  backtrace::backtrace::trace_unsynchronized (cb=...) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.37/src/backtrace/mod.rs:66
#4  pprof::profiler::perf_signal_handler (_signal=<optimized out>) at /root/.cargo/registry/src/github.com-1ecc6299db9ec823/pprof-0.4.2/src/profiler.rs:128
#5  <signal handler called>

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 46 (37 by maintainers)

Most upvoted comments

There is some confusion in various comments on this issue so I’ll try to clear up some of it.

The difference between stack walking and symbolication, and how it relates to inline functions

  1. Stack walking is how you get from register values + raw stack memory to a list of code addresses.
  2. Symbolication is how you get from the list of code addresses to a list of function names.

The resolution of inline functions happens during symbolication: A single address can resolve to one or more functions. If the address is inside code which the compiler inlined into another function, then you get both the inlined function name (or even multiple inline function names, if the compiler inlined multiple levels deep), and then the outer function name. Whether you get inline functions is completely independent from how you walk the stack.
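To make the "one address, several names" point concrete, here is a toy sketch of the symbolication side only. The table layout, the `Function` and `InlineRange` types, and all addresses are hypothetical; a real symbolicator reads this information from debug sections instead of a hand-written table.

```rust
/// A range of code that the compiler inlined into an outer function.
/// Hypothetical layout, for illustration only.
struct InlineRange {
    start: u64,
    end: u64, // exclusive
    name: &'static str,
    depth: u32, // 1 = inlined directly into the outer function
}

/// An outer (non-inlined) function and the inlined ranges inside it.
struct Function {
    start: u64,
    end: u64, // exclusive
    name: &'static str,
    inlined: Vec<InlineRange>,
}

/// Resolve one code address to a list of names: innermost inlined
/// function first, the outer function last.
fn symbolicate(table: &[Function], addr: u64) -> Vec<&'static str> {
    let func = match table.iter().find(|f| f.start <= addr && addr < f.end) {
        Some(f) => f,
        None => return Vec::new(),
    };
    let mut hits: Vec<&InlineRange> = func
        .inlined
        .iter()
        .filter(|r| r.start <= addr && addr < r.end)
        .collect();
    // Deepest inline level first.
    hits.sort_by(|a, b| b.depth.cmp(&a.depth));
    let mut names: Vec<&'static str> = hits.iter().map(|r| r.name).collect();
    names.push(func.name);
    names
}

/// Made-up table: `main` with `func1_inlined` inlined into part of it.
fn example_table() -> Vec<Function> {
    vec![
        Function {
            start: 0x1000,
            end: 0x1040,
            name: "main",
            inlined: vec![InlineRange {
                start: 0x1010,
                end: 0x1020,
                name: "func1_inlined",
                depth: 1,
            }],
        },
        Function { start: 0x1040, end: 0x1080, name: "func2", inlined: vec![] },
    ]
}
```

With this table, an address inside the inlined range resolves to both `func1_inlined` and `main`, while an address elsewhere in `main` resolves to `main` alone. Nothing in the lookup depends on how the address was obtained.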

If you want to do stack walking “offline”, you need to capture the entire stack bytes. If you want to do inline function resolution offline, you only need to capture the code addresses on the stack (instruction pointer + return addresses) and enough information to be able to match an address to the library it was in.
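As a rough sketch of the "enough information to match an address to its library" part: one option is to store, for every captured code address, the module it fell into and the offset inside that module. The `Module` and `RelativeAddress` types and their field names are assumptions made for this example, not taken from any particular profiler.

```rust
/// Where one library or executable is mapped in the profiled process.
/// Field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Module {
    name: String,
    base: u64, // load address of the mapping
    size: u64, // length of the mapping in bytes
}

/// What gets stored per frame for offline symbolication: which module
/// the address was in, and the offset inside that module.
#[derive(Debug, Clone, PartialEq)]
struct RelativeAddress {
    module: String,
    offset: u64,
}

/// Convert an absolute code address into a module-relative one.
/// Returns None if the address is not inside any known module.
fn to_relative(modules: &[Module], addr: u64) -> Option<RelativeAddress> {
    modules
        .iter()
        .find(|m| addr >= m.base && addr - m.base < m.size)
        .map(|m| RelativeAddress {
            module: m.name.clone(),
            offset: addr - m.base,
        })
}
```

The offset stays meaningful on another machine even though the load address differs from run to run, which is why the module list has to be captured together with the addresses.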

A slightly confusing part here is that both stack walking and symbolication can make use of DWARF information. However, they use different subsets of it. DWARF stack walking information is stored in the eh_frame or debug_frame sections. DWARF symbolication information is stored in other sections whose names start with debug_, but not in debug_frame.

Unwinding / stack walking on macOS

On macOS x86_64 and arm64, all system libraries are compiled with frame pointers enabled. Having frame pointers enabled is also the default for clang and Xcode, unless you manually set -fomit-frame-pointer. So frame pointer stack walking mostly works fine, unless you’re profiling a program that has been compiled with -fomit-frame-pointer.

There is one exception: On arm64, leaf functions don’t have frame pointers even if you compile with frame pointers enabled. This means that on arm64, if you just use frame pointer unwinding, you will be missing the second frame in the stack if you’re currently inside a leaf function: The first frame will be correct (it comes straight from the instruction pointer), the immediate caller is missing, and the frame pointer gives you the caller’s caller, i.e. the third frame. From that point on the rest of stack unwinding works fine.

To unwind leaf functions correctly, you need to look at the compact unwind info in the __unwind_info section.
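The missing second frame can be reproduced with a small simulation of plain frame pointer walking. This is only a sketch: the `Memory` map stands in for real stack memory, and every address and function role (leaf, caller, grandparent, root) is made up for the example.

```rust
use std::collections::HashMap;

/// Simulated stack memory: address -> 8-byte value stored there.
type Memory = HashMap<u64, u64>;

/// Walk the chain of frame records starting from `pc` and `fp`.
/// On arm64 a frame record is two words: [fp] holds the caller's fp,
/// [fp + 8] holds the return address into the caller.
fn walk_frame_pointers(mem: &Memory, pc: u64, mut fp: u64) -> Vec<u64> {
    let mut frames = vec![pc];
    // The depth cap guards against a corrupted, cyclic chain.
    while fp != 0 && frames.len() < 64 {
        match (mem.get(&fp), mem.get(&fp.wrapping_add(8))) {
            (Some(&caller_fp), Some(&return_address)) => {
                frames.push(return_address);
                fp = caller_fp;
            }
            // Stop instead of crashing if the record is not readable.
            _ => break,
        }
    }
    frames
}

/// State while executing a frameless leaf function:
///   pc = 0x1000 (inside the leaf)
///   lr = 0x2000 (return address into the caller; not visible to this walker)
///   fp = 0x7000 (still the *caller's* frame record)
fn leaf_scenario() -> (Memory, u64, u64) {
    let mut mem = Memory::new();
    // Caller's frame record.
    mem.insert(0x7000, 0x7020); // saved fp of the grandparent
    mem.insert(0x7008, 0x3000); // return address into the grandparent
    // Grandparent's frame record, the end of the chain.
    mem.insert(0x7020, 0);
    mem.insert(0x7028, 0x4000); // return address into the root
    (mem, 0x1000, 0x7000)
}
```

Walking this scenario yields 0x1000, 0x3000, 0x4000: the leaf, the grandparent and the root. The caller (0x2000) never appears, because its return address lives only in lr and the walker never looks there.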

Unwind information sections

On macOS, most binaries will have an __unwind_info section, and some will also have an __eh_frame section. For complete unwinding, you need both: __unwind_info covers the simple cases, and __eh_frame covers the hard cases.

Here’s how it breaks down.

On x86_64:

  • Frame pointer unwinding works well, except for binaries without frame pointers.
  • For binaries without frame pointers, __unwind_info covers 99% of the functions.
  • __eh_frame is needed to correctly unwind the remaining 1% of cases.

On arm64:

  • Frame pointer unwinding works ok but has the missing second frame for leaf functions.
  • __unwind_info lets you unwind leaf functions correctly, but still requires frame pointers for non-leaf functions.
  • __eh_frame lets you unwind if you don’t have frame pointers.

I’ve written a crate called framehop which is a pure Rust implementation of everything you need to get reliable and correct unwinding on macOS. It doesn’t have much documentation yet though. Edit: Documentation is in place now.

Right~ This also seems to be possible by modifying nongnu/libunwind (by converting ucontext to unw_context_t instead of calling unw_getcontext() in the signal handler).

nongnu/libunwind claims to support unwinding from a signal handler (and accepts an argument that tells it whether we are in a signal handler), though we haven’t fully tested it yet.

So I reimplemented a minimal subset of libunwind, to try to avoid the problems you said.

It’s always good to see another implementation of libunwind. I hope it turns out better than the existing ones (we have suffered a lot with these 😭 )

Thanks to @SchrodingerZhu. I have tried to use nongnu libunwind and the unw_xxx API to get the backtrace on ARM. It works surprisingly well!

The modification on backtrace-rs can be found in https://github.com/YangKeao/backtrace-rs/commit/ee71341ca6a1ea68e2c60677a4d9b31f0042378b

@sticnarf @mornyx Building nongnu-libunwind without enable-cxx-exceptions will not build the _Unwind_XXX functions (e.g. _Unwind_Backtrace), which is also the default behavior of the nongnu-libunwind shipped by Ubuntu. I think it’s fine to statically link against it directly.

libunwind-sys needs to be modified to support static linking. I will try to modify it and submit a PR.

These functions leave x29 (“fp”) unchanged. And because they don’t call any other functions, the lr register also stays unchanged. So unwinding from these functions only means “get the return address from lr and leave all other registers unchanged”.

I did a test and it does exactly what @mstange says, here is a simple demo:

int func1() {
    int x = 1;
    int y = 2;
    return x + y;
}

int func2() {
    return func1() + 1;
}

int func3() {
    return func2() + 1;
}

On my macOS, the ARM64 asm code generated by clang is as follows:

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0	sdk_version 12, 3
	.globl	_func1                          ; -- Begin function func1
	.p2align	2
_func1:                                 ; @func1
	.cfi_startproc
; %bb.0:
	sub	sp, sp, #16
	.cfi_def_cfa_offset 16
	mov	w8, #1
	str	w8, [sp, #12]
	mov	w8, #2
	str	w8, [sp, #8]
	ldr	w8, [sp, #12]
	ldr	w9, [sp, #8]
	add	w0, w8, w9
	add	sp, sp, #16
	ret
	.cfi_endproc
                                        ; -- End function
	.globl	_func2                          ; -- Begin function func2
	.p2align	2
_func2:                                 ; @func2
	.cfi_startproc
; %bb.0:
	stp	x29, x30, [sp, #-16]!           ; 16-byte Folded Spill
	mov	x29, sp
	.cfi_def_cfa w29, 16
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	bl	_func1
	add	w0, w0, #1
	ldp	x29, x30, [sp], #16             ; 16-byte Folded Reload
	ret
	.cfi_endproc
                                        ; -- End function
	.globl	_func3                          ; -- Begin function func3
	.p2align	2
_func3:                                 ; @func3
	.cfi_startproc
; %bb.0:
	stp	x29, x30, [sp, #-16]!           ; 16-byte Folded Spill
	mov	x29, sp
	.cfi_def_cfa w29, 16
	.cfi_offset w30, -8
	.cfi_offset w29, -16
	bl	_func2
	add	w0, w0, #1
	ldp	x29, x30, [sp], #16             ; 16-byte Folded Reload
	ret
	.cfi_endproc
                                        ; -- End function
.subsections_via_symbols

Assume we are currently executing func1. If we start unwinding directly inside func1, the starting point is the instruction pointer, which lies within func1. But since func1 does not save the FP register (x29), the frame pointer still points to func2’s frame record, and the return address stored in that record points into func3. So unwinding from func1 jumps directly to func3 and skips func2.

But in a regular backtracking scenario, this problem can be easily avoided. We usually wrap the stack backtrace as a function like backtrace(), and in the implementation of backtrace(), we call a function like getcontext() to initialize the register context. So with getcontext() as a leaf function, we skip the backtrace() function when backtracking, which is exactly what we want.

The only thing that needs to be done is to ensure that the backtrace() and getcontext() functions are not inlined.

The Rust demo below confirms this:

use unwind::{unwind_init_registers, Registers};

#[inline(never)]
fn main() {
    func1();
}

#[inline(never)]
fn func1() -> i32 {
    func2() + 1
}

#[inline(never)]
fn func2() -> i32 {
    func3() + 1
}

#[inline(never)]
fn func3() -> i32 {
    let x = 1;
    let y = 2;
    backtrace();
    x + y
}

#[inline(never)]
fn backtrace() {
    let mut registers = Registers::default();
    unsafe {
        // Similar to `unw_getcontext()`.
        unwind_init_registers(&mut registers as _);
    }
    // Do stack backtrace.
    while registers[29] != 0 {
        let pc = load::<u64>(registers[29] + 8); // x29 + 8 points to `Return Address`
        registers[29] = load::<u64>(registers[29]);

        // Show function name.
        println!("{:#x}", pc);
        backtrace::resolve(pc as _, |s| {
            println!("    {:?}", s.name());
        });
    }
}

#[inline]
fn load<T: Copy>(address: u64) -> T {
    unsafe { *(address as *const T) }
}

The output is (on my ARM64 macOS):

0x100755b38
    Some(demo::func3::h552eb43de9d1600a)
0x100755acc
    Some(demo::func2::hee00c49c139ef337)
0x100755a70
    Some(demo::func1::hc597aa0db27740e8)
0x100755a54
    Some(demo::main::h60f61d40767d4a8f)
0x100756a84
    Some(core::ops::function::FnOnce::call_once::h3e2c7a62c3d8b7b0)
0x100756db8
    Some(std::sys_common::backtrace::__rust_begin_short_backtrace::h344b61623a7d58a2)
0x100756d70
    Some(std::rt::lang_start::{{closure}}::h16ad8343d4bef89d)
0x10082466c
    Some(core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h8eb3ac20f80eabfa)
    Some(std::panicking::try::do_call::ha6ddf2c638427188)
    Some(std::panicking::try::hda8741de507c1ad0)
    Some(std::panic::catch_unwind::h82424a01f258bd39)
    Some(std::rt::lang_start_internal::{{closure}}::h67e296ed5b030b7b)
    Some(std::panicking::try::do_call::hd3dd7e7e10f6424e)
    Some(std::panicking::try::ha0a7bd8122e3fb7c)
    Some(std::panic::catch_unwind::h809b0e1092e9475d)
    Some(std::rt::lang_start_internal::h358b6d58e23c88c7)
0x100756d38
    Some(std::rt::lang_start::h74b9170615de9a35)
0x100755d44
    Some("_main")
0x100cb5088
0xb11b000000000000

This is exactly what we expected.

However, when it comes to CPU profiling, if a leaf function is interrupted by SIGPROF, I think its parent function will indeed be skipped. This can be avoided by compiling with -mno-omit-leaf-frame-pointer.

This means that on arm64, if you just use frame pointer unwinding, you will be missing the second frame in the stack if you’re currently inside the leaf function: The first frame will be correct (it comes straight from the instruction pointer), the immediate caller is missing, and the frame pointer gives you the caller’s caller, i.e. the third frame.

For non-leaf frame pointer, will the frame pointer register (rbp or x29) be modified by the leaf function?

Just to clarify: I am talking about the subset of functions which, on arm64, will not create a “frame record” for themselves even if you compile with frame pointers enabled. This happens for functions which do not call other functions (i.e. which are “leaf” functions) and which also don’t need to save and restore any registers.

These functions leave x29 (“fp”) unchanged. And because they don’t call any other functions, the lr register also stays unchanged. So unwinding from these functions only means “get the return address from lr and leave all other registers unchanged”.
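That rule can be sketched as one extra case in a per-frame unwind step. Everything here is hypothetical (the `Regs` and `Rule` types, the simulated memory map, the addresses); in a real unwinder the rule for the current function would come from unwind information, whereas this sketch just passes it in.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Regs {
    pc: u64,
    lr: u64,
    fp: u64,
}

#[derive(Debug, Clone, Copy)]
enum Rule {
    /// Frameless leaf: the return address is still in lr, fp is untouched.
    NoFrameUseLr,
    /// Normal function: read the frame record that fp points at.
    UseFramePointer,
}

/// Recover the caller's registers for one frame, or None at the end
/// of the chain (or when the frame record is not readable).
fn unwind_one(rule: Rule, regs: Regs, mem: &HashMap<u64, u64>) -> Option<Regs> {
    match rule {
        // "Get the return address from lr and leave all other registers unchanged."
        Rule::NoFrameUseLr => Some(Regs { pc: regs.lr, ..regs }),
        Rule::UseFramePointer => {
            let caller_fp = *mem.get(&regs.fp)?;
            let return_address = *mem.get(&regs.fp.wrapping_add(8))?;
            // Simplification: lr is only consulted for the first frame.
            Some(Regs { pc: return_address, lr: return_address, fp: caller_fp })
        }
    }
}

/// Apply `first_rule` to the interrupted frame, then follow frame
/// records for every frame above it.
fn walk(first_rule: Rule, start: Regs, mem: &HashMap<u64, u64>) -> Vec<u64> {
    let mut frames = vec![start.pc];
    let mut regs = start;
    let mut rule = first_rule;
    // The depth cap guards against a corrupted, cyclic chain.
    while frames.len() < 64 {
        match unwind_one(rule, regs, mem) {
            Some(caller) => {
                frames.push(caller.pc);
                regs = caller;
                rule = Rule::UseFramePointer;
            }
            None => break,
        }
    }
    frames
}
```

For a thread interrupted inside a frameless leaf, starting with `Rule::NoFrameUseLr` recovers the immediate caller from lr before the frame record chain is followed. Starting with `Rule::UseFramePointer` on the same state silently drops that caller.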

If the leaf function stores 0x0 in the frame pointer register, we will have no way to get the former frame. Am I correct?

If it did that then yes, frame pointer unwinding would not work at all and there would be no way to get the rest of the stack. You would need to use some kind of unwind information to recover a usable frame pointer value. But luckily these functions leave the frame pointer from the parent frame intact.

(Or maybe I misunderstood). As shown in #116 (comment) , without the force-frame-pointer enabled, the stack is really short, which means we lost not only the caller’s frame.

Yes, in code compiled with -fomit-frame-pointer, the easy solution fails. Luckily macOS has established a culture of always enabling frame pointers.

I will try https://github.com/mstange/framehop/ these days 😄 , as #116 has already given us a chance to offer users more options for the unwinding method.

Great, please file issues if you run into any trouble. Framehop only solves a subset of the problem; it’s still up to the user to find where libraries are mapped in memory, to get their unwind section data, and to read the stack memory in a way that doesn’t cause segfaults. Framehop is mostly about speed, caching, handling multiple types of unwind data, and supporting the offline use case on different machines and architectures.
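For the "read the stack memory in a way that doesn’t cause segfaults" part, here is a minimal sketch of the offline variant: the stack bytes are copied at sampling time, and every later read is bounds-checked against that copy. The `StackSnapshot` type is an assumption made for this example and is not part of framehop; a live in-process reader would need a different mechanism.

```rust
use std::convert::{TryFrom, TryInto};

/// A copy of part of a thread's stack, taken at sampling time.
struct StackSnapshot {
    base: u64,      // address of bytes[0] in the profiled process
    bytes: Vec<u8>, // raw stack memory, lowest address first
}

impl StackSnapshot {
    /// Read a little-endian u64 at `addr`, or None if any byte of it
    /// falls outside the captured range.
    fn read_u64(&self, addr: u64) -> Option<u64> {
        // Addresses below the snapshot fail here.
        let offset = usize::try_from(addr.checked_sub(self.base)?).ok()?;
        let end = offset.checked_add(8)?;
        // Addresses past the end of the snapshot fail here.
        let chunk = self.bytes.get(offset..end)?;
        Some(u64::from_le_bytes(chunk.try_into().ok()?))
    }
}
```

An unwinder that gets its memory only through a reader like this can stop cleanly when it hits a bogus frame pointer, instead of dereferencing it.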

But this only solved half the problem for us (as you said, and also doesn’t support macOS). So I reimplemented a minimal subset of libunwind, to try to avoid the problems you said.

Of course, it also needs to be fully tested to verify that it works correctly…

Don’t worry. We have testing environments to run different complicated user payloads. It can be pretty helpful to discover problems and make it close to “battle-tested”.

@mornyx Sorry, I don’t understand how you implemented it. It sounds like magic. I don’t think it’s possible to do offline DWARF unwinding without saving the stack (e.g. perf does unwind offline, but it needs to copy and save the full current stack).

As far as I know, DWARF can only help you find where the callee-saved registers are stored on the stack. While the program keeps running, the stack changes, and I don’t know how you could find the remains of the earlier state.

Oops! I understand! You were using the ucontext as the starting point to only unwind the stack before the signal frame? Really cool 🍻 ! But I guess the problem isn’t only the signal frame (actually, the x86_64 unwind implementations can all handle the signal frame). The bigger problem is that the unwind implementation may call functions which are not allowed inside a signal handler, which causes weird behavior.
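One small piece of that problem, heap allocation inside the handler, can be avoided by recording frames into a fixed-size buffer. The sketch below shows only that idea; `Sample`, `MAX_DEPTH` and the value 32 are assumptions for the example, and this does nothing to make the unwinder itself safe to call from a signal handler.

```rust
/// Hypothetical depth limit for this sketch.
const MAX_DEPTH: usize = 32;

/// Fixed-size sample buffer, so recording a frame never allocates.
#[derive(Clone, Copy)]
struct Sample {
    frames: [u64; MAX_DEPTH],
    len: usize,
}

impl Sample {
    const fn new() -> Self {
        Sample { frames: [0; MAX_DEPTH], len: 0 }
    }

    /// Record one code address. Returns false (and drops the frame)
    /// once the buffer is full, instead of growing.
    fn push(&mut self, pc: u64) -> bool {
        if self.len == MAX_DEPTH {
            return false;
        }
        self.frames[self.len] = pc;
        self.len += 1;
        true
    }

    /// The frames recorded so far, innermost first.
    fn as_slice(&self) -> &[u64] {
        &self.frames[..self.len]
    }
}
```

Deep stacks get truncated with this layout, which is the price of never touching the allocator while the handler runs.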

Right~ This also seems to be possible by modifying nongnu/libunwind (by converting ucontext to unw_context_t instead of calling unw_getcontext() in the signal handler).

But this only solved half the problem for us (as you said, and also doesn’t support macOS). So I reimplemented a minimal subset of libunwind, to try to avoid the problems you said.

I suppose frame pointers cannot be used for inlined functions, while the DWARF sections can still be used to locate inlined functions.

I tested it on newer versions of macOS, and in fact the binary only contains the __unwind_info section and no longer an __eh_frame section. The point is that __unwind_info does not contain inline information either. However, backtrace::resolve() can still correctly invoke the callback for inlined functions.

The demo below confirms this point:

use unwind::{init_unwind_context, UnwindContext};

fn main() {
    func1_inlined();
}

#[inline(always)]
fn func1_inlined() {
    func2();
}

fn func2() {
    unsafe {
        let mut context = UnwindContext::default();
        init_unwind_context(&mut context as _);
        show(context.pc);
        // jump to next frame
        context.pc = *std::mem::transmute::<_, *const u64>(context.fp + 8);
        context.fp = *std::mem::transmute::<_, *const u64>(context.fp);
        show(context.pc);
    }
}

unsafe fn show(pc: u64) {
    println!("{:#x}", pc);
    backtrace::resolve(std::mem::transmute(pc), |s| {
        println!("{:?}", s.name());
    });
}

Output:

0x100a82ffc
Some(demo::func2::ha91bc1d34ffee8cf)
0x100a82fd4
Some(demo::func1_inlined::h45123ba64dc5a176)
Some(demo::main::h9c57e81ec6be060b)

Looking at 0x100a82fd4, we got two functions for one pointer, including the one which is inlined.