bcc: Userspace stack is not unwinded in most samples with offcputime.py

Hi,

I’m experimenting with tracing postgres using offcputime.py. However, userspace stack is not properly unwinded in most samples as shown in attached flamegraph [1], making the tool basically unusable. Postgres is compiled with -fno-omit-frame-pointer (and this is confirmed a bunch of correctly unwinded stacks). I thought that one reason for this might be glibc compiled with -omit-frame-pointer, because the last resolved symbol is epoll_pwait from libc.so. I took musl libc and compiled it with -fno-omit-frame-pointer, but the result is still the same. What can be other reasons for this and is there anything I can do about it?

Kernel version is 4.14.13.

[1] https://oc.postgrespro.ru/index.php/s/qDINPrRnlNqxXBd

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 22 (3 by maintainers)

Commits related to this issue

Most upvoted comments

@arssher , through some debugging. I am able to root cause the issue. The file is arch/x86/entry/entry_64.S. A link to latest stable 4.13 code https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/arch/x86/entry/entry_64.S?h=linux-4.13.y#n177 which is very similar to 4.14.13 code.

By studying the code, I found one workaround, enable kernel tracepoint raw_syscalls:sys_enter or raw_syscalls:sys_exit.

[root@localhost sys_enter]# pwd
/sys/kernel/debug/tracing/events/raw_syscalls/sys_enter
[root@localhost sys_enter]# echo 1 > enable

This will add TIF_SYSCALL_TRACEPOINT to every task in the system and effectively force slowpath for syscall processing in entry_64.S. I did not find an easy workaround than this yet. This does have overhead as it will flood the debugfs trace_pipe.

The issue has been fixed in upcoming 4.16 release and backported to stable release 4.14 and 4.15. So if you upgrade your kernel version from 4.14.13 to some later 4.14.x version, the problem will get fixed.

At the same time, I am studying whether we have easy way to fix the older kernels (4.13 or 4.9), or we could request the same patch (as in 4.14/4.15 stable release) back ported to 4.13/4.9.

The issue @palmtenor pointed to is actually a real issue since in offcputime.py, pid and user stack id is recorded in the same key.

struct key_t {
    u32 pid;
    u32 tgid;
    int user_stack_id;
    int kernel_stack_id;
    char name[TASK_COMM_LEN];
};

For example, initially, <pid=100, user_stack_id=10> is recorded, and later on user_stack_id=10 is reused and <pid=200, user_stack_id=10> is recorded. Later on, <pid=100, user_stack_id=10> will be incorrect. One way COULD be to prevent the second <pid=200, user_stack_id=10> by not allowing reuse. and warn user to increase the stack size to avoid collision…

Can you try the following patch:

diff --git a/tools/offcputime.py b/tools/offcputime.py
index ad540ee..264e4fe 100755
--- a/tools/offcputime.py
+++ b/tools/offcputime.py
@@ -199,9 +199,9 @@ bpf_text = bpf_text.replace('MINBLOCK_US_VALUE', str(args.min_block_time))
 bpf_text = bpf_text.replace('MAXBLOCK_US_VALUE', str(args.max_block_time))

 # handle stack args
-kernel_stack_get = "stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID)"
+kernel_stack_get = "stack_traces.get_stackid(ctx, 0)"
 user_stack_get = \
-    "stack_traces.get_stackid(ctx, BPF_F_REUSE_STACKID | BPF_F_USER_STACK)"
+    "stack_traces.get_stackid(ctx, BPF_F_USER_STACK)"
 stack_context = ""
 if args.user_stacks_only:
     stack_context = "user"

Basically, disable BPF_F_REUSE_STACKID in offcputime.py?