tensorflow: nsync ~per_thread() issue causing SIGSEGV in glibc __run_exit_handlers exit.c


System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I have linked against libtensorflow_cc.so but have used static linking
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 6.10 build / CentOS 7.4 runtime
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
TensorFlow installed from (source or binary): v1.12.0
TensorFlow version (use command below):
Python version: NA
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source): gcc 4.8.5
CUDA/cuDNN version: 10.0.130,7.4.2.24
GPU model and memory: GTX 1060

Describe the current behavior

Segfault at exit when unloading the TensorFlow plugin in Autodesk Flame 2020.0.

Error message

Program received signal SIGSEGV, Segmentation fault.
0x00007fa78f98adc0 in ?? ()

Stacktrace

(gdb) bt
#0  0x00007fa78f98adc0 in  ()
#1  0x00007faba74ceb19 in  () at /lib64/libstdc++.so.6
#2  0x00007faba6bc5b69 in __run_exit_handlers (status=0, listp=0x7faba6f526c8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:77
#3  0x00007faba6bc5bb7 in __GI_exit (status=<optimized out>) at exit.c:99
#4  0x000000000218b5e9 in  ()
#5  0x0000000000703ce9 in  ()
#6  0x000000000218a5ee in  ()
#7  0x000000000218a6f1 in  ()
#8  0x0000000000704358 in  ()
#9  0x00000000004d9eb5 in  ()
#10 0x00007faba6bae3d5 in __libc_start_main (main=
    0x4d72c0, argc=1, argv=0x7ffcb9bb5dd8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffcb9bb5dc8)
    at ../csu/libc-start.c:266
#11 0x00000000005c6ee1 in  ()

The symbol at 0x00007fa78f98adc0 is _ZN5nsync12_GLOBAL__N_110per_threadD2Ev, which c++filt demangles to:

nsync::(anonymous namespace)::per_thread::~per_thread()

see: https://github.com/google/nsync/blob/5e8b19a81e5729922629dd505daa651f6ffdf107/platform/c%2B%2B11/src/per_thread_waiter.cc#L31
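
The demangling step can also be done programmatically when c++filt is not at hand. The following is a minimal sketch, assuming GCC or Clang with the Itanium C++ ABI header <cxxabi.h>; the helper name demangle is illustrative and is not part of TensorFlow or nsync.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <memory>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so release it with free.
    std::unique_ptr<char, void (*)(void*)> name(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    if (status == 0 && name) {
        return std::string(name.get());
    }
    return std::string(mangled);
}
```

Passing the symbol from the backtrace, demangle("_ZN5nsync12_GLOBAL__N_110per_threadD2Ev"), should yield the same destructor name that c++filt reports.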

Describe the expected behavior

Close down cleanly.

Code to reproduce the issue

Not possible, as the Autodesk Flame framework is required.

Other info / logs

Looking at https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/tools/benchmark/benchmark_model.cc, there don't seem to be any special destructors used.

The plugin closes down cleanly in other runtime environments.

The Plugin can be in any state when exiting, but other destructors can be called earlier to clean up the session.

What destructors would have to be used to make sure that the nsync::(anonymous namespace)::per_thread::~per_thread() destructor cannot segfault when it is run by the glibc exit handlers?

About this issue

State: closed
Created 5 years ago
Comments: 20 (1 by maintainers)

Commits related to this issue

If possible, avoid C++11 thread_local for TSD, even in C++11 builds. And when it is unavoidable, attempt to handle the possibility of C++11's per_thread_waiter destructor being called multiple times ... — committed to google/nsync by m3bm3b 5 years ago
Update TensorFlow to use nsync version 1.22.0 Changes include: - slightly faster condition variables on the Mac https://github.com/google/nsync/releases/tag/1.21.0 - a fix for crashes in a C++11 t... — committed to tensorflow/tensorflow by tensorflower-gardener 5 years ago

Most upvoted comments

BOOM!

[Inferior 1 (process 7903) exited normally]

This issue has been hanging around my neck like a lead weight for at least six weeks. Thanks for your immediate solution.

samhodge on Aug 4, 2019

How frustrating.

A possible workaround for the moment is to call quick_exit() instead of exit().

I believe a fix for nsync will be to avoid using the C++ per_thread_waiter machinery (per_thread_waiter.cc) and instead use the POSIX machinery (per_thread_waiter.c), except on platforms where the latter is unavailable (which is just Windows, as far as I know). If building with bazel, that could be achieved by changing this line in the BUILD file:

    "//conditions:default": ["platform/c++11/src/per_thread_waiter.cc"],

to:

    "//conditions:default": ["platform/posix/src/per_thread_waiter.c"],

Would you be able to test these possibilities?
Will one of them work for you temporarily, until I can implement the second fix in nsync?
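
To illustrate what "the POSIX machinery" means in general terms, here is a minimal sketch of thread-specific data built on pthread_key_create() rather than C++11 thread_local. It is not nsync's actual code: the Waiter struct and the current_waiter() function are hypothetical names chosen for this example. The relevant difference is that a POSIX key destructor runs when a thread exits, and is not one of the handlers that exit() runs on the main thread.

```cpp
#include <pthread.h>

#include <cstdlib>

// Hypothetical per-thread state; stands in for whatever a library keeps
// per thread.
struct Waiter {
    int wakeups = 0;
};

namespace {

pthread_key_t waiter_key;
pthread_once_t waiter_key_once = PTHREAD_ONCE_INIT;

// Called by the threads library at thread exit for each thread whose
// value for waiter_key is non-null.
void destroy_waiter(void* p) { delete static_cast<Waiter*>(p); }

// Creates the key exactly once per process.
void make_key() {
    if (pthread_key_create(&waiter_key, destroy_waiter) != 0) {
        std::abort();
    }
}

}  // namespace

// Returns the calling thread's Waiter, creating it on first use.
Waiter* current_waiter() {
    pthread_once(&waiter_key_once, make_key);
    Waiter* w = static_cast<Waiter*>(pthread_getspecific(waiter_key));
    if (w == nullptr) {
        w = new Waiter();
        if (pthread_setspecific(waiter_key, w) != 0) {
            std::abort();
        }
    }
    return w;
}
```

Repeated calls to current_waiter() on one thread return the same object, and each thread gets its own.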

If this is caused by libraries being pulled into the address space multiple times, we will have to be careful with other libraries too.

By the way, it’s always risky to call exit() in a multithreaded C++ programme; it’s safer to call quick_exit(). That’s because invoking the destructor of any static variable could cause another running thread to crash, as it may be about to access that variable. You can’t get around this safely by forcibly suspending all the threads first, because that could cause the destructors to deadlock the exiting thread.
And you can’t shut down all the threads cleanly, first because you can’t even name them, and second because there may not be a deadlock-free shutdown order in general. That’s why quick_exit() exists. Alas, quick_exit() was defined not only to avoid running static destructors, but also to fail to run explicit atexit() handlers, so you may have to flush stdout/stderr before calling quick_exit().

m3bm3b on Aug 3, 2019
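
The advice above to flush output and then call quick_exit() can be sketched as follows. This is a minimal illustration, assuming a C++11 standard library that provides std::quick_exit; the helper names flush_all_streams and flush_and_quick_exit are hypothetical and are not part of TensorFlow or nsync.

```cpp
#include <cstdio>
#include <cstdlib>
#include <iostream>

// Hypothetical helper: flushes the C++ standard streams and every open C
// output stream. Returns true if the C-level flush succeeded.
bool flush_all_streams() {
    std::cout.flush();
    std::cerr.flush();
    // fflush(nullptr) flushes all open output streams.
    return std::fflush(nullptr) == 0;
}

// Hypothetical helper: flush buffered output, then terminate without
// running static destructors or atexit() handlers.
[[noreturn]] void flush_and_quick_exit(int status) {
    flush_all_streams();
    std::quick_exit(status);
}
```

A program that controls its own shutdown would call flush_and_quick_exit(0) where it would otherwise call exit(0). Handlers registered with std::at_quick_exit() still run; static destructors and atexit() handlers do not.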