tensorflow: TensorFlow stopped working with custom ops built with GCC5

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub.

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): tf-nightly >= 20190321
  • Python version: Python 2, Python 3
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source): 5.4.0-6ubuntu1~16.04.11
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior

TensorFlow fails with a segmentation fault when using custom ops built with gcc5. The segmentation fault originates from the use of std::function on the plugin interface boundary, introduced in https://github.com/tensorflow/tensorflow/commit/41e7b3ca0abfd7e40a82ae38f96a9fbfeecb0ee5.

Describe the expected behavior

Custom ops built with gcc5 should continue to work with TF built with gcc4.

Code to reproduce the issue

$ docker run -it tensorflow/tensorflow:nightly
# apt install -y mpich
# pip install horovod
# cat > test.py
import tensorflow as tf
import horovod.tensorflow as hvd
hvd.init()
sess = tf.Session()
sess.run(hvd.allreduce(tf.constant(1.0)))
^D
# python test.py

Outputs:

root@011ee61f092e:/# python test.py
2019-03-24 00:25:51.898835: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-24 00:25:51.910390: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-03-24 00:25:51.914924: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4269380 executing computations on platform Host. Devices:
2019-03-24 00:25:51.914981: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
Segmentation fault (core dumped)
root@011ee61f092e:/#

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The reason for this issue is that the definition of std::function changed between gcc4 and gcc5:

gcc4: _M_invoke(const _Any_data& __functor, _ArgTypes... __args)
gcc5: _M_invoke(const _Any_data& __functor, _ArgTypes&&... __args)

While this change is nominally ABI-compatible, it produces a segfault when gcc4-compiled code calls a std::function defined in a gcc5-compiled plugin.

Short repro:

Prepare files:

$ cat > std_function_fw.h
#include <functional>
int call_me(std::function<int(int)> f);
^D
$ cat > std_function_fw.cc
#include "std_function_fw.h"
int call_me(std::function<int(int)> f) {
	return f(42);
}
^D
$ cat > std_function_client.cc
#include <iostream>
#include "std_function_fw.h"
int main(int argc, char **argv) {
	std::cout << call_me([](int val) { return val + 201808; }) << std::endl;
}
^D

Mount the files into a gcc4 Docker container (e.g. debian:jessie) and build:

# g++ --std=c++11 -fPIC std_function_fw.cc -shared -o libstd_function_fw.so
# g++ --std=c++11 -fPIC std_function_client.cc -o std_function_client -lstd_function_fw -L.
# LD_LIBRARY_PATH=. ./std_function_client
<will work>

Mount the files into a gcc5 Docker container (e.g. ubuntu:16.04), keeping the gcc4-built libstd_function_fw.so:

# g++ --std=c++11 -fPIC std_function_client.cc -o std_function_client -lstd_function_fw -L.
# LD_LIBRARY_PATH=. ./std_function_client
<will crash>

The proposed solution is to revert https://github.com/tensorflow/tensorflow/commit/41e7b3ca0abfd7e40a82ae38f96a9fbfeecb0ee5 and keep using plain function pointers on the plugin interface boundary, without std::function.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (20 by maintainers)

Most upvoted comments

C API sounds great! I’d love not to have to worry about C++ ABI compatibility anymore. Happy to be one of the early adopters.

Is there any update on whether reverting https://github.com/tensorflow/tensorflow/commit/41e7b3ca0abfd7e40a82ae38f96a9fbfeecb0ee5 would impede the progress of that work? While I understand that the C++ ABI has few theoretical guarantees across compiler versions, in practice this compatibility has worked fine for the last two years for the vast majority of users. _GLIBCXX_USE_CXX11_ABI does a pretty good job of allowing backward compatibility, and std::function is one of the outliers rather than a common pattern.

Asking users to compile plugins like Horovod in a custom-op container is a pretty bad user experience, because Horovod requires dependencies such as CUDA, NCCL, and MPI to match the versions used on the host.

Because of that, it'd be great if we could keep using the existing C++ API, with the revert, until we can cut over to the C API.

We can consider reverting, but if you use std::string you just won't be able to compile an add-on to TF using GCC 5 or more recent. Our custom-op guide provides a Docker container which guarantees that if you use it to compile your add-ons, the compiled objects will work.

Considering @sjamesr's work is to get us to a fully ABI-compatible state, I will check with him before rolling back, to see how important this change is to his greater task.