tensorflow: Attempting to use the CPU Work Sharder segfaults on g++ 5.4.0
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
I’ve adapted the ZeroOut operator from the Adding a New Op example.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Linux Ubuntu 16.04
- TensorFlow installed from (source or binary):
binary GPU 1.3.0
- TensorFlow version (use command below):
$ python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.3.0-rc2-20-g0787eee', '1.3.0')
- Python version:
2.7.12
- Bazel version (if compiling from source):
N/A
- CUDA/cuDNN version:
N/A
- GPU model and memory:
N/A
- Exact command to reproduce:
Test operator: shard_fails.zip
$ make
$ python test_op.py
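The Makefile itself is in the zip; for TF 1.3 it presumably reduces to the standard compile line from the Adding a New Op guide, run with the system g++ 5.4.0 (a sketch, with the tf_op.cpp/tfop.so names taken from the gdb trace below):
$ TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
$ g++ -std=c++11 -shared tf_op.cpp -o tfop.so -fPIC -I $TF_INC -O2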
Describe the problem
When the C++ operator below runs, it prints the number of threads in the pool (8) and then segfaults inside the Shard call.
Source code / logs
C++ operator code:
#define EIGEN_USE_THREADS

#include "tensorflow/core/lib/core/threadpool.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/shape_inference.h"
#include "tensorflow/core/util/work_sharder.h"

using namespace tensorflow;

REGISTER_OP("ZeroOut")
    .Input("to_zero: int32")
    .Output("zeroed: int32")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      c->set_output(0, c->input(0));
      return Status::OK();
    });

class ZeroOutOp : public OpKernel {
 public:
  explicit ZeroOutOp(OpKernelConstruction* context) : OpKernel(context) {}

  void Compute(OpKernelContext* context) override {
    // Grab the input tensor.
    const Tensor& input_tensor = context->input(0);
    auto input = input_tensor.flat<int32>();

    // Create an output tensor of the same shape.
    Tensor* output_tensor = NULL;
    OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(),
                                                     &output_tensor));
    auto output_flat = output_tensor->flat<int32>();

    // Set all but the first element of the output tensor to 0, sharding
    // the zeroing loop over the device's CPU worker thread pool.
    const int N = input.size();
    auto pool = context->device()->tensorflow_cpu_worker_threads()->workers;
    printf("Pool Threads %d\n", pool->NumThreads());

    // Segfaults inside this call (see the gdb trace below).
    Shard(pool->NumThreads(), pool, N, /*cost_per_unit=*/10,
          [&](int64 start, int64 end) {
            for (int64 i = start; i < end; ++i) {
              output_flat(i) = 0;
            }
          });

    if (N > 0) {
      output_flat(0) = input(0);
    }
  }
};

REGISTER_KERNEL_BUILDER(Name("ZeroOut").Device(DEVICE_CPU), ZeroOutOp);
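For reference, the Shard entry point called above has roughly this shape in the 1.3 source tree (paraphrased from tensorflow/core/util/work_sharder.h; the comment is mine):

// Divides [0, total) into pieces and runs work(start, end) for each piece
// on the given thread pool; cost_per_unit (estimated cost per element)
// controls how finely the range is sharded.
void Shard(int max_parallelism, thread::ThreadPool* workers, int64 total,
           int64 cost_per_unit, std::function<void(int64, int64)> work);

It is exactly this std::function argument, constructed inside tfop.so but invoked from _pywrap_tensorflow_internal.so, that crosses the shared-library boundary (see frames #0 and #1 in the trace below).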
The gdb backtrace from the resulting core dump:
Core was generated by `python test_op.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 std::_Function_handler<void (long long, long long), ZeroOutOp::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#1}>::_M_invoke(std::_Any_data const&, long long&&, std::_Any_data const&) (__functor=..., __args#0=<unknown type in tfop.so, CU 0x0, DIE 0x41c73>, __args#1=<unknown type in tfop.so, CU 0x0, DIE 0x41c78>) at /usr/include/c++/5/functional:1871
1871 (*_Base::_M_get_pointer(__functor))(
[Current thread is 1 (Thread 0x7f38a6605700 (LWP 3771))]
(gdb) bt
#0 std::_Function_handler<void (long long, long long), ZeroOutOp::Compute(tensorflow::OpKernelContext*)::{lambda(long long, long long)#1}>::_M_invoke(std::_Any_data const&, long long&&, std::_Any_data const&) (__functor=..., __args#0=<unknown type in tfop.so, CU 0x0, DIE 0x41c73>, __args#1=<unknown type in tfop.so, CU 0x0, DIE 0x41c78>) at /usr/include/c++/5/functional:1871
#1 0x00007f3879dcc75d in tensorflow::thread::ThreadPool::Impl::ParallelFor(long long, long long, std::function<void (long long, long long)>) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007f3879dcc93f in tensorflow::thread::ThreadPool::ParallelFor(long long, long long, std::function<void (long long, long long)>) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007f3879d4d995 in tensorflow::Shard(int, tensorflow::thread::ThreadPool*, long long, long long, std::function<void (long long, long long)>) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007f3850bfb79e in ZeroOutOp::Compute (this=0x62dacc0, context=0x7ffd1a20fe30) at tf_op.cpp:42
#5 0x00007f3879a2563c in tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007f38799f5a58 in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007f38799f61fa in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x00007f3879a035c4 in std::_Function_handler<void (std::function<void ()>), tensorflow::GraphRunner::Run(tensorflow::Graph*, tensorflow::FunctionLibraryRuntime*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*)::{lambda(std::function<void ()>)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()>) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#9 0x00007f38799e895b in std::function<void (std::function<void ()>)>::operator()(std::function<void ()>) const ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#10 0x00007f38799e9043 in tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) [clone .part.246] () from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#11 0x00007f38799ecf5e in tensorflow::(anonymous namespace)::ExecutorImpl::RunAsync(tensorflow::Executor::Args const&, std::function<void (tensorflow::Status const&)>) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#12 0x00007f3879a045e4 in tensorflow::GraphRunner::Run(tensorflow::Graph*, tensorflow::FunctionLibraryRuntime*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#13 0x00007f38799dce27 in tensorflow::ConstantFold(tensorflow::ConstantFoldingOptions const&, tensorflow::FunctionLibraryRuntime*, tensorflow::Env*, tensorflow::Device*, tensorflow::Graph*, bool*) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#14 0x00007f3879a02fea in tensorflow::GraphOptimizer::Optimize(tensorflow::FunctionLibraryRuntime*, tensorflow::Env*, tensorflow::Device*, std::unique_ptr<tensorflow::Graph, std::default_delete<tensorflow::Graph> >*) () from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#15 0x00007f38799a9469 in tensorflow::DirectSession::GetOrCreateExecutors(tensorflow::thread::ThreadPool*, tensorflow::gtl::ArraySlice<std::string>, tensorflow::gtl::ArraySlice<std::string>, tensorflow::gtl::ArraySlice<std::string>, tensorflow::DirectSession::ExecutorsAndKeys**, tensorflow::DirectSession::RunStateArgs*) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007f38799aa06c in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#17 0x00007f387799b2d7 in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#18 0x00007f387799b604 in TF_Run () from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#19 0x00007f38778037e2 in tensorflow::TF_Run_wrapper_helper(TF_DeprecatedSession*, char const*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#20 0x00007f3877803be1 in tensorflow::TF_Run_wrapper(TF_DeprecatedSession*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) () from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007f38777ca793 in _wrap_TF_Run () from /home/sperkins/venv/mb/local/lib/python2.7/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#22 0x00000000004c468a in PyEval_EvalFrameEx ()
#23 0x00000000004c2765 in PyEval_EvalCodeEx ()
#24 0x00000000004de6fe in ?? ()
#25 0x00000000004b0cb3 in PyObject_Call ()
#26 0x00000000004c6ad1 in PyEval_EvalFrameEx ()
#27 0x00000000004c2765 in PyEval_EvalCodeEx ()
#28 0x00000000004ca8d1 in PyEval_EvalFrameEx ()
#29 0x00000000004c2765 in PyEval_EvalCodeEx ()
#30 0x00000000004ca8d1 in PyEval_EvalFrameEx ()
#31 0x00000000004c2765 in PyEval_EvalCodeEx ()
#32 0x00000000004ca8d1 in PyEval_EvalFrameEx ()
#33 0x00000000004c2765 in PyEval_EvalCodeEx ()
#34 0x00000000004ca099 in PyEval_EvalFrameEx ()
#35 0x00000000004c2765 in PyEval_EvalCodeEx ()
#36 0x00000000004c2509 in PyEval_EvalCode ()
#37 0x00000000004f1def in ?? ()
#38 0x00000000004ec652 in PyRun_FileExFlags ()
#39 0x00000000004eae31 in PyRun_SimpleFileExFlags ()
#40 0x000000000049e14a in Py_Main ()
#41 0x00007f38a5e4c830 in __libc_start_main (main=0x49dab0 <main>, argc=2, argv=0x7ffd1a213618, init=<optimised out>, fini=<optimised out>, rtld_fini=<optimised out>, stack_end=0x7ffd1a213608)
at ../csu/libc-start.c:291
#42 0x000000000049d9d9 in _start ()
About this issue
- State: closed
- Created 7 years ago
- Reactions: 1
- Comments: 34 (25 by maintainers)
I did a bit of investigation, and this particular failure seems to be due to a change in std::function between GCC 4.x and 5.x. Interestingly, the change does not alter any mangled symbol names, so there is no linking error; the mismatch only surfaces as a crash when the std::function is invoked. The failure could be fixed by swapping the std::function types for function pointers, but for now the bottom line is that a custom op must be compiled with the same GCC/libstdc++ version as TF itself.

This still occurs on tensorflow 1.9.0 and Ubuntu 16.04 with g++ 5.4.0.
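One way to apply that same-compiler workaround on Ubuntu 16.04 is to build the op with gcc 4.8 (a sketch; it assumes the stock TF 1.x pip wheels of this era were built with the gcc 4.8 toolchain):
$ sudo apt-get install g++-4.8
$ TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')
$ g++-4.8 -std=c++11 -shared tf_op.cpp -o tfop.so -fPIC -I $TF_INC -O2
Note that -D_GLIBCXX_USE_CXX11_ABI=0 on its own does not help here: that macro controls the dual std::string/std::list ABI, not the std::function invoker signature that changed between GCC 4.x and 5.x.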
@superbobry, thanks - you just saved us from spending a lot of time in uber/horovod#542 digging through this.