tensorflow: Segmentation fault in _pywrap_tensorflow_internal.so

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS (Xenial)
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): (‘v1.4.0-19-ga52c8d9’, ‘1.4.1’)
  • Python version: 2.7.12
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: Cuda8, cudnn6
  • GPU model and memory: Titan Xp (with driver 387.26)
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

== cat /etc/issue ===============================================
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


== uname -a =====================================================
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy (1.14.0)
protobuf (3.5.1)
tensorflow-gpu (1.4.1)
tensorflow-tensorboard (0.4.0)

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.4.1
tf.GIT_VERSION = v1.4.0-19-ga52c8d9
tf.COMPILER_VERSION = v1.4.0-19-ga52c8d9
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH :/usr/local/cuda/lib64/
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Tue Jan 23 11:09:12 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 40%   66C    P2   182W / 250W |  11763MiB / 12189MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:05:00.0 Off |                  N/A |
| 23%   30C    P8     8W / 250W |  11591MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 28%   48C    P0    62W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 26%   46C    P0    63W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN Xp            Off  | 00000000:08:00.0 Off |                  N/A |
| 26%   46C    P0    63W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN Xp            Off  | 00000000:0C:00.0 Off |                  N/A |
| 23%   43C    P0    62W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN Xp            Off  | 00000000:0E:00.0 Off |                  N/A |
| 25%   44C    P0    62W / 250W |      0MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN Xp            Off  | 00000000:0F:00.0 Off |                  N/A |
| 42%   69C    P2   167W / 250W |  11833MiB / 12189MiB |     31%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     19682      C   /home/peiliang/tensorflow/bin/python       11751MiB |
|    1     19682      C   /home/peiliang/tensorflow/bin/python       11579MiB |
|    7     27581      C   python                                     11823MiB |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.1/lib64/libcudart.so.9.1.85
/usr/local/cuda-9.1/lib64/libcudart_static.a
/usr/local/cuda-9.1/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.1/doc/man/man7/libcudart.7

You can obtain the TensorFlow version with

python -c “import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)”

Describe the problem

My custom learning code works perfectly on my older workstation with 2 GPU cards. But am having issue with our new workstation which has 8 GPU cards. I get a Segmentation fault.

Source code / logs

The entire source code is: https://github.com/mpkuse/cartwheel_train/tree/config-files

The main-script is train_netvlad.py. Currently my learning data is private, if you really need it to test, I can provide the data as well (~100 GB).

My code basically builds a network with tf.slim. I have a custom operations to build a layer. Have a custom loss function. Can be found in CartWheelFlow.py/ class VGGDescriptor. It uses tf.while. Data is managed by class TimeMachineRender

stack trace for the crash.

$ gdb --args python train_netvlad.py -t tfsuper.logs/test 
(gdb) run
.
.
.
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffef7f61a2c in std::__detail::_Map_base<std::string, std::pair<std::string const, unsigned long>, std::allocator<std::pair<std::string const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) where
#0  0x00007ffef7f61a2c in std::__detail::_Map_base<std::string, std::pair<std::string const, unsigned long>, std::allocator<std::pair<std::string const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#1  0x00007ffef7f6c4d9 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007ffef5ed94ea in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorfl---Type <return> to continue, or q <return> to quit---
ow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007ffef5ed9824 in TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007ffef5bf701a in tensorflow::TF_Run_wrapper_helper(TF_DeprecatedSession*, char const*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007ffef5bf7411 in tensorflow::TF_Run_wrapper(TF_DeprecatedSession*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/---Type <return> to continue, or q <return> to quit---
_pywrap_tensorflow_internal.so
#6  0x00007ffef5bbb6f1 in _wrap_TF_Run ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00000000004c45fa in PyEval_EvalFrameEx ()
#8  0x00000000004c2705 in PyEval_EvalCodeEx ()
#9  0x00000000004de69e in ?? ()
#10 0x00000000004b0c93 in PyObject_Call ()
#11 0x00000000004c6ef6 in PyEval_EvalFrameEx ()
#12 0x00000000004c2705 in PyEval_EvalCodeEx ()
#13 0x00000000004ca7df in PyEval_EvalFrameEx ()
#14 0x00000000004c2705 in PyEval_EvalCodeEx ()
#15 0x00000000004ca7df in PyEval_EvalFrameEx ()
#16 0x00000000004c2705 in PyEval_EvalCodeEx ()
#17 0x00000000004ca7df in PyEval_EvalFrameEx ()
#18 0x00000000004c2705 in PyEval_EvalCodeEx ()
#19 0x00000000004ca088 in PyEval_EvalFrameEx ()
#20 0x00000000004c2705 in PyEval_EvalCodeEx ()
#21 0x00000000004c24a9 in PyEval_EvalCode ()
#22 0x00000000004f19ef in ?? ()
#23 0x00000000004ec372 in PyRun_FileExFlags ()
#24 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#25 0x000000000049e208 in Py_Main ()
#26 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=4, 
    argv=0x7fffffffe558, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe548) at ../csu/libc-start.c:291
#27 0x000000000049da59 in _start ()

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Comments: 18 (5 by maintainers)

Most upvoted comments

I have the similar issue when I follow these steps https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md to train object detection. I use tensorflow object detection API and the model checkpoint is download from http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_kitti_2017_11_08.tar.gz. The pipeline config file and datasets are exactly same with the provide one. The train.py can work normally for several steps, and then segmentation fault. Sample logs:

INFO:tensorflow:global step 4457: loss = 0.0524 (0.523 sec/step)
INFO:tensorflow:global step 4458: loss = 0.0071 (0.789 sec/step)
INFO:tensorflow:global step 4459: loss = 0.4697 (0.677 sec/step)
INFO:tensorflow:global step 4460: loss = 0.0488 (0.396 sec/step)
INFO:tensorflow:global step 4461: loss = 0.0560 (0.637 sec/step)
INFO:tensorflow:global step 4462: loss = 0.0424 (0.588 sec/step)
INFO:tensorflow:global step 4463: loss = 0.0227 (0.525 sec/step)
INFO:tensorflow:global step 4464: loss = 0.0826 (0.693 sec/step)

Thread 157 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff7b4ff9700 (LWP 19993)]
0x00007fff4d5cdd80 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
(gdb) where
#0  0x00007fff4d5cdd80 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#1  0x00007fff4d5cde25 in std::_Hashtable<std::string, std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*>, std::allocator<std::pair<std::string const, std::unordered_map<std::pair<unsigned long long, std::string>, tensorflow::ResourceBase*, tensorflow::ResourceMgr::KeyHash, tensorflow::ResourceMgr::KeyEqual, std::allocator<std::pair<std::pair<unsigned long long, std::string> const, tensorflow::ResourceBase*> > >*> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::find(std::string const&) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#2  0x00007fff4d5ce6e1 in tensorflow::ResourceMgr::DoLookup(std::string const&, std::type_index, std::string const&, tensorflow::ResourceBase**) const () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#3  0x00007fff4fbd36e6 in tensorflow::GetStack(tensorflow::OpKernelContext*, tensorflow::Stack**) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007fff4fbd439d in tensorflow::StackPushOp<Eigen::GpuDevice>::ComputeAsync(tensorflow::OpKernelContext*, std::function<void ()>) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007fff4da034ea in tensorflow::BaseGPUDevice::ComputeAsync(tensorflow::AsyncOpKernel*, tensorflow::OpKernelContext*, std::function<void ()>) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#6  0x00007fff4da3a17b in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#7  0x00007fff4da3a3ea in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(tensorflow::gtl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8> const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#8  0x00007fff4d6d7782 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#9  0x00007fff4d6d6832 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python2.7/dist-packages/tensorflow/python/../libtensorflow_framework.so
#10 0x00007fff462dfc80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007ffff7bc16ba in start_thread (arg=0x7ff7b4ff9700) at pthread_create.c:333
#12 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS (Xenial)

  • TensorFlow installed from (source or binary): Binary

  • TensorFlow version (use command below): (‘v1.4.0-19-ga52c8d9’, ‘1.4.1’)

  • Python version: 2.7.12

  • CUDA/cuDNN version: Cuda8, cudnn6

  • GPU model and memory: Titan Xp (with driver 387.26)