tensorflow: Segmentation fault in _pywrap_tensorflow_internal.so
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS (Xenial)
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): (‘v1.4.0-19-ga52c8d9’, ‘1.4.1’)
- Python version: 2.7.12
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: Cuda8, cudnn6
- GPU model and memory: Titan Xp (with driver 387.26)
- Exact command to reproduce:
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
== cat /etc/issue ===============================================
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
VERSION="16.04.3 LTS (Xenial Xerus)"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
== are we in docker =============================================
No
== compiler =====================================================
c++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux ubuntu 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy (1.14.0)
protobuf (3.5.1)
tensorflow-gpu (1.4.1)
tensorflow-tensorboard (0.4.0)
== check for virtualenv =========================================
False
== tensorflow import ============================================
tf.VERSION = 1.4.1
tf.GIT_VERSION = v1.4.0-19-ga52c8d9
tf.COMPILER_VERSION = v1.4.0-19-ga52c8d9
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH :/usr/local/cuda/lib64/
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
Tue Jan 23 11:09:12 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26 Driver Version: 387.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 40% 66C P2 182W / 250W | 11763MiB / 12189MiB | 73% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:05:00.0 Off | N/A |
| 23% 30C P8 8W / 250W | 11591MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:06:00.0 Off | N/A |
| 28% 48C P0 62W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:07:00.0 Off | N/A |
| 26% 46C P0 63W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 TITAN Xp Off | 00000000:08:00.0 Off | N/A |
| 26% 46C P0 63W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 TITAN Xp Off | 00000000:0C:00.0 Off | N/A |
| 23% 43C P0 62W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 TITAN Xp Off | 00000000:0E:00.0 Off | N/A |
| 25% 44C P0 62W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 TITAN Xp Off | 00000000:0F:00.0 Off | N/A |
| 42% 69C P2 167W / 250W | 11833MiB / 12189MiB | 31% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 19682 C /home/peiliang/tensorflow/bin/python 11751MiB |
| 1 19682 C /home/peiliang/tensorflow/bin/python 11579MiB |
| 7 27581 C python 11823MiB |
+-----------------------------------------------------------------------------+
== cuda libs ===================================================
/usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
/usr/local/cuda-8.0/lib64/libcudart_static.a
/usr/local/cuda-8.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-8.0/doc/man/man7/libcudart.7
/usr/local/cuda-9.1/lib64/libcudart.so.9.1.85
/usr/local/cuda-9.1/lib64/libcudart_static.a
/usr/local/cuda-9.1/doc/man/man7/libcudart.so.7
/usr/local/cuda-9.1/doc/man/man7/libcudart.7
You can obtain the TensorFlow version with
python -c “import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)”
Describe the problem
My custom learning code works perfectly on my older workstation with 2 GPU cards. But am having issue with our new workstation which has 8 GPU cards. I get a Segmentation fault.
Source code / logs
The entire source code is: https://github.com/mpkuse/cartwheel_train/tree/config-files
The main-script is train_netvlad.py. Currently my learning data is private,
if you really need it to test, I can provide the data as well (~100 GB).
My code basically builds a network with tf.slim. I have a custom operations to build a layer. Have a custom loss function. Can be found in CartWheelFlow.py/ class VGGDescriptor. It uses tf.while.
Data is managed by class TimeMachineRender
stack trace for the crash.
$ gdb --args python train_netvlad.py -t tfsuper.logs/test
(gdb) run
.
.
.
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007ffef7f61a2c in std::__detail::_Map_base<std::string, std::pair<std::string const, unsigned long>, std::allocator<std::pair<std::string const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
(gdb) where
#0 0x00007ffef7f61a2c in std::__detail::_Map_base<std::string, std::pair<std::string const, unsigned long>, std::allocator<std::pair<std::string const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#1 0x00007ffef7f6c4d9 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2 0x00007ffef5ed94ea in TF_Run_Helper(tensorflow::Session*, char const*, TF_Buffer const*, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorfl---Type <return> to continue, or q <return> to quit---
ow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, TF_Tensor**, std::vector<std::string, std::allocator<std::string> > const&, TF_Buffer*, TF_Status*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3 0x00007ffef5ed9824 in TF_Run ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007ffef5bf701a in tensorflow::TF_Run_wrapper_helper(TF_DeprecatedSession*, char const*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007ffef5bf7411 in tensorflow::TF_Run_wrapper(TF_DeprecatedSession*, TF_Buffer const*, _object*, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/---Type <return> to continue, or q <return> to quit---
_pywrap_tensorflow_internal.so
#6 0x00007ffef5bbb6f1 in _wrap_TF_Run ()
from /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00000000004c45fa in PyEval_EvalFrameEx ()
#8 0x00000000004c2705 in PyEval_EvalCodeEx ()
#9 0x00000000004de69e in ?? ()
#10 0x00000000004b0c93 in PyObject_Call ()
#11 0x00000000004c6ef6 in PyEval_EvalFrameEx ()
#12 0x00000000004c2705 in PyEval_EvalCodeEx ()
#13 0x00000000004ca7df in PyEval_EvalFrameEx ()
#14 0x00000000004c2705 in PyEval_EvalCodeEx ()
#15 0x00000000004ca7df in PyEval_EvalFrameEx ()
#16 0x00000000004c2705 in PyEval_EvalCodeEx ()
#17 0x00000000004ca7df in PyEval_EvalFrameEx ()
#18 0x00000000004c2705 in PyEval_EvalCodeEx ()
#19 0x00000000004ca088 in PyEval_EvalFrameEx ()
#20 0x00000000004c2705 in PyEval_EvalCodeEx ()
#21 0x00000000004c24a9 in PyEval_EvalCode ()
#22 0x00000000004f19ef in ?? ()
#23 0x00000000004ec372 in PyRun_FileExFlags ()
#24 0x00000000004eaaf1 in PyRun_SimpleFileExFlags ()
#25 0x000000000049e208 in Py_Main ()
#26 0x00007ffff7810830 in __libc_start_main (main=0x49db30 <main>, argc=4,
argv=0x7fffffffe558, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffe548) at ../csu/libc-start.c:291
#27 0x000000000049da59 in _start ()
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 18 (5 by maintainers)
I have the similar issue when I follow these steps https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md to train object detection. I use tensorflow object detection API and the model checkpoint is download from http://download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_kitti_2017_11_08.tar.gz. The pipeline config file and datasets are exactly same with the provide one. The train.py can work normally for several steps, and then segmentation fault. Sample logs:
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS (Xenial)
TensorFlow installed from (source or binary): Binary
TensorFlow version (use command below): (‘v1.4.0-19-ga52c8d9’, ‘1.4.1’)
Python version: 2.7.12
CUDA/cuDNN version: Cuda8, cudnn6
GPU model and memory: Titan Xp (with driver 387.26)