tensorflow: TensorFlow hangs without explanation; some threads stay busy.

I’ve got a machine with a GTX 980 Ti, 32 GB of RAM (about half of which is used by the data loading pipeline), and an i7-5930K CPU driving a network. I’m running Ubuntu 14.04 with driver x86_64-361.45.11, CUDA 7.5.18 and cuDNN linux-x64-v4.0-prod installed. Every 30 or so iterations the run method hangs: 2 to 5 virtual cores stay busy while the GPU does nothing more than idle. I thought it hung indefinitely, but when I ran it overnight I saw that it actually completed what it was doing after some time. Pausing and continuing the process with gdb makes it continue immediately. I saw something like this with the initial release of 0.6.0, but it was fixed by 0.6.1, so I just pulled from master and assumed I would never see it again. Inspecting the root process just shows that it’s waiting in the run method:

(gdb) bt full
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x00007f29299964dc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#2  0x00007f292b46a883 in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, long long) ()
   from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#3  0x00007f292b474800 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#4  0x00007f292b549c81 in TF_Run_Helper(TF_Session*, char const*, TF_Buffer const*, char const**, TF_Tensor**, int, char const**, TF_Tensor**, int, char const**, int, TF_Buffer*, TF_Status*) ()
   from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#5  0x00007f292b54a0f1 in TF_Run () from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#6  0x00007f292a5787b2 in tensorflow::TF_Run_wrapper_helper(TF_Session*, char const*, TF_Buffer const*, tensorflow::gtl::InlinedVector<std::pair<char const*, tagPyArrayObject*>, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#7  0x00007f292a578de1 in tensorflow::TF_Run_wrapper(TF_Session*, TF_Buffer const*, tensorflow::gtl::InlinedVector<std::pair<char const*, tagPyArrayObject*>, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, tensorflow::gtl::InlinedVector<char const*, 8> const&, TF_Status*, tensorflow::gtl::InlinedVector<_object*, 8>*, TF_Buffer*) ()
   from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#8  0x00007f292a5666c8 in _wrap_TF_Run () from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
...

Inspecting one of the busy threads is equally uninteresting:

(gdb) bt full
#0  0x00007f293767e3f7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1  0x00007f292b76aa60 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(unsigned int) ()
   from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#2  0x00007f292b769f52 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
   from /usr/local/lib/python3.4/dist-packages/tensorflow/python/_pywrap_tensorflow.so
No symbol table info available.
#3  0x00007f2929999a60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#4  0x00007f2937980182 in start_thread (arg=0x7f28657fa700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f28657fa700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139811478284032, -825179003405916230, 1, 0, 139811478284736, 139811478284032, 782688377649571770, 783080021823268794}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007f29376ad47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

Finally, the CPU usage reported by htop doesn’t seem to be fully accounted for by the displayed processes: 5 virtual cores might be busy, but the processes are only credited with 0-300% CPU usage.
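
The actual model and input code aren’t included in this report, but the shape of the pipeline described above (a background thread keeping a preloading queue filled, the graph dequeuing from it, and one Session.run call per training iteration) is roughly the following. Everything here is a hypothetical stand-in with made-up names and sizes; only the structure matters, and the final sess.run corresponds to frame #3 (DirectSession::Run) in the backtrace above.

import threading
import numpy as np
import tensorflow as tf  # TF 0.x-era API, matching the versions in this report

# Stand-in graph: a FIFOQueue is preloaded on a background thread and the
# training step dequeues a batch from it.
queue = tf.FIFOQueue(capacity=16, dtypes=[tf.float32], shapes=[[32, 10]])
batch_in = tf.placeholder(tf.float32, shape=[32, 10])
enqueue_op = queue.enqueue(batch_in)

batch = queue.dequeue()
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(batch, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

def feeder(sess):
    # Background thread that keeps the queue filled with (random) batches.
    while True:
        sess.run(enqueue_op,
                 feed_dict={batch_in: np.random.rand(32, 10).astype(np.float32)})

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    threading.Thread(target=feeder, args=(sess,), daemon=True).start()
    for step in range(1000):
        # This is the call that intermittently stalls for a long time.
        sess.run(train_op)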

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 1
  • Comments: 41 (10 by maintainers)

Most upvoted comments

You could also create all your sessions with operation_timeout_in_ms=60000 in your ConfigProto, so that they crash with DeadlineExceeded instead of hanging forever when your queues get full/empty.
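
A minimal sketch of that suggestion (the queue here is only a stand-in to force a blocking run call):

import tensorflow as tf  # TF 0.x/1.x-style API, as used in this thread

# Any Session.run call that blocks for longer than 60 s, e.g. on a full or
# empty queue, raises tf.errors.DeadlineExceededError instead of hanging.
config = tf.ConfigProto(operation_timeout_in_ms=60000)

queue = tf.FIFOQueue(capacity=1, dtypes=[tf.float32])
dequeue_op = queue.dequeue()

with tf.Session(config=config) as sess:
    try:
        sess.run(dequeue_op)  # nothing was enqueued, so this blocks, then times out
    except tf.errors.DeadlineExceededError:
        print("run() timed out instead of hanging")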

On Fri, Jun 10, 2016 at 2:21 PM, alexatknit notifications@github.com wrote:

I do have a queue for preloading data, though the model is hanging after everything has executed, meaning that it already successfully executed the dequeue node. I’ll try it without preloading to remove any queue ops from the graph and see what happens.
