tensorforce: Quickstart example gets stuck [GPU]

Hi,

I just installed tensorforce (from pip) with tensorflow-gpu 1.7 and tried to run example/quickstart.py. Training starts but then gets stuck after n episodes, where n is the minimum of the batch_size and frequency values in the update_mode argument of PPOAgent.

update_mode=dict(
    unit='episodes',
    # 20 episodes per update
    batch_size=20,
    # Every 20 episodes
    frequency=20
),

No error message is displayed, it just hangs forever. Has anyone experienced something similar?
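
To make the observation above concrete, here is a minimal sketch of the episode count at which the hang is reported. The helper name `reported_hang_episode` is hypothetical and not part of tensorforce; it only encodes what the report says (n is the smaller of batch_size and frequency) and makes no claim about how tensorforce schedules updates internally.

```python
def reported_hang_episode(update_mode):
    """Return the episode count after which the hang was observed, per the report.

    Hypothetical helper: it mirrors the reporter's description only.
    """
    if update_mode.get('unit') != 'episodes':
        raise ValueError("this sketch only covers unit='episodes'")
    return min(update_mode['batch_size'], update_mode['frequency'])


# Same values as in the snippet above.
update_mode = dict(unit='episodes', batch_size=20, frequency=20)
print(reported_hang_episode(update_mode))
```

With the values from the report, the hang would be expected right after the first 20 episodes, which is consistent with it coinciding with the first update.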

Thanks,

About this issue

State: closed
Created 6 years ago
Comments: 28 (13 by maintainers)

Most upvoted comments

Oh, I probably have the same issue, then. The timeout didn’t help; it is still stuck. If I press Ctrl+C, it just ignores it. I first hit the problem in another training run, but the hang is the same in the quickstart.

I tried attaching gdb to it, but the output is kind of hard to interpret. It looks like some kind of deadlock, possibly not a bug in tensorforce.


#0  0x00007f863dd73f09 in syscall () from /usr/lib/libc.so.6
#1  0x00007f861d6f776d in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00007f861d6f6ef1 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007f861d6f43e4 in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4  0x00007f861d6f48c3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5  0x00007f861d6fa62b in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6  0x00007f861d6fa67b in tensorflow::DirectSession::WaitForNotification(tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7  0x00007f861d6fe431 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*) ()
   from /home/sohakes/.conda/envs/aula-ml/lib/python3.6/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
...
#85 0x00007f863edb704d in run_file (p_cf=0x7ffea08f3a70, filename=0x1692d10 L"quickstart.py", fp=0x175d9f0) at Modules/main.c:320
#86 Py_Main (argc=argc@entry=2, argv=argv@entry=0x16908e0) at Modules/main.c:780
#87 0x0000000000400bbc in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69

(omitted the middle of the stack)

I found this recent TensorFlow issue, which may be related: https://github.com/tensorflow/tensorflow/issues/18737. It looks like it should only happen in distributed mode, though, which isn’t the case here.

Using tensorflow 1.8.

sohakes on May 4, 2018
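
The gdb backtrace above shows only native frames. One way to see which Python line is blocked is the standard library's faulthandler module, sketched below. The function names `arm_hang_watchdog` and `disarm_hang_watchdog` and the default timeout are assumptions for illustration, not anything from tensorforce or the quickstart. This only helps locate the hang; it does not fix it.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300.0, stream=sys.stderr):
    """Dump the Python traceback of every thread each time timeout_s elapses.

    Hypothetical helper. The stream must be a real file with a file
    descriptor, because faulthandler writes to the descriptor directly.
    """
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=stream, exit=False
    )


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training finished normally."""
    faulthandler.cancel_dump_traceback_later()
```

A possible use is to call `arm_hang_watchdog(300)` at the top of quickstart.py and `disarm_hang_watchdog()` after the training loop. The dump is produced by a watchdog thread inside the interpreter, so it should still fire when the main thread is blocked in native code and Ctrl+C is ignored.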

Same issue on tensorflow-gpu 1.7. After downgrading to 1.5, as @gian1312 suggested, it works again.

oatuy on May 2, 2018

I had the same issue. After downgrading to tensorflow-gpu 1.5, it started working again.

gian1312 on Apr 24, 2018
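
The version reports in the comments above can be summarized in a small check. This is a sketch only: the names `thread_status`, `REPORTED_HANGING` and `REPORTED_WORKING` are hypothetical, and the version sets contain nothing beyond what was reported in this thread (1.7 and 1.8 hanging, 1.5 working). Versions not mentioned here, such as 1.6, are deliberately left as unknown.

```python
# Versions mentioned in this thread only; not an authoritative compatibility list.
REPORTED_HANGING = {(1, 7), (1, 8)}
REPORTED_WORKING = {(1, 5)}


def parse_major_minor(version):
    """Turn a version string such as '1.7.0' into the tuple (1, 7)."""
    major, minor = version.split('.')[:2]
    return int(major), int(minor)


def thread_status(version):
    """Classify a TensorFlow version by what commenters in this thread reported."""
    key = parse_major_minor(version)
    if key in REPORTED_HANGING:
        return 'reported hanging'
    if key in REPORTED_WORKING:
        return 'reported working'
    return 'not reported'
```

In a real environment the installed version string could be taken from `tensorflow.__version__` and passed to `thread_status`.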