tensorflow: TF freezes and gets killed while training/saving a network

I am trying to train a deep network from scratch (a 4-layer CIFAR-style network) on a collection of 100K images. The TF process hangs (either during training or while saving a checkpoint with tf.train.Saver) and then gets killed without any error message.

I’ve tried the following things, with no success:

a. Reduced the batch size from 32 to 8.

b. Set the config’s gpu_options.allow_growth option to True (see the sketch after this list)

But the problem still persists.
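For reference, a minimal sketch of that growth setting, assuming the usual tf.ConfigProto/tf.Session setup (the train_op here is a stand-in for the real training step, not the reporter's code):

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the full 12 GB up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# train_op stands in for the actual training step of the 4-layer network.
train_op = tf.no_op()
with tf.Session(config=config) as sess:
    sess.run(tf.initialize_all_variables())  # variable initializer as of TF 0.11
    sess.run(train_op)
```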

Has anybody else faced this issue? Is this caused by insufficient memory? Is there a way to train a model under constrained memory conditions (although 12 GB isn’t bad)? Any tips to avoid this would be very helpful.
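On the constrained-memory question: besides allow_growth, TensorFlow can be capped to a fixed fraction of GPU memory. A minimal sketch (the 0.5 fraction is an arbitrary example, not a recommendation):

```python
import tensorflow as tf

# Cap TensorFlow at roughly half of the K40c's 12 GB; the fraction is arbitrary.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)
```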

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

I’ve looked at other similar issues but haven’t found a useful solution:

https://github.com/tensorflow/tensorflow/issues/2121
http://stackoverflow.com/questions/38958737/tensorflow-training-got-stuck-after-some-steps-how-to-investigate
https://github.com/tensorflow/tensorflow/issues/1962

Environment info

GPU details: I am running this model on a Tesla K40c (12 GB memory).
Operating system: 4.7.0-1-amd64 #1 SMP Debian 4.7.6-1 (2016-10-07) x86_64 GNU/Linux

Installed versions of CUDA and cuDNN: CUDA 8.0 and cuDNN 5

/opt/cuda-8.0/lib64/libcudnn.so.5
/opt/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0

  1. The output from python -c "import tensorflow; print(tensorflow.__version__)": 0.11.0rc1

If installed from source, provide

  1. The commit hash (git rev-parse HEAD): ec7f37e40fedb23435bfb7e28668e5fa63ff52f3
  2. The output of bazel version

Build label: 0.3.2
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Oct 7 17:25:10 2016 (1475861110)
Build timestamp: 1475861110
Build timestamp as int: 1475861110

If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)

This issue happens when I am trying to train or save a model.
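A hypothetical sketch of that train/save pattern (the toy model, random data, and /tmp/model.ckpt path are illustrative stand-ins, not the code from the report):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the 4-layer CIFAR-style network described above.
images = tf.placeholder(tf.float32, [None, 32, 32, 3])
labels = tf.placeholder(tf.int64, [None])
logits = tf.contrib.layers.fully_connected(
    tf.contrib.layers.flatten(images), 10, activation_fn=None)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for step in range(1000):
        batch = np.random.rand(8, 32, 32, 3).astype(np.float32)  # batch size 8
        batch_labels = np.random.randint(0, 10, size=8)
        sess.run(train_op, feed_dict={images: batch, labels: batch_labels})
        if step % 100 == 0:
            # The reported hang occurs either in sess.run above or in save() here.
            saver.save(sess, '/tmp/model.ckpt', global_step=step)
```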

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

@nroth1 – a common cause of “random hangs” is a deadlock in the input queues; a partial solution is to set an operation timeout, as here, and retry on error. More fundamentally, analyze your queues and prevent the deadlock from occurring, as here.
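Roughly what the timeout-and-retry suggestion looks like in code (a sketch with a toy queue; the 60-second timeout and the retry policy are illustrative, not taken from the linked examples):

```python
import tensorflow as tf

# Fail queue-blocked ops after 60 s instead of hanging forever.
config = tf.ConfigProto(operation_timeout_in_ms=60000)

# Toy queue-based input stage standing in for a real file/batch pipeline.
queue = tf.FIFOQueue(capacity=32, dtypes=[tf.float32])
enqueue_op = queue.enqueue(tf.random_normal([]))
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op]))
train_op = queue.dequeue()  # stand-in for the real training step

with tf.Session(config=config) as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for _ in range(10):
            try:
                sess.run(train_op)
            except tf.errors.DeadlineExceededError:
                # Timed out on a starved queue; retry instead of deadlocking.
                print('step timed out, retrying')
    finally:
        # Shutting down queue runners cleanly avoids threads blocking on exit.
        coord.request_stop()
        coord.join(threads)
```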

On Tue, Nov 1, 2016 at 2:45 PM, nroth1 notifications@github.com wrote:

I am seeing what I think is a similar issue, but am only training on CPU; when I sample the process in a hung state, I get the following info. In my case, I just see an indefinite hang. It seems to happen randomly, but consistently if I run the program for a few hours, unfortunately. I am also on version 0.11.0rc0, if that matters.

tflow_hang.txt https://github.com/tensorflow/tensorflow/files/565125/tflow_hang.txt
