tensorflow: TF freezes and gets killed while training/saving a network
I am trying to train a deep network from scratch (a 4-layer CIFAR-style network) on a collection of 100K images. The TF process hangs (either during training or while saving a checkpoint with tf.train.Saver) and is then killed without any error message.
I’ve tried the following things, without success:
a. Reduced the batch size from 32 to 8.
b. Set the session config’s `gpu_options.allow_growth` option to True (roughly as in the sketch after this list).
But the problem still persists.
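For reference, this is roughly how the growth option was set; a minimal sketch rather than the actual training script:

```python
import tensorflow as tf

# Let TensorFlow grow its GPU allocation on demand instead of grabbing
# nearly all 12 GB up front (TF 0.11-era session config).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)
```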
Has anybody else faced this issue? Is this because of insufficient memory? Is there a way to train a model under constrained memory conditions (although 12 GB isn’t bad)? Any tips to avoid this would be very helpful.
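One related knob for the constrained-memory question is the per-process GPU memory fraction. A hedged sketch (the 0.5 value is just an example, and this caps GPU memory only, not host RAM):

```python
import tensorflow as tf

# Limit this process to roughly half of the K40c's memory
# (the 0.5 fraction is only an example; this bounds GPU memory, not host RAM).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5

sess = tf.Session(config=config)
```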
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
I’ve looked at other similar issues but haven’t found a working solution:
- https://github.com/tensorflow/tensorflow/issues/2121
- http://stackoverflow.com/questions/38958737/tensorflow-training-got-stuck-after-some-steps-how-to-investigate
- https://github.com/tensorflow/tensorflow/issues/1962
Environment info
- GPU details: Tesla K40c (12 GB memory)
- Operating system: 4.7.0-1-amd64 #1 SMP Debian 4.7.6-1 (2016-10-07) x86_64 GNU/Linux
- Installed versions of CUDA and cuDNN: CUDA 8.0, cuDNN 5
  /opt/cuda-8.0/lib64/libcudnn.so.5
  /opt/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
- The output of `python -c "import tensorflow; print(tensorflow.__version__)"`: 0.11.0rc1
If installed from source, provide:
- The commit hash (`git rev-parse HEAD`): ec7f37e40fedb23435bfb7e28668e5fa63ff52f3
- The output of `bazel version`:
  Build label: 0.3.2
  Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
  Build time: Fri Oct 7 17:25:10 2016 (1475861110)
  Build timestamp: 1475861110
  Build timestamp as int: 1475861110
If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)
The issue occurs whenever I try to train or save a model; roughly, the code follows the pattern sketched below.
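A simplified, runnable stand-in for the failing script (placeholder model and synthetic data rather than the real 4-layer CIFAR network and its input pipeline):

```python
import tensorflow as tf

# Simplified stand-in: a tiny model on synthetic data, not the real
# 4-layer CIFAR network or its queue-based input pipeline.
images = tf.random_normal([8, 32, 32, 3])
labels = tf.random_uniform([8], maxval=10, dtype=tf.int32)

flat = tf.reshape(images, [8, 32 * 32 * 3])
logits = tf.contrib.layers.fully_connected(flat, 10, activation_fn=None)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())   # TF 0.11-era initializer
    for step in range(1000):
        sess.run(train_op)
        if step % 100 == 0:
            # Saving a checkpoint is one of the points where the real job hangs.
            saver.save(sess, '/tmp/model.ckpt', global_step=step)
```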
About this issue
- State: closed
- Created 8 years ago
- Comments: 16 (8 by maintainers)
@nroth1 – a common cause of “random hangs” is a deadlock in the input queues. A partial fix is to set an operation timeout and retry on error (sketched below); the more fundamental fix is to analyze your queues and prevent the deadlock from occurring in the first place.
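A minimal sketch of that suggestion, assuming a queue-fed `train_op` like the one above; the `train_with_timeout` helper and the 60-second deadline are illustrative, not taken from the original code:

```python
import tensorflow as tf

def train_with_timeout(train_op, num_steps=1000):
    """Run `train_op` with a per-op deadline so a stuck queue raises an
    error instead of hanging the process (deadline value is illustrative)."""
    config = tf.ConfigProto(operation_timeout_in_ms=60000)
    with tf.Session(config=config) as sess:
        sess.run(tf.initialize_all_variables())
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            for _ in range(num_steps):
                try:
                    sess.run(train_op)
                except tf.errors.DeadlineExceededError:
                    # A run call exceeded the deadline: a queue is likely
                    # starved (readers finished, capacity mismatch, etc.).
                    # Retry, or inspect queue sizes here to locate the deadlock.
                    print('step timed out; retrying')
        finally:
            coord.request_stop()
            coord.join(threads)
```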