tensorflow: ResourceExhaustedError in CNN/MNIST example (with GPU)
(I’m using a GPU (GTX 980) with CUDA 7.0 & cuDNN v2, on Ubuntu 14.04.) I have gone through the MNIST tutorial: http://tensorflow.org/tutorials/mnist/pros/index.md
Everything was going well except for the last two lines:
print "test accuracy %g"%accuracy.eval(feed_dict={
x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})
Executing these lines, I got an error:
ResourceExhaustedError: OOM when allocating tensor with shape dim { size: 10000 } dim { size: 18 } dim { size: 18 } dim { size: 32 }
I think the basic reason for this error is that the test data cannot all be allocated on the GPU. Is this a bug or not? Is there a good way to avoid this issue?
About this issue
- Original URL
- State: closed
- Created 9 years ago
- Reactions: 68
- Comments: 45 (5 by maintainers)
Commits related to this issue
- [OpenCL] Fix allocator destruction race condition (#136) (#11968) * [OpenCL] Fix allocator destruction race condition (#136) * [OpenCL] Changes SYCL Interface construction Uses C++11 static ini... — committed to tensorflow/tensorflow by deleted user 7 years ago
- [OpenCL] Fix allocator destruction race condition (#136) * [OpenCL] Changes SYCL Interface construction Uses C++11 static initialisation to provide singleton instance, rather than a mutex and pointe... — committed to codeplaysoftware/tensorflow by jwlawson 7 years ago
- Compute test accuracy in batches to avoid OOM on GPUs. Reported here: https://github.com/tensorflow/tensorflow/issues/136 Alternative to this without changing convolutional.py: https://github.com/tens... — committed to thisisrandy/tensorflow by thisisrandy 7 years ago
- Compute test accuracy in batches to avoid OOM on GPUs. Reported here: https://github.com/tensorflow/tensorflow/issues/136 Alternative to this for mnist_deep.py: https://github.com/tensorflow/tensorflo... — committed to thisisrandy/tensorflow by thisisrandy 7 years ago
- Upgrade/fix/simplify store to load forwarding - fix store to load forwarding for a certain set of cases (where forwarding shouldn't have happened); use AffineValueMap difference based MemRefAcces... — committed to tensorflow/tensorflow by bondhugula 5 years ago
- Upgrade/fix/simplify store to load forwarding - fix store to load forwarding for a certain set of cases (where forwarding shouldn't have happened); use AffineValueMap difference based MemRefAcces... — committed to hristo-vrigazov/tensorflow by bondhugula 5 years ago
- Merge pull request #136 from yxsamliu/hip-clang-philox Add __device__ to PHILOX_DEVICE_FUNC for hip-clang — committed to Cerebras/tensorflow by whchung 6 years ago
If you don’t have enough memory on your GPU to fit the whole test set, you can feed it to the eval graph in small batches via feed_dict, just as the example does with the training data.
I am doing something similar to what @Shuto050505 did here, but I compute the mean of the per-batch accuracy over the whole test set, instead of evaluating all 10,000 test images in a single feed_dict. The idea is to replace the final evaluation line with a batched loop, as sketched below.
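A minimal sketch of that loop, assuming the `accuracy` op, the placeholders `x`, `y_`, and `keep_prob`, and the `mnist` dataset from the tutorial are already in scope (the batch size of 50 is an arbitrary choice):

```python
# Evaluate in batches and average, instead of feeding all 10,000 test images at once.
batch_size = 50
num_batches = mnist.test.num_examples // batch_size
total_acc = 0.0
for _ in range(num_batches):
    batch_x, batch_y = mnist.test.next_batch(batch_size)
    total_acc += accuracy.eval(feed_dict={x: batch_x, y_: batch_y, keep_prob: 1.0})
print("test accuracy %g" % (total_acc / num_batches))
```

Since every batch has the same size and 50 divides the 10,000 test examples evenly, the mean of the per-batch accuracies equals the accuracy over the whole test set.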
With this I get a mean test accuracy of 0.9922. EDIT: Updated the accuracy after correcting the training process.
“Just reduce the batch size while feeding the test data to the GPU.” I did: batch_size=1. Now what? 0.5? 0? I have two 1080 Tis and am still running out of memory.
I’m using a 2 GB 860M. mnist.test.images.shape is (10000, 784); restricting the evaluation to at most 7000 examples made it work, either by batch size or by slicing:
batch size:
```python
batch_tx, batch_ty = mnist.test.next_batch(10)
print("test accuracy %g" % accuracy.eval(feed_dict={x: batch_tx, y_: batch_ty, keep_prob: 1.0}))
```
-> test accuracy 0.992

or slice:
```python
test_image = mnist.test.images[0:7000, :]
test_label = mnist.test.labels[0:7000, :]
print("test accuracy %g" % accuracy.eval(feed_dict={x: test_image, y_: test_label, keep_prob: 1.0}))
```
-> test accuracy 0.992

FYI: I had the same error (32 GB RAM, Titan X 12 GB); restarting the IPython notebook helped.
I have this same problem with this same example on Ubuntu 14.04 with an Nvidia GTX 970. It hits 3.35 GB usage and then crashes.
@mtourne Thanks a lot for the fix. It works for my GPU (previously I also hit the ResourceExhausted error).
Another way, following @vrv's advice from this discussion, is to install TensorFlow from the latest source and configure your session to use the BFC allocator before running it, like this:
```python
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config=config) as s:
    ...
```
The original example from @vrv is here. My GPU has 6 GB, though. The BFC allocator seems to allocate memory dynamically according to the GPU's capacity, so it will probably work with cards that have less memory as well.
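A related knob that is not mentioned in this comment (an assumption: it requires a TensorFlow 1.x build where `gpu_options` exposes it) is `allow_growth`, which makes the process claim GPU memory on demand instead of reserving most of it up front:

```python
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow the GPU memory pool on demand
# Alternatively, cap the per-process share of GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
with tf.Session(config=config) as sess:
    pass  # build and run the graph here
```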
I was trying the Deep MNIST example from the tutorial on the site. Using the entire test set throws “ran out of memory trying to allocate 78.1KiB” (so close!). Feeding the test images in batches works, but pushing it to batches of 5000, just so I could see where exactly the program breaks, gave me warnings that it ran out of memory. The strange thing is that, while the previous attempt without batching completely broke the program, this one didn’t: I still got the result, just with a lot of out-of-memory warnings.
I am using a 2 GB 960M, and I can confirm that the BFC option is enabled, since there were related errors about chunks when it completely crashed the first time.
Same issue with the latest stable release of TensorFlow, on a Quadro 970M with 2 GB of memory. According to the logging output, the BFC allocator was being used. 29.9 MB doesn’t seem like an awful lot, but I’m assuming it has to be allocated contiguously.
I had a similar problem just now: a batch of images had to be analysed and, by mistake, I created a function that loaded the model into memory again for every single image to be recognised. Loading the model once, outside that function, fixed it. Hopefully it will help someone!
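A minimal sketch of that mistake and the fix; `build_model` and `recognise` are hypothetical names standing in for the actual code:

```python
def build_model():
    """Hypothetical loader: stands in for whatever builds the network and loads weights."""
    ...

# Anti-pattern: reloading the model inside the per-image function. In graph-mode
# TensorFlow each load can add new ops to the default graph, so memory use grows
# with every image processed.
def recognise_leaky(image):
    model = build_model()  # loaded again on every call
    return model(image)

# Fix: load the model once and reuse it for the whole batch of images.
model = build_model()

def recognise(image):
    return model(image)
```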
I am new to TensorFlow and machine learning. Recently I have been working on a model. My model looks like this:
- Character-level embedding vector -> embedding lookup -> LSTM1
- Word-level embedding vector -> embedding lookup -> LSTM2
- [LSTM1 + LSTM2] -> single-layer MLP -> softmax layer
- [LSTM1 + LSTM2] -> single-layer MLP -> WGAN discriminator
While working on this model I got the following error. I thought my batch size was too big, so I tried to reduce it from 20 to 10, but that didn’t work.
A tensor with shape [24760, 100] means 24760 * 100 * 32 / (1024 * 1024) = 75.*** MB of memory. I am running the code on a Titan X (11 GB) GPU. What could be going wrong? Why did this type of error occur?
Extra info: the size of LSTM1 is 100; for the bidirectional LSTM it becomes 200. The size of LSTM2 is 300; for the bidirectional LSTM it becomes 600.
Note: the error occurred after 32 epochs. My question is why the error shows up after 32 epochs rather than in the very first epoch.
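A side note on the size arithmetic above (a general float32 fact, not something from the thread): multiplying by 32 counts bits, while at 4 bytes per float32 element the tensor itself is closer to 9.4 MB:

```python
# Footprint of a [24760, 100] float32 tensor:
elements = 24760 * 100                      # 2,476,000 values
size_mb = elements * 4 / (1024.0 * 1024.0)  # 4 bytes per float32 element
print("%.2f MB" % size_mb)                  # ~9.45 MB; counting bits instead gives ~75.56
```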