tensorflow: ResourceExhaustedError in CNN/MNIST example (with GPU)

(I'm using a GPU (GTX 980) with CUDA 7.0 and cuDNN v2 on Ubuntu 14.04.) I have gone through the MNIST tutorial: http://tensorflow.org/tutorials/mnist/pros/index.md

Everything was going well except for the last two lines:

print "test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})

Executing these lines, I got an error:

ResourceExhaustedError: OOM when allocating tensor with shape dim { size: 10000 } dim { size: 18 } dim { size: 18 } dim { size: 32 }

I think the basic reason for this error is that the test data cannot be allocated on the GPU device. Is this a bug or not? Is there a good way to avoid this issue?

About this issue

  • State: closed
  • Created 9 years ago
  • Reactions: 68
  • Comments: 45 (5 by maintainers)

Most upvoted comments

If you don’t have enough memory on your GPU to fit the whole test data, you could feed it in small batches to the eval graph using feed_dict like the example does with the training data.

I am doing something similar to what @Shuto050505 did here, but I compute the mean accuracy over all batches of the test data. Replace the following line:

print("test accuracy %g"%accuracy.eval(feed_dict={ 
      x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

by this:

batch_size = 50
batch_num = int(mnist.test.num_examples / batch_size)
test_accuracy = 0
    
for i in range(batch_num):
    batch = mnist.test.next_batch(batch_size)
    test_accuracy += accuracy.eval(feed_dict={x: batch[0],
                                              y_: batch[1],
                                              keep_prob: 1.0})

test_accuracy /= batch_num
print("test accuracy %g"%test_accuracy)

With this I get the mean test accuracy: test accuracy 0.9922.

EDIT: Updated accuracy with corrected training process
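
One caveat about the averaging above: dividing the summed per-batch accuracies by batch_num is only exact when every batch has the same size (it does here, since 10000 is divisible by 50). If the test-set size were not a multiple of the batch size, weighting each batch by its size keeps the estimate exact. A minimal sketch of that variant, assuming the same x, y_, keep_prob and accuracy tensors from the tutorial:

batch_size = 50
num_examples = mnist.test.num_examples
weighted_sum = 0.0

for start in range(0, num_examples, batch_size):
    end = min(start + batch_size, num_examples)
    batch_acc = accuracy.eval(feed_dict={x: mnist.test.images[start:end],
                                         y_: mnist.test.labels[start:end],
                                         keep_prob: 1.0})
    weighted_sum += batch_acc * (end - start)  # weight by actual batch size

print("test accuracy %g" % (weighted_sum / num_examples))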

"Just reduce the batch size while feeding the test data to the GPU." batch_size=1.

Now what? 0.5? 0?

I have two 1080 Tis, and I'm still running out of memory.

I'm using a 2 GB 860M. mnist.test.images.shape is (10000, 784); limiting the test set to 7000 examples, I could make it work. Either evaluate on a small batch:

batch_tx, batch_ty = mnist.test.next_batch(10)
print("test accuracy %g"%accuracy.eval(feed_dict={x: batch_tx, y_: batch_ty, keep_prob: 1.0}))
-> test accuracy 0.992

or slice the test set:

test_image = mnist.test.images[0:7000, :]
test_label = mnist.test.labels[0:7000, :]
print("test accuracy %g"%accuracy.eval(feed_dict={x: test_image, y_: test_label, keep_prob: 1.0}))
-> test accuracy 0.992

FYI: I had the same error (32 GB RAM, Titan X 12 GB). Restarting the IPython notebook helped.

I have the same problem on this same example, with Ubuntu 14.04 and an Nvidia GTX 970. It hits 3.35 GB usage and then crashes.

@mtourne Thanks a lot for the fix. It works for my GPU (previously I also got the ResourceExhaustedError).

Another way, following @vrv's advice from this discussion, is to install TensorFlow from the latest source and configure your session to use the BFC allocator before running it, like this:

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config=config) as s:

The original example from @vrv is here. My GPU has 6 GB, though. This allocator seems to allocate memory dynamically according to the GPU's memory, so it will probably work with cards that have less memory as well.
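
For completeness, a minimal sketch of how that configuration could wrap the evaluation from this thread, assuming the x, y_, keep_prob and accuracy tensors from the tutorial (if the full test set still does not fit, combine this with the batched evaluation shown earlier):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'

with tf.Session(config=config) as sess:
    # ... build the tutorial graph (x, y_, keep_prob, accuracy) and train here ...
    print("test accuracy %g" % accuracy.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))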

I was trying the Deep MNIST example from the tutorial on the site. Using the entire test set throws "ran out of memory trying to allocate 78.1KiB" (so close!). Feeding the test images in batches works, but pushing it to batches of 5000, just so I could see where exactly the program breaks, gave me a warning that it ran out of memory. The strange thing is that, while the previous attempt without batching completely broke the program, this one didn't. I still got the result, but with a lot of warnings like:

W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:217] Ran out of memory trying to allocate 2.92GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.

I am using a 2GB 960M and I can confirm that the BFC option is enabled, since there were related errors with chunks when it completely crashed the first time.

Same issue with the latest stable release of TensorFlow, on a Quadro 970M with 2 GB of memory.

According to the logging output, the BFC allocator was being used.

W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:211] Ran out of memory trying to allocate 29.91MiB.  See logs for memory state

29.9 MB doesn’t seem like an awful lot, but I’m assuming it has to be allocated contiguously.

I had a similar problem just now. A batch of images had to be analysed, and by mistake I had created a function that loaded the model into memory for every image to be recognised.

# Imports assumed here; the original snippet did not show them.
from keras.applications import vgg16 as VGG16
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

def predict(num_class, weights_path, img_path):
    # Mistake: this rebuilds the model and reloads the weights on every call.
    base_model = VGG16.VGG16(include_top=False, weights=None)
    x = base_model.output
    x = Dense(128)(x)
    x = GlobalAveragePooling2D()(x)
    predictions = Dense(num_class, activation='softmax')(x)

    model = Model(inputs=base_model.input, outputs=predictions)
    model.load_weights(weights_path)
    ....
    ....

Hopefully, it will help someone!
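
A sketch of the corrected pattern under the same assumptions (num_class, weights_path and img_paths are hypothetical stand-ins for your own data, and the imports above are reused): build the model and load its weights once, then reuse it for every image, so the weights are placed on the GPU only a single time.

from keras.preprocessing import image
import numpy as np

def build_model(num_class, weights_path):
    # Same layers as above, but constructed only once.
    base_model = VGG16.VGG16(include_top=False, weights=None)
    x = base_model.output
    x = Dense(128)(x)
    x = GlobalAveragePooling2D()(x)
    predictions = Dense(num_class, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    model.load_weights(weights_path)
    return model

model = build_model(num_class, weights_path)   # weights go onto the GPU once

for img_path in img_paths:
    img = image.load_img(img_path, target_size=(224, 224))
    batch = np.expand_dims(image.img_to_array(img), axis=0)
    print(model.predict(batch))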

I am new to TensorFlow and machine learning. Recently I have been working on a model. My model looks like this:

  1. Character-level embedding vector -> embedding lookup -> LSTM1

  2. Word-level embedding vector -> embedding lookup -> LSTM2

  3. [LSTM1 + LSTM2] -> single-layer MLP -> softmax layer

  4. [LSTM1 + LSTM2] -> single-layer MLP -> WGAN discriminator

While working on this model I got the following error. I thought my batch was too big, so I tried reducing the batch size from 20 to 10, but it didn't help.

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[24760,100] [[Node: chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients_2/Add_3/y, chars/bidirectional_rnn/bw/bw/while/bw/lstm_cell/BiasAdd)]] [[Node: bi-lstm/bidirectional_rnn/bw/bw/stack/_167 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_636_bi-lstm/bidirectional_rnn/bw/bw/stack", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]]

A tensor with shape [24760, 100] means 2476000 * 32 / (1024 * 1024) = 75.*** MB of memory. I am running the code on a Titan X (11 GB) GPU. What could be going wrong? Why did this type of error occur?

Extra info: the size of LSTM1 is 100; for the bidirectional LSTM it becomes 200. The size of LSTM2 is 300; for the bidirectional LSTM it becomes 600.

Note: the error occurred after 32 epochs. My question is why the error appears after 32 epochs rather than in the initial epoch.
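
This is not confirmed to be the cause here, but one common reason an OOM appears only after many epochs is that new ops keep being created inside the training loop, so the graph (and GPU memory) grows a little every step. A quick way to check, assuming a standard TF 1.x training loop (num_epochs, batches and train_op are placeholders for your own code), is to finalize the graph before training; any accidental op creation then raises an error immediately instead of silently leaking memory:

import tensorflow as tf

# ... build the whole model here: embeddings, both LSTMs, the MLP,
#     the softmax/WGAN heads, the optimizer, etc. ...

init_op = tf.global_variables_initializer()
tf.get_default_graph().finalize()      # graph is now read-only

with tf.Session() as sess:
    sess.run(init_op)
    for epoch in range(num_epochs):
        for feed in batches:
            # Any new-op creation inside this loop now raises
            # "Graph is finalized and cannot be modified"
            # instead of growing GPU memory over the epochs.
            sess.run(train_op, feed_dict=feed)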