TFFRCNN: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096]

I am trying to run Faster R-CNN on a custom dataset based on PASCAL VOC, but I get this error when I start training:

Stats:
Limit:        1696386252
InUse:        1685909760
MaxInUse:     1696386048
NumAllocs:           152
MaxAllocSize:  533417472

2017-06-03 04:51:48.992694: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *********************************************************************************************xxxxxxx
2017-06-03 04:51:48.992751: W tensorflow/core/framework/op_kernel.cc:1152] Resource exhausted: OOM when allocating tensor with shape[25088,4096]
Traceback (most recent call last):
  File "./faster_rcnn/train_net.py", line 109, in <module>
    restore=bool(int(args.restore)))
  File "./faster_rcnn/…/lib/fast_rcnn/train.py", line 400, in train_net
    sw.train_model(sess, max_iters, restore=restore)
  File "./faster_rcnn/…/lib/fast_rcnn/train.py", line 148, in train_model
    sess.run(tf.global_variables_initializer())
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4096]
    [[Node: fc6/biases/Momentum/Assign = Assign[T=DT_FLOAT, _class=["loc:@fc6/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](fc6/biases/Momentum, fc6/biases/Momentum/Initializer/Const)]]

Caused by op u'fc6/biases/Momentum/Assign', defined at:
  File "./faster_rcnn/train_net.py", line 109, in <module>
    restore=bool(int(args.restore)))
  File "./faster_rcnn/…/lib/fast_rcnn/train.py", line 400, in train_net
    sw.train_model(sess, max_iters, restore=restore)
  File "./faster_rcnn/…/lib/fast_rcnn/train.py", line 143, in train_model
    train_op = opt.apply_gradients(zip(grads, tvars), global_step=global_step)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 446, in apply_gradients
    self._create_slots([_get_variable_for(v) for v in var_list])
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/training/momentum.py", line 63, in _create_slots
    self._zeros_slot(v, "momentum", self._name)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 766, in _zeros_slot
    named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 174, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 146, in create_slot_with_initializer
    dtype)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/training/slot_creator.py", line 66, in _create_slot_var
    validate_shape=validate_shape)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1049, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 948, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 356, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
    use_resource=use_resource)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 714, in _get_single_variable
    validate_shape=validate_shape)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 197, in __init__
    expected_shape=expected_shape)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 306, in _init_from_args
    validate_shape=validate_shape).op
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/state_ops.py", line 270, in assign
    validate_shape=validate_shape)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_state_ops.py", line 47, in assign
    use_locking=use_locking, name=name)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/hadi/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096]
    [[Node: fc6/biases/Momentum/Assign = Assign[T=DT_FLOAT, _class=["loc:@fc6/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](fc6/biases/Momentum, fc6/biases/Momentum/Initializer/Const)]]

How can I make this work? I don't know how to reduce the batch size to see if that helps.


Most upvoted comments

Fixed! Using the settings below, training starts now, at Iter 450/70000 (it may need a couple of hours):

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.per_process_gpu_memory_fraction = 0.90

To keep an eye on GPU usage: sudo watch nvidia-smi

You can tweak the source code to expand GPU memory usage: try modifying the parameters on the lines below in TFFRCNN/lib/fast_rcnn/train.py

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.per_process_gpu_memory_fraction = 0.40

config.gpu_options.allow_growth = True is always my favored option, since you don't need to worry about the actual usage.
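
For reference, a minimal sketch of how allow_growth is typically passed to the session (TF 1.x API, as in the traceback above; the surrounding training code is assumed here, not copied from TFFRCNN):

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True  # allocate GPU memory on demand instead of reserving a fixed fraction

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training ops here ...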

I got a similar error, and it was resolved by reducing the batch_size.
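
In the Faster R-CNN family of codebases, the batch size usually lives in the training config rather than a command-line flag. A minimal sketch of where to shrink it, assuming TFFRCNN follows the py-faster-rcnn config layout (TRAIN.IMS_PER_BATCH and TRAIN.BATCH_SIZE in lib/fast_rcnn/config.py; check your copy, the names may differ):

# Sketch only: the TRAIN.* names below are the py-faster-rcnn defaults,
# assumed to carry over to TFFRCNN's lib/fast_rcnn/config.py.
from fast_rcnn.config import cfg

cfg.TRAIN.IMS_PER_BATCH = 1   # images per minibatch
cfg.TRAIN.BATCH_SIZE = 64     # RoIs sampled per image (128 is a common default)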

Take a look at the memory usage of your GPU with:

nvidia-smi -l 3

If there is still free memory that can be used, edit lib/fast_rcnn/train.py (~line 396): config.gpu_options.per_process_gpu_memory_fraction = 0.40 and change 0.40 to a larger proportion (e.g. 0.9).
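
As a rough sanity check (assuming the GPU in the original post has about 4 GB of memory, which is not stated in the issue), the allocator Limit in the log above is consistent with the default 0.40 fraction, which is why raising it gives the process more headroom:

limit_bytes = 1696386252                   # "Limit" from the BFC allocator stats above
assumed_total = 4 * 1024 ** 3              # assumption: a ~4 GB card
print(limit_bytes / float(assumed_total))  # ~0.39, i.e. roughly per_process_gpu_memory_fraction = 0.40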