tensorpack: Exception in thread EnqueueThread QueueInput/input_queue - msgpack exceeds max_bin_len()

1. What you did:

(1) If you’re using examples, what’s the command you run: python3 train.py --config MODE_MASK=True MODE_FPN=True DATA.BASEDIR="coco" BACKBONE.WEIGHTS=R50.npz

(2) If you’re using examples, have you made any changes to the examples? Paste git status; git diff here: Changed the config to adapt it to the WIDER FACE dataset.

2. What you observed:

(1) Include the ENTIRE logs here:

[0515 23:27:44 @base.py:275] Start Epoch 15 ...
[0515 23:28:41 @input_source.py:173] ERR [EnqueueThread] Exception in thread EnqueueThread QueueInput/input_queue:
Traceback (most recent call last):
  File "/home/nga/FasterRCNN/tensorpack/input_source/input_source.py", line 161, in run
    dp = next(self._itr)
  File "/home/nga/FasterRCNN/tensorpack/dataflow/common.py", line 370, in __iter__
    for dp in self.ds:
  File "/home/nga/FasterRCNN/tensorpack/dataflow/parallel_map.py", line 314, in __iter__
    for dp in super(MultiProcessMapDataZMQ, self).__iter__():
  File "/home/nga/FasterRCNN/tensorpack/dataflow/parallel_map.py", line 89, in __iter__
    for dp in self.get_data_non_strict():
  File "/home/nga/FasterRCNN/tensorpack/dataflow/parallel_map.py", line 65, in get_data_non_strict
    ret = self._recv()
  File "/home/nga/FasterRCNN/tensorpack/dataflow/parallel_map.py", line 309, in _recv
    dp = loads(msg[1])
  File "/home/nga/FasterRCNN/tensorpack/utils/serialize.py", line 43, in loads_msgpack
    max_str_len=MAX_MSGPACK_LEN)
  File "/home/nga/.local/lib/python3.6/site-packages/msgpack_numpy.py", line 214, in unpackb
    return _unpackb(packed, **kwargs)
  File "msgpack/_unpacker.pyx", line 200, in msgpack._unpacker.unpackb
ValueError: 1916481600 exceeds max_bin_len(1000000000)
[0515 23:28:41 @input_source.py:179] [EnqueueThread] Thread EnqueueThread QueueInput/input_queue Exited.
[0515 23:28:58 @base.py:291] Training was stopped by exception FIFOQueue '_0_QueueInput/input_queue' is closed and has insufficient elements (requested 1, current size 0)
	 [[node QueueInput/input_deque (defined at /home/nga/FasterRCNN/tensorpack/input_source/input_source.py:272)  = QueueDequeueV2[component_types=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT64, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](QueueInput/input_queue)]]

Caused by op 'QueueInput/input_deque', defined at:
  File "train.py", line 120, in <module>
    launch_train_with_config(traincfg, trainer)
  File "/home/nga/FasterRCNN/tensorpack/train/interface.py", line 91, in launch_train_with_config
    model.build_graph, model.get_optimizer)
  File "/home/nga/FasterRCNN/tensorpack/utils/argtools.py", line 176, in wrapper
    return func(*args, **kwargs)
  File "/home/nga/FasterRCNN/tensorpack/train/tower.py", line 224, in setup_graph
    train_callbacks = self._setup_graph(input, get_cost_fn, get_opt_fn)
  File "/home/nga/FasterRCNN/tensorpack/train/trainers.py", line 189, in _setup_graph
    grad_list = self._builder.call_for_each_tower(tower_fn)
  File "/home/nga/FasterRCNN/tensorpack/graph_builder/training.py", line 226, in call_for_each_tower
    use_vs=[False] + [True] * (len(self.towers) - 1))
  File "/home/nga/FasterRCNN/tensorpack/graph_builder/training.py", line 122, in build_on_towers
    return DataParallelBuilder.call_for_each_tower(*args, **kwargs)
  File "/home/nga/FasterRCNN/tensorpack/graph_builder/training.py", line 117, in call_for_each_tower
    ret.append(func())
  File "/home/nga/FasterRCNN/tensorpack/train/tower.py", line 252, in get_grad_fn
    inputs = input.get_input_tensors()
  File "/home/nga/FasterRCNN/tensorpack/input_source/input_source_base.py", line 83, in get_input_tensors
    return self._get_input_tensors()
  File "/home/nga/FasterRCNN/tensorpack/input_source/input_source.py", line 272, in _get_input_tensors
    ret = self.queue.dequeue(name='input_deque')
  File "/home/nga/.local/lib/python3.6/site-packages/tensorflow/python/ops/data_flow_ops.py", line 435, in dequeue
    self._queue_ref, self._dtypes, name=name)
  File "/home/nga/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3741, in queue_dequeue_v2
    timeout_ms=timeout_ms, name=name)
  File "/home/nga/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/nga/.local/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/nga/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/nga/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): FIFOQueue '_0_QueueInput/input_queue' is closed and has insufficient elements (requested 1, current size 0)
	 [[node QueueInput/input_deque (defined at /home/nga/FasterRCNN/tensorpack/input_source/input_source.py:272)  = QueueDequeueV2[component_types=[DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT64, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](QueueInput/input_queue)]]

(2) Other observations, if any: The error occurs at a random epoch: sometimes epoch 12, sometimes 6 or 9, and so on.

3. What you expected, if not obvious.

How can I resolve this issue?

4. Your environment:

Python==3.6.7
Tensorpack==0.9.4
TensorFlow==1.12.0
msgpack==0.5.6

1 GPU: GeForce GTX 1080; free RAM: 124.25/125.82 GB

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 21

Most upvoted comments

OMG. Then this is a real issue. For now you can unblock yourself by changing the size limit at: https://github.com/tensorpack/tensorpack/blob/42416945c1e36a5f1d4350ee9c1ae8b134cbe841/tensorpack/utils/serialize.py#L19

I’ll see what’s a better way to fix this.
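
In the meantime, here is a minimal sketch of that workaround, assuming the module-level MAX_MSGPACK_LEN constant shown in the traceback is still what serialize.py#L19 points at in your checkout. The new limit of 2**31 - 1 is an arbitrary choice, picked only because it comfortably covers the 1916481600-byte datapoint in the error; patching it in the main process before the trainer is built should affect the failing loads_msgpack() call, but treat this as an assumption rather than an official API.

    # Hypothetical workaround sketch: raise tensorpack's msgpack unpack limits
    # before any dataflow or trainer is constructed. MAX_MSGPACK_LEN is the
    # constant passed as max_bin_len/max_str_len inside loads_msgpack(), so
    # bumping it here lifts the 1e9-byte cap hit in the log above.
    import tensorpack.utils.serialize as serialize

    serialize.MAX_MSGPACK_LEN = 2 ** 31 - 1  # ~2.1 GB instead of the default 1e9

Editing the constant directly in tensorpack/utils/serialize.py (as linked above) has the same effect; the monkey-patch just avoids modifying the installed package.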