tensorflow: Reading data with queue is even slower than using feed_dict

Reading data with feed_dict seems inefficient in TensorFlow (I found it 3-4 times slower than a Theano-based implementation with an identical network structure), so I switched to a queue-based input pipeline, as officially recommended. However, my experiment gives even worse performance. Is there anything wrong?

Related issue

#3377 Moving data from CPU to GPU is slow

Environment info

  • Operating System: Ubuntu 16.04
  • CPU: Intel i7-4790K
  • GPU: Nvidia GTX 1070 (8 GB)
  • Memory: 16 GB
  • TensorFlow version: 1.0
  • CUDA version: 8.0
  • cuDNN version: 5.0

Implementation Example

  1. Using feed_dict
import numpy as np
import tensorflow as tf

x, y = np.array(...), np.array(...)                    # full training set as numpy arrays
in_x, in_y = tf.placeholder(...), tf.placeholder(...)
# f() is some computation in the network, including embedding_lookup,
# bidirectional_dynamic_rnn and dense layers
train_op = f(in_x, in_y)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(num_epochs):  # num_epochs is 100 here
    sess.run(train_op, feed_dict={in_x: x, in_y: y})
  2. Using queue
x, y = np.array(...), np.array(...)                    # full training set as numpy arrays
x, y = tf.convert_to_tensor(x), tf.convert_to_tensor(y)
in_x, in_y = tf.train.slice_input_producer([x, y], num_epochs=100)
in_x, in_y = tf.train.batch([in_x, in_y], batch_size=32)
train_op = f(in_x, in_y)

sess = tf.Session()
# num_epochs adds a local epoch counter, so local variables must be initialized too
sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
    while not coord.should_stop():
        sess.run(train_op)
except Exception as e:  # OutOfRangeError is raised after the last epoch
    coord.request_stop(e)
finally:
    coord.request_stop()
coord.join(threads)

About this issue

  • Original URL
  • State: closed
  • Created 7 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

Just want to add something here: I implemented a multiprocess-based data-feeding pipeline for multi-task learning, with average GPU utilization >90% and quad-core CPU utilization >95%. In case anyone is interested: https://hanxiao.github.io/2017/07/07/Get-10x-Speedup-in-Tensorflow-Multi-Task-Learning-using-Python-Multiprocessing/
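For reference, a minimal sketch of that general idea (not the code from the linked post): batches are prepared in separate OS processes, so preprocessing never competes with the training loop for the GIL. Here make_batch() and num_steps are hypothetical placeholders, and sess, train_op, in_x, in_y are assumed to be set up as in the feed_dict example above.

import multiprocessing as mp

def producer(batch_queue):
    # make_batch() is a hypothetical user function returning (x_batch, y_batch) numpy arrays
    while True:
        batch_queue.put(make_batch())

batch_queue = mp.Queue(maxsize=32)     # bounded buffer of ready-to-train batches
workers = [mp.Process(target=producer, args=(batch_queue,), daemon=True)
           for _ in range(4)]          # one producer process per CPU core
for w in workers:
    w.start()

for _ in range(num_steps):
    x_batch, y_batch = batch_queue.get()   # blocks only when the buffer is empty
    sess.run(train_op, feed_dict={in_x: x_batch, in_y: y_batch})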

I troubleshot a similar issue here, and it was caused by the Python thread scheduler choosing a bad strategy. Essentially, Python would schedule the computation thread that issues a single enqueue call; this thread would then block and have to be pre-empted. Python doesn't support parallel execution of Python code, and pre-emption is slow, so this part is a performance hit. Eventually it pre-empts the dequeue (main) thread to schedule the enqueue thread, which does a single enqueue call before Python gives execution back to the main thread. This back-and-forth dance introduced a 10x slowdown in the pipeline.

One solution is to make sure that you never have queue starvation, for instance by making the enqueue side faster (e.g. by using enqueue_many, as in https://github.com/tensorflow/tensorflow/issues/3009), by making the dequeue side slower (e.g. add more computation to train_op until it is slow enough to be the bottleneck), and by letting the queue runners pre-fill the queue (make the queue larger and add time.sleep(1) right after start_queue_runners), as sketched below.
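A minimal sketch of those mitigations applied to the queue example from the question; the capacity values, num_threads=4 and the one-second sleep are illustrative, not tuned settings:

import time

in_x, in_y = tf.train.slice_input_producer([x, y], num_epochs=100,
                                           capacity=1024)   # larger input queue
in_x, in_y = tf.train.batch([in_x, in_y], batch_size=32,
                            capacity=32 * 64,               # larger batch queue
                            num_threads=4)                  # more enqueue threads
# Alternatively, feeding the full tensors to tf.train.batch(..., enqueue_many=True)
# moves many examples per enqueue call, as suggested in issue #3009.
train_op = f(in_x, in_y)

sess = tf.Session()
sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
time.sleep(1)   # give the queue runners a head start so the queue is pre-filled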