tensorflow: Networks fed via input ops run considerably slower than directly fed ones
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
Environment info
Tried both OS X and Linux (Ubuntu 16), using CPU only on both.
If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)
Any simple operation (e.g. computing logits with a 3-layer network with a simple regression in each layer) in which the data was fed via parse_example, parse_single_example, a CustomRunner feeding a RandomShuffleQueue, or a QueueRunner. In all of those cases, one epoch took considerably longer than if I saved the data in a numpy array and fed it during the call to sess.run() with feed_dict.
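For reference, here is a minimal, hedged sketch of the two feeding styles being compared, written against the TF 1.x-era graph/session API (the original report used 0.10/0.11, where a few names differ slightly). The layer sizes, batch size, and synthetic data are made up for illustration and are not the reporter's actual code.

```python
# Sketch: time one pass over synthetic data with (1) feed_dict and (2) a
# queue-based pipeline built from slice_input_producer + tf.train.batch.
import time
import numpy as np
import tensorflow as tf

N, D, H, C = 10000, 100, 64, 10          # examples, input dim, hidden, classes
BATCH = 128
data = np.random.rand(N, D).astype(np.float32)

def dense(x, n_in, n_out):
    w = tf.Variable(tf.truncated_normal([n_in, n_out], stddev=0.1))
    b = tf.Variable(tf.zeros([n_out]))
    return tf.matmul(x, w) + b

def three_layer(x):
    # "3-layer network with a simple regression in each layer"
    h1 = tf.nn.relu(dense(x, D, H))
    h2 = tf.nn.relu(dense(h1, H, H))
    return dense(h2, H, C)

# Variant 1: direct feed through a placeholder.
x_ph = tf.placeholder(tf.float32, [None, D])
logits_fed = three_layer(x_ph)

# Variant 2: queue-based input pipeline.
x_single = tf.train.slice_input_producer([data], shuffle=True)[0]
x_batch = tf.train.batch([x_single], batch_size=BATCH, capacity=1024)[0]
logits_queued = three_layer(x_batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    start = time.time()
    for i in range(0, N, BATCH):
        sess.run(logits_fed, feed_dict={x_ph: data[i:i + BATCH]})
    print('feed_dict epoch: %.3fs' % (time.time() - start))

    start = time.time()
    for _ in range(N // BATCH):
        sess.run(logits_queued)
    print('queue epoch:     %.3fs' % (time.time() - start))

    coord.request_stop()
    coord.join(threads)
```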
What other attempted solutions have you tried?
Scoured the web for solutions and ran into a couple of blog posts describing the problem, but couldn't find any solutions.
Logs or other output that would be helpful
Tried this on both TF 0.10 and the latest RC for 0.11.
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 20 (10 by maintainers)
Ah, good point, increasing capacity, number of threads, and adding `time.sleep` in the main thread to preload the examples makes the `preloaded_reader` example as fast as the feed_dict one. https://github.com/yaroslavvb/stuff/commit/226099cd883df5aa5b08483581d66960ada90e22
Note that simply increasing capacity/number of threads didn't help. I suspect Python just ran the main queue until it ran out of examples, and went back to slow behavior with blocking. Possibly this is specific to tiny models like MNIST – when the main computation loop takes 1-2 ms, the GIL is only released for about 2 ms, and there's not enough time for the data-preloading threads to pre-empt the computation thread, so that thread will always be starved of examples.
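As a hedged illustration of that workaround (not the exact code from the linked commit), the sketch below gives `tf.train.batch` more enqueuing threads and a larger capacity, then sleeps briefly before the timed loop so the queue-runner threads can fill the queue first. The thread count, capacity, and sleep duration are illustrative guesses, not tuned values.

```python
# Sketch of the workaround: more enqueue threads, a bigger queue, and a short
# sleep so the queue is preloaded before timing starts (TF 1.x-era API).
import time
import numpy as np
import tensorflow as tf

data = np.random.rand(10000, 100).astype(np.float32)
x_single = tf.train.slice_input_producer([data], shuffle=True)[0]

x_batch = tf.train.batch(
    [x_single],
    batch_size=128,
    num_threads=8,       # several threads enqueuing examples
    capacity=10000)[0]   # large enough that the queue can stay full

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    time.sleep(5)        # let the queue runners preload examples

    start = time.time()
    for _ in range(10000 // 128):
        sess.run(x_batch)            # stand-in for the real training step
    print('queued epoch: %.3fs' % (time.time() - start))

    coord.request_stop()
    coord.join(threads)
```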
Also, it seems that there's no way to increase the number of threads in `slice_input_producer`.

`QueueDequeueMany` takes a long time when it's waiting for enough items to be put on the queue. It's not computation cycles that are slowing it down. You need multiple reader threads all feeding the queue in order to unblock it. If you're using `tf.train.batch` or its friends, in TensorBoard you'll see that the queue is never full (and it should be). Increase the number of threads you use; this is a simple argument in `tf.train.batch` and friends.
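A small sketch of that suggestion, assuming the TF 1.x-era API: pass `num_threads` to `tf.train.batch` and write out the summaries it registers (including a fraction-of-capacity-full scalar for its queue) so TensorBoard can show whether the queue actually stays full. The log directory, batch parameters, and step count below are placeholders.

```python
# Sketch: bump num_threads on tf.train.batch and log its queue-fullness
# summary so TensorBoard can confirm the queue stays (nearly) full.
import numpy as np
import tensorflow as tf

data = np.random.rand(10000, 100).astype(np.float32)
x_single = tf.train.slice_input_producer([data], shuffle=True)[0]

# num_threads is the "simple argument" referred to above.
x_batch = tf.train.batch([x_single], batch_size=128,
                         num_threads=4, capacity=2048)[0]

summary_op = tf.summary.merge_all()   # picks up the queue's fullness summary

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/queue_demo', sess.graph)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    for step in range(100):
        _, summ = sess.run([x_batch, summary_op])
        writer.add_summary(summ, step)

    coord.request_stop()
    coord.join(threads)
    writer.close()
```

Pointing TensorBoard at the log directory (`tensorboard --logdir /tmp/queue_demo`) and watching the `fraction_of_*_full` scalar shows whether the dequeue op is being starved; if it hovers near zero, more enqueue threads or a larger capacity are needed.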