tensorflow: Networks fed via input ops run considerably slower than directly fed ones
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
Environment info
Tried both OS X and Linux (Ubuntu 16), using CPU only on both.
If possible, provide a minimal reproducible example (We usually don’t have time to read hundreds of lines of your code)
Any simple operation (e.g. computing logits with a 3-layer network with a simple regression in each layer) in which the data was fed via parse_example, parse_single_example, a CustomRunner feeding a RandomShuffleQueue, or a QueueRunner. In all of those cases, one epoch took considerably longer than if I saved the data in a numpy array and fed it during the call to sess.run() with feed_dict.
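For reference, here is a minimal, hedged sketch of the two feeding styles being compared, written against the TF 1.x-era graph/session API (the original report used 0.10/0.11, where a few names differ slightly). The layer sizes, batch size, and synthetic data are made up for illustration and are not the reporter's actual code.

```python
# Sketch: time one pass over synthetic data with (1) feed_dict and (2) a
# queue-based pipeline built from slice_input_producer + tf.train.batch.
import time
import numpy as np
import tensorflow as tf

N, D, H, C = 10000, 100, 64, 10          # examples, input dim, hidden, classes
BATCH = 128
data = np.random.rand(N, D).astype(np.float32)

def dense(x, n_in, n_out):
    w = tf.Variable(tf.truncated_normal([n_in, n_out], stddev=0.1))
    b = tf.Variable(tf.zeros([n_out]))
    return tf.matmul(x, w) + b

def three_layer(x):
    # "3-layer network with a simple regression in each layer"
    h1 = tf.nn.relu(dense(x, D, H))
    h2 = tf.nn.relu(dense(h1, H, H))
    return dense(h2, H, C)

# Variant 1: direct feed through a placeholder.
x_ph = tf.placeholder(tf.float32, [None, D])
logits_fed = three_layer(x_ph)

# Variant 2: queue-based input pipeline.
x_single = tf.train.slice_input_producer([data], shuffle=True)[0]
x_batch = tf.train.batch([x_single], batch_size=BATCH, capacity=1024)[0]
logits_queued = three_layer(x_batch)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    start = time.time()
    for i in range(0, N, BATCH):
        sess.run(logits_fed, feed_dict={x_ph: data[i:i + BATCH]})
    print('feed_dict epoch: %.3fs' % (time.time() - start))

    start = time.time()
    for _ in range(N // BATCH):
        sess.run(logits_queued)
    print('queue epoch:     %.3fs' % (time.time() - start))

    coord.request_stop()
    coord.join(threads)
```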
What other attempted solutions have you tried?
Scoured the web for solutions and ran into a couple of blog posts describing the problem, but couldn't find any solutions.
Logs or other output that would be helpful
Tried this on both TF 0.10 and the latest RC for 0.11.
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 20 (10 by maintainers)
Ah, good point, increasing capacity, number of threads, and adding `time.sleep` in the main thread to preload the examples makes the `preloaded_reader` example as fast as the feed_dict one. https://github.com/yaroslavvb/stuff/commit/226099cd883df5aa5b08483581d66960ada90e22
Note that simply increasing capacity/number of threads didn't help. I suspect Python just ran the main queue until it ran out of examples, and went back to slow behavior with blocking. Possibly this is specific to tiny models like MNIST – when the main computation loop takes 1-2 ms, the GIL is only released for about 2 ms, and there's not enough time for the data-preloading threads to pre-empt the computation thread, so that thread will always be starved of examples.
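As a hedged illustration of that workaround (not the exact code from the linked commit), the sketch below gives `tf.train.batch` more enqueuing threads and a larger capacity, then sleeps briefly before the timed loop so the queue-runner threads can fill the queue first. The thread count, capacity, and sleep duration are illustrative guesses, not tuned values.

```python
# Sketch of the workaround: more enqueue threads, a bigger queue, and a short
# sleep so the queue is preloaded before timing starts (TF 1.x-era API).
import time
import numpy as np
import tensorflow as tf

data = np.random.rand(10000, 100).astype(np.float32)
x_single = tf.train.slice_input_producer([data], shuffle=True)[0]

x_batch = tf.train.batch(
    [x_single],
    batch_size=128,
    num_threads=8,       # several threads enqueuing examples
    capacity=10000)[0]   # large enough that the queue can stay full

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    time.sleep(5)        # let the queue runners preload examples

    start = time.time()
    for _ in range(10000 // 128):
        sess.run(x_batch)            # stand-in for the real training step
    print('queued epoch: %.3fs' % (time.time() - start))

    coord.request_stop()
    coord.join(threads)
```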
Also, it seems that there's no way to increase the number of threads in `slice_input_producer`.

`QueueDequeueMany` takes a long time when it's waiting for enough items to be put on the queue. It's not computation cycles that are slowing it down. You need multiple reader threads all feeding the queue in order to unblock it. If you're using `tf.train.batch` or its friends, in TensorBoard you'll see that the queue is never full (and it should be). Increase the number of threads you use; this is a simple argument in `tf.train.batch` and friends.
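A small sketch of that suggestion, assuming the TF 1.x-era API: pass `num_threads` to `tf.train.batch` and write out the summaries it registers (including a fraction-of-capacity-full scalar for its queue) so TensorBoard can show whether the queue actually stays full. The log directory, batch parameters, and step count below are placeholders.

```python
# Sketch: bump num_threads on tf.train.batch and log its queue-fullness
# summary so TensorBoard can confirm the queue stays (nearly) full.
import numpy as np
import tensorflow as tf

data = np.random.rand(10000, 100).astype(np.float32)
x_single = tf.train.slice_input_producer([data], shuffle=True)[0]

# num_threads is the "simple argument" referred to above.
x_batch = tf.train.batch([x_single], batch_size=128,
                         num_threads=4, capacity=2048)[0]

summary_op = tf.summary.merge_all()   # picks up the queue's fullness summary

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/queue_demo', sess.graph)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    for step in range(100):
        _, summ = sess.run([x_batch, summary_op])
        writer.add_summary(summ, step)

    coord.request_stop()
    coord.join(threads)
    writer.close()
```

Pointing TensorBoard at the log directory (`tensorboard --logdir /tmp/queue_demo`) and watching the `fraction_of_*_full` scalar shows whether the dequeue op is being starved; if it hovers near zero, more enqueue threads or a larger capacity are needed.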