tensorflow: FIFOQueue: dequeue many operation very slow?
When training a relatively simple model (1-layer LSTM, 256 units) my Titan X GPU keeps spiking from 0% to 30% GPU utilization. Conclusion: somewhere in the pipeline there is a bottleneck that prevents the GPU from processing training batches continuously. I use a FIFOQueue to which examples are fed from one or more separate threads:
queue = tf.FIFOQueue(
capacity=self.config.queue_capacity,
dtypes=[tf.float32, tf.float32],
shapes=[[30, 49, 512], [30]],
name="FIFOQueue"
)
For the training operation I use queue.dequeue_many to get examples from the queue. As you can see the batch size is 64 examples. So in the end the input tensor is 64x30x49x512 of type tf.float32:
# Model inputs, either use the queue (training) or feed_dict (evaluation)
inputs, targets = queue.dequeue_many(64)
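For context, the feeding threads in my setup do roughly the following (a minimal sketch, not my exact code; load_next_example and num_feeding_threads are placeholders for my actual loading logic):

import threading

# Placeholders for a single example; the queue above holds [30, 49, 512] / [30] pairs.
example_input = tf.placeholder(tf.float32, shape=[30, 49, 512])
example_target = tf.placeholder(tf.float32, shape=[30])
enqueue_op = queue.enqueue([example_input, example_target])

def feeding_thread(sess):
    while True:
        x, y = load_next_example()  # placeholder for the real data loading
        sess.run(enqueue_op, feed_dict={example_input: x, example_target: y})

for _ in range(num_feeding_threads):
    t = threading.Thread(target=feeding_thread, args=(sess,))
    t.daemon = True
    t.start()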
To find out why my code is running “slow” (i.e. spiking GPU utilization and no temperature increase) I use the Timeline object (see here) to measure the execution times of individual operations. The results displayed below show the measurements for one training iteration, at which point the queue was filled with more than 1000 examples. I have included screenshots for both GPU and CPU-only runs (the latter forced with export CUDA_VISIBLE_DEVICES=-1).
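The timelines were collected with the usual RunOptions/RunMetadata pattern, roughly like this (a minimal sketch; train_op stands in for my actual training op and the output file name is arbitrary):

from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace that can be inspected in chrome://tracing
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())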
What strikes me from these results is that it takes a really long time to dequeue examples from the FIFOQueue. What is happening here? Is something wrong, or is the dequeuing operation just very slow? Overall, the dequeuing operation and sending the data to the GPU take up half the time of a training iteration, so no wonder the GPU utilization is spiking. Any help optimizing my training pipeline is welcome! If I understand correctly, the examples are all queued in RAM; is there also a way to queue them ahead of time in GPU memory so they do not have to be moved CPU => GPU when they are needed?
This was tested on TensorFlow v0.9, built from source about 1.5 weeks ago.
GPU running on Titan X

CPU running on Xeon CPU E5-2640

About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 21 (13 by maintainers)
OK, I wrote a simple script (see below) in which the threads simply enqueue batches of np.ones([64, 30, 49, 512], dtype=np.float32) to the FIFOQueue. The main loop simply dequeues 64 examples and performs a simple tf.square of the input tensor. Before the main loop starts I make sure that the queue is filled with a sufficient number of examples. Again, the dequeue operation takes a “long” time to finish, as you can see in the timelines below.

GPU running on Titan X
CPU running on Xeon E5-2640
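A minimal sketch of what that script does (queue capacity, thread count, and iteration count below are my own placeholder values, not necessarily the ones I used):

import threading
import time

import numpy as np
import tensorflow as tf

# ~3 MB per element, so a queue of this capacity can hold roughly 3.5 GB in host RAM.
queue = tf.FIFOQueue(capacity=1200, dtypes=[tf.float32], shapes=[[30, 49, 512]])
batch_placeholder = tf.placeholder(tf.float32, shape=[64, 30, 49, 512])
enqueue_op = queue.enqueue_many([batch_placeholder])

inputs = queue.dequeue_many(64)
outputs = tf.square(inputs)

sess = tf.Session()
batch = np.ones([64, 30, 49, 512], dtype=np.float32)

def feeding_thread():
    while True:
        sess.run(enqueue_op, feed_dict={batch_placeholder: batch})

for _ in range(2):
    t = threading.Thread(target=feeding_thread)
    t.daemon = True
    t.start()

# Make sure the queue holds a sufficient number of examples before timing.
while sess.run(queue.size()) < 1000:
    time.sleep(0.1)

for i in range(20):
    start = time.time()
    sess.run(outputs)
    print('iteration %d: %.3f s' % (i, time.time() - start))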
I have an additional observation that might help us debug this issue. The slowdown for me happens when my cache memory gets full. I tried clearing the OS page cache (via drop_caches). The training then runs perfectly until the cache gets ~3/4 full again (as seen by the yellow bars in htop), and then it gets slow. So far my solution is to clear the cache (it requires sudo though) and the training gets fast again. Hoping there’s a better way to fix this!

I spent some time looking into what’s going on here…
As I mentioned earlier, your input batch of [64, 30, 49, 512] * tf.float32 equates to 192MB…
First, I established a baseline cost for memcpying a tensor of the size you are feeding. A native C++ program calling memcpy in a tight loop between buffers of this size takes around 22ms per iteration (single threaded). The TensorFlow enqueue/dequeue operations actually use the Eigen library to copy out slices of the tensors, and this is likely to be less efficient than a flat memcpy since it needs to handle general shapes of n-D arrays. Theoretically Eigen might be able to do this in parallel, but it does not appear to be doing so in this case (judging from the CPU usage).
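For anyone who wants to reproduce that baseline without writing C++, a rough NumPy analogue is below (my own sketch, not the original program; np.copyto between contiguous float32 arrays boils down to an essentially flat memcpy):

import time
import numpy as np

src = np.ones([64, 30, 49, 512], dtype=np.float32)  # ~192 MB
dst = np.empty_like(src)

iters = 50
start = time.time()
for _ in range(iters):
    np.copyto(dst, src)  # contiguous copy, essentially one flat memcpy
elapsed = (time.time() - start) / iters
print('%.1f ms per copy, %.1f GB/s' % (elapsed * 1e3, src.nbytes / elapsed / 1e9))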
Because you are feeding in numpy arrays, and retrieving a large result, you incur a copy in each direction within the session.Run() call, here and here for TensorFlow 0.9.
Note - this code path is slightly different in various versions of TensorFlow. Also, if you ever move to the distributed runtime these 192MB tensors will get serialized as ProtoBufs and this is very expensive.
Also, as @mmry says, some of the CPU time related to enqueue operations is most likely being accounted to the DequeueMany op due to this code: once the queue is full, enqueue ops are blocked and wait in a list. They can then get executed the next time a dequeue succeeds.

Note that a queue with a max size of 1000 elements is about 3GB, so depending on your machine config you may be causing a lot of virtual memory (and allocator) pressure. This isn’t happening on my machine, but you may want to check by running htop or vmstat while executing your program. A smaller queue may be more sensible.

In general, you may be better off using one of the TensorFlow input ops which read image data directly from the file system, and then doing any preprocessing as part of the graph (as opposed to preprocessing in Python and feeding in the raw tensor data).
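To make that last suggestion concrete, a file-based pipeline along these lines (a minimal sketch using the queue-runner API of that era; the file list, image size, and batch parameters are placeholders) keeps decoding and preprocessing inside the graph instead of feeding raw tensors from Python:

import tensorflow as tf

# Placeholder file list; substitute the real training data.
filenames = ['/data/train/img_%05d.jpg' % i for i in range(10000)]
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

reader = tf.WholeFileReader()
_, raw_record = reader.read(filename_queue)

# Decode and preprocess inside the graph.
image = tf.image.decode_jpeg(raw_record, channels=3)
image = tf.image.resize_image_with_crop_or_pad(image, 224, 224)
image = tf.cast(image, tf.float32) * (1.0 / 255.0)
image.set_shape([224, 224, 3])

# tf.train.batch runs its own enqueue threads and internal queue.
images = tf.train.batch([image], batch_size=64, num_threads=4, capacity=256)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    batch = sess.run(images)  # one batch, decoded and batched by TF threads
    coord.request_stop()
    coord.join(threads)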
For reference, I’ve attached a pprof profile of your code (with a few relatively insignificant changes). You can see that the bulk of the cycles are spent in TF_Run_wrapper_helper (doing memcpys of the numpy arrays), and under QueueBase::FlushUnlocked doing both the enqueue and dequeue copy ops via Eigen (one of which ends up turning into a memcpy).

EDIT: You will see that there also seems to be a decent amount of time spent in libcuda (for which we have no symbols). This is most likely due to the memory being used for host-to-device transfers not being “pinned”. This appears to cause the Nvidia driver to throw out the anchors and either copy the data to a DMA’able buffer or pin the relevant pages. The net result is that those transfers also take about the same time as a memcpy and consume a lot of CPU. TensorFlow has heuristics which attempt to allocate tensors in pinned memory when they need to be DMA’d; it may be the case that they are not working well here. @poxvoculi may know more?
Hope this helps … Paul
@rohitgirdhar I’m not 100% sure I want to add to a closed issue, commenting on something which may not even be related to the original problem, but…
From the htop output and your drop_caches observation it sounds to me like there may be some memory pressure caused by virtual address space fragmentation and high system buffer cache churn (reading large training datasets from the file system). Can you see if your program is causing lots of page faults? (e.g. like this)

Internally at Google we almost always use the TCMalloc memory allocator, which appears to be much better for TensorFlow workloads (and most others at Google!). When running the open source TensorFlow build via a standard Python binary, the only way to use TCMalloc would be to inject a different heap implementation into Python using LD_PRELOAD, e.g.

LD_PRELOAD="/usr/lib/libtcmalloc.so" python myprogram.py

It would be very useful if you could try running with TCMalloc and see if it makes any difference.
Is there any progress on improving Numpy -> GPU throughput with Tensorflow?
I have a similar benchmark to @tomrunia’s: passing an O(100MB) tensor from NumPy to the GPU and returning a small slice of the tensor. This would ideally be bottlenecked by PCI-e bandwidth, but it runs at ~1/10th of that rate. Additionally, I see very poor scaling when attempting to run this task over multiple GPUs (on a single node), while theoretically I should benefit from the additional PCI-e bandwidth. (All stats are from TensorFlow 0.9 built from source.)
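For reference, my benchmark is roughly of this shape (a minimal reconstruction, not the exact code; the tensor size and the trivial on-device op are placeholders):

import time
import numpy as np
import tensorflow as tf

data = np.random.rand(64, 30, 49, 512).astype(np.float32)  # ~192 MB

x = tf.placeholder(tf.float32, shape=data.shape)
with tf.device('/gpu:0'):
    # Trivial work on device, returning only a tiny slice so the
    # host-to-device transfer dominates the measurement.
    y = tf.slice(tf.square(x), [0, 0, 0, 0], [1, 1, 1, 8])

sess = tf.Session()
sess.run(y, feed_dict={x: data})  # warm-up

iters = 20
start = time.time()
for _ in range(iters):
    sess.run(y, feed_dict={x: data})
elapsed = (time.time() - start) / iters
print('%.1f ms per step, effective %.2f GB/s host->device'
      % (elapsed * 1e3, data.nbytes / elapsed / 1e9))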
@prb12’s description is very useful, but I’d like to understand more of what’s going on when passing through feed_dict, enqueuing, dequeuing. My understanding: When a Numpy tensor is passed through a feed_dict it is copied into a host side Eigen tensor. When a tensor is enqueued, it is added to a queue that lives on the host. When a tensor is dequeued on the GPU, a cudaMemcpy copies the tensor from host to device. This should run near PCI-e bandwidth.
Questions about my understanding: Is enqueueing a tensor done by reference or with a full copy (and memory allocation)? Are the memcpys on feed_dict passing (and on enqueue, if there’s a memcpy there) performed by a single thread for the full TensorFlow session, even if the Session.run calls come from multiple Python threads? Are there plans to fix this, such as a queue that is resident on the GPU?
I think TCMalloc was the solution for me, as I no longer need to clear the cache to maintain the same training speed. Thanks @prb12 !
@rohitgirdhar clearing the cache worked for me too, when I was training Inception v3.
@prb12 I tried TCMalloc on AlexNet with 1-4 Pascal-grade GPUs (Titan X, GP100). The queues are doing the threading for the CPU-side JPEG decoding, and since the GPUs have very high throughput, threading of the CPU-side code is critical here. TCMalloc does speed things up by 20% for 1-4 threads (inter/intra-op threads, threads associated with custom queues, etc.), but there’s a break-even point at 8 threads, and beyond that TCMalloc does more harm than good. At 20 threads (e.g. on a 20-core Intel Xeon E7-8870 v4), performance is actually 40% worse with TCMalloc than with vanilla malloc in such a heavily threaded environment. So I doubt that TCMalloc is a universally beneficial solution.