frugally-deep: Slow-ish run time on MSVC
Hi!
First of all, thank you for this great library!
I've got a fairly small model (18 layers) for real-time applications, mainly consisting of 5 blocks of Conv2D/ReLU/MaxPool2D, with an input size of 64x64x3. Unfortunately, I'm seeing some speed problems with fdeep.
A forward pass takes around 11ms in Keras, but around 60ms in fdeep. (I've measured by calling predict 100 times in a for-loop and averaging - a bit crude, but it should do the trick for this purpose.) I've compiled with the latest VS2017 15.5.5 in Release mode with the default compiler flags (/O2). If I enable AVX2 and intrinsics, it goes down to 50ms, but that's still way too slow. (I've also tried without im2col, but that's even slower, by more than 10x.)
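For reference, my crude measurement looks roughly like this (just a sketch - the exact tensor type names depend on the fdeep version, here I'm using the current fdeep::tensor / fdeep::tensor_shape API):

```cpp
// Minimal benchmark sketch: load the model once, then time 100 predict()
// calls and report the average. Assumes a 64x64x3 input like my model.
#include <fdeep/fdeep.hpp>
#include <chrono>
#include <iostream>

int main()
{
    const auto model = fdeep::load_model("model.json");

    // Dummy 64x64x3 input filled with zeros (stand-in for a real image).
    const fdeep::tensor input(fdeep::tensor_shape(64, 64, 3), 0.0f);

    const int runs = 100;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
    {
        const auto result = model.predict({input});
        (void)result; // output not used in the benchmark
    }
    const auto end = std::chrono::steady_clock::now();
    const auto total_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "avg forward pass: "
              << total_ms / static_cast<double>(runs) << " ms\n";
}
```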
I've run the VS profiler, but I'm not 100% sure I'm interpreting the results correctly. I think around 30% + 5% of the total time is spent in Eigen's gebp and gemm functions, where we probably can't do much. Except maybe: I think I've seen that you're using RowMajor storage for the Eigen matrices. Eigen is supposedly more optimised for its default, ColMajor storage. Would it be hard to change that in fdeep?
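Just to illustrate what I mean by the storage order (a general Eigen snippet, not fdeep's actual typedefs): the order is a template parameter of Eigen::Matrix, and ColMajor is the default:

```cpp
// General Eigen illustration: storage order is part of the matrix type.
#include <Eigen/Dense>

using ColMajorMat = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic>; // default
using RowMajorMat =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

static_assert(!ColMajorMat::IsRowMajor, "Eigen defaults to column-major");
static_assert(RowMajorMat::IsRowMajor, "explicit row-major storage");
```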
Another 30% seems to be spent in convolve_im2col, but I'm not 100% sure where exactly. I first thought it was the memcpy in eigen_mat_to_values, but eigen_mat_to_values itself contains only very few profiler samples.
There's also a lot of internal::transform and std::transform showing up in the profiler (internal::transform<ContainerOut>(reuse_t{}, f, std::forward<ContainerIn>(xs));), but I couldn't really figure out what code this actually executes.
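As far as I understand (I might be wrong), that internal::transform comes from FunctionalPlus and conceptually just maps a function over a container into a new one, roughly like this simplified sketch:

```cpp
// Conceptual equivalent only - not the actual fplus code: apply f to every
// element of xs and collect the results, like std::transform with a
// back_inserter.
#include <algorithm>
#include <iterator>
#include <type_traits>
#include <vector>

template <typename F, typename T>
auto transform_like_fplus(F f, const std::vector<T>& xs)
{
    using Out = std::decay_t<decltype(f(xs.front()))>;
    std::vector<Out> ys;
    ys.reserve(xs.size());
    std::transform(xs.begin(), xs.end(), std::back_inserter(ys), f);
    return ys;
}

// e.g. transform_like_fplus([](int x) { return x * 2; },
//                           std::vector<int>{1, 2, 3});
```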
I also saw that you seem to pre-instantiate some convolution functions for common kernel sizes. Most of my convolution kernels are 3x3, but it looks like you only instantiate n x m kernels for n and m equal to 1 and 2. Could it help to add 3x3 there?
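To show what I mean (an illustrative pattern only, not fdeep's actual code): dispatching common kernel sizes to fixed-size instantiations lets the compiler fully unroll the inner loops, and 3x3 could simply be added as another case:

```cpp
// Illustrative dispatch pattern with made-up function names.
#include <cstddef>
#include <iostream>

// Kernel size known at compile time: the KH/KW loops can be unrolled.
template <std::size_t KH, std::size_t KW>
void convolve_fixed(/* input, filters, output ... */)
{
    std::cout << "fixed " << KH << "x" << KW << " convolution\n";
}

// Generic fallback with runtime kernel size.
void convolve_generic(std::size_t kh, std::size_t kw)
{
    std::cout << "generic " << kh << "x" << kw << " convolution\n";
}

void convolve_dispatch(std::size_t kh, std::size_t kw)
{
    if (kh == 1 && kw == 1)      convolve_fixed<1, 1>();
    else if (kh == 2 && kw == 2) convolve_fixed<2, 2>();
    else if (kh == 3 && kw == 3) convolve_fixed<3, 3>(); // the suggested 3x3 case
    else                         convolve_generic(kh, kw);
}
```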
So yeah, I'm really not sure about all of this. If the majority of the time is indeed spent in Eigen's functions, then the RowMajor thing could be a major problem.
I'm happy to send you the model and an example input via email if you want to have a look.
Here are some screenshots of the profiler:

Thank you very much!
Wow, this is awesome!
Timings for my model with the very latest fdeep commit:
That is really an amazing improvement. It's obviously still not as fast as I'd like (29ms means "barely real-time", i.e. 30fps), but this is very hopeful and I think I can work with that for now. Also because the gcc timings are quite amazing:
With -fopenmp it goes to 11ms (but sometimes it spikes up to around 20-30ms - maybe because the OpenMP runtime is not always "hot"). Same timings with clang-6.0. I played around with the MSVC flags again, but none of them make it any faster (and /openmp makes it slower). And just for reference, it's the same speed with AVX (1) and AVX2, so the speed gain comes from SSE->AVX, and there is none from AVX->AVX2.
I also quickly tested cd90b82f8db385f32082bce457bb42c9795f82a2 (which doesn't have the tensor-no-copy improvement yet), and I think it's about 1ms slower, so that change does indeed seem to be a small improvement. This is probably too small to be measured correctly by my crude benchmark, but if it really is 1ms, that's still a 3% improvement, which I'd say is significant.
Btw, the other speed gains in your tables are also really good. You say "saves some time during prediction, even if it is not very much", but it's actually around 10-20%!
And regarding "interestingly the implementation has some wiggle room on the details of a convolution": I was actually quite surprised (and very happy!) that the prediction values for a test example were pretty much exactly the same for MatConvNet, Keras and fdeep. I think I checked up to 4 or 6 floating-point digits - even though Matlab probably uses quite different padding, convolutions, matrix multiplies, etc.
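The kind of element-wise check I mean looks roughly like this (just a sketch; the 1e-4 tolerance is my assumption for "about 4 digits"):

```cpp
// Compare fdeep's output against reference values (e.g. from Keras),
// element-wise, within an assumed absolute tolerance.
#include <cmath>
#include <cstddef>
#include <vector>

bool outputs_match(const std::vector<float>& a, const std::vector<float>& b,
                   float tol = 1e-4f) // roughly 4 decimal digits
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol)
            return false;
    return true;
}
```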
I think this is definitely awesome, thank you very much for these improvements!
Oh, but good news for you (hopefully I didn't make a mistake while measuring): the time for a forward pass with your model went down from 17 ms to 10 ms on my machine (-march=native). Could you please check if you see a similarly drastic effect with your setup?
Edit: Looking at your model architecture, it seems plausible though, since you are using quite small data tensors (height and width) but a huge number of filters, especially in the last layers. So your forward pass probably likes the im2col filter matrices being precalculated when the model is loaded.
Edit 2: Just double-checked: the performance gain for your model is real on my system. When I check out this version, 100 forward passes take 1715 ms; with that version it is 988 ms.
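To sketch what "precalculated when the model is loaded" means in practice (illustrative only, with made-up names, not the actual fdeep classes): the per-layer filter matrix is built once in the layer's constructor, so each prediction only has to build the input-dependent im2col matrix:

```cpp
// Sketch of load-time precomputation of the im2col-style filter matrix.
#include <Eigen/Dense>
#include <cstddef>
#include <vector>

using Mat = Eigen::MatrixXf;

struct conv_layer_sketch
{
    // Each filter arrives already flattened to kh*kw*in_channels values
    // (all filters assumed to have the same size). The big
    // (out_channels x kh*kw*in_channels) filter matrix is built once here,
    // i.e. at model-load time instead of on every forward pass.
    explicit conv_layer_sketch(const std::vector<std::vector<float>>& filters)
        : filter_mat_(static_cast<Eigen::Index>(filters.size()),
                      static_cast<Eigen::Index>(
                          filters.empty() ? 0 : filters.front().size()))
    {
        for (std::size_t i = 0; i < filters.size(); ++i)
        {
            filter_mat_.row(static_cast<Eigen::Index>(i)) =
                Eigen::Map<const Eigen::RowVectorXf>(
                    filters[i].data(),
                    static_cast<Eigen::Index>(filters[i].size()));
        }
    }

    // Per prediction only the input-dependent im2col matrix is built;
    // the filter matrix is reused as-is.
    Mat forward(const Mat& im2col_input) const
    {
        return filter_mat_ * im2col_input;
    }

    Mat filter_mat_;
};
```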
With this commit, the filter matrices are now pregenerated when loading a model. This saves some time during prediction, even if it is not very much (-O3 only, so no -march=native):