frugally-deep: Slow-ish run time on MSVC

Hi!

First of all thank you for this great library! 😃 I’ve got a fairly small model (18 layers) for real-time applications, consisting mainly of 5 blocks of Conv2D/ReLU/MaxPool2D, with an input size of 64x64x3. Unfortunately I’m seeing some speed problems with fdeep: a forward pass takes around 11ms in Keras, but around 60ms in fdeep. (I’ve measured by calling predict 100x in a for-loop and then averaging - a bit crude, but it should do the trick for this purpose.) I’ve compiled with the latest VS2017 15.5.5 in Release mode with default compiler flags (/O2). If I enable AVX2 and intrinsics, it goes down to 50ms, but that is still way too slow. (I’ve also tried without im2col, but that is even slower, by more than 10x.)
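For reference, the measurement loop is roughly the following (a minimal sketch rather than my exact benchmark code; `model.json` is a placeholder file name, and the `fdeep::tensor`/`fdeep::tensor_shape` type names follow the current fdeep API, so they may differ slightly for other releases):

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include <fdeep/fdeep.hpp>

int main()
{
    // "model.json" is a placeholder for the converted Keras model file.
    const auto model = fdeep::load_model("model.json");

    // Dummy 64x64x3 input filled with zeros (a real image in the actual test).
    const fdeep::tensor input(
        fdeep::tensor_shape(static_cast<std::size_t>(64), 64, 3),
        std::vector<float>(64 * 64 * 3, 0.0f));

    // One warm-up pass so one-time setup is not included in the average.
    model.predict({input});

    const int runs = 100;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i)
    {
        const auto result = model.predict({input});
        (void)result; // result is ignored; we only care about the timing
    }
    const auto stop = std::chrono::steady_clock::now();
    const double total_ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    std::cout << "average forward pass: " << total_ms / runs << " ms\n";
}
```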

I’ve run the VS profiler, but I’m not 100% sure I’m interpreting the results correctly. Around 30% + 5% of the total time seems to be spent in Eigen’s gebp and gemm functions, where we probably can’t do much. Except maybe this: I think I’ve seen that you’re using RowMajor storage for the Eigen matrices, and Eigen is supposedly more optimised for its default, ColMajor storage. Would it be hard to change that in fdeep? (See the small Eigen illustration below.)

Another 30% seems to be spent in convolve_im2col, but I’m not 100% sure where exactly. I first thought it was the memcpy in eigen_mat_to_values, but eigen_mat_to_values itself contains only very few profiler samples. There’s also a lot of internal::transform and std::transform showing up in the profiler (internal::transform<ContainerOut>(reuse_t{}, f, std::forward<ContainerIn>(xs));), but I couldn’t figure out which code this actually executes. I also saw that you pre-instantiate some convolution functions for common kernels. Most of my convolution kernels are 3x3, and it looks like you only instantiate n x m kernels for n and m equal to 1 and 2 - would adding 3x3 there help? So yeah, I’m really not sure about all of this. If the majority of the time is indeed spent in Eigen’s functions, then the RowMajor storage could indeed be a major problem.
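To illustrate the RowMajor point above (just a generic Eigen snippet showing what I mean, not fdeep’s actual code):

```cpp
#include <Eigen/Dense>

// Eigen's default storage order is column-major ...
using ColMajorMatrixXf = Eigen::MatrixXf; // Matrix<float, Dynamic, Dynamic>

// ... while row-major has to be requested explicitly via the Options parameter:
using RowMajorMatrixXf =
    Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

// Both orders can view an existing float buffer without copying:
void view_buffer(float* data, Eigen::Index rows, Eigen::Index cols)
{
    Eigen::Map<RowMajorMatrixXf> row_major_view(data, rows, cols);
    Eigen::Map<ColMajorMatrixXf> col_major_view(data, rows, cols);
    // The same buffer is interpreted with a different element layout in each
    // view, which is why switching the storage order also changes how the
    // data has to be laid out before the gemm call.
}
```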

I’m happy to send you the model and an example input via email if you wanted to have a look.

Here are some screenshots of the profiler: [three profiler screenshots attached]

Thank you very much!

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 32 (32 by maintainers)

Most upvoted comments

Wow, this is awesome!

Timings for my model with the very latest fdeep commit:

  • No AVX: 38ms on both VS2017 15.5.5 and 15.6.0 Preview 3 (that’s down from 60ms!)
  • With AVX2: 29ms on 15.5.5, 27ms on 15.6.0 Preview 3 (down from 50ms!)

That is really an amazing improvement. It’s obviously still not as fast as I’d like (29ms is only barely real-time: at 30fps the budget is about 33ms per frame), but this is very hopeful and I think I can work with it for now. 😃 Also because the gcc timings are quite amazing:

  • gcc-7: from 21ms down to 13ms. With -fopenmp it goes down to 11ms (but it sometimes spikes up to around 20-30ms - maybe because the OMP run-time is not always "hot"). Same timings with clang-6.0.

I played around with the MSVC flags again, but none of them make it any faster (and /openmp makes it slower). And just for reference, it’s the same speed with AVX and with AVX2, so the speed gain comes from SSE->AVX; there is none from AVX->AVX2.

I also quickly tested cd90b82f8db385f32082bce457bb42c9795f82a2 (which doesn’t have the tensor-no-copy improvement yet), and I think it’s about 1ms slower, so the no-copy change does indeed seem to be a small improvement. This is probably too small to be measured reliably by my crude benchmark, but let’s say it really is 1ms - that would still be roughly a 3% improvement, which I’d say is significant. 😃

Btw, the other speed gains in your tables are also really good. You say "saves some time during prediction, even if it is not very much", but it’s actually around 10-20%!

And regarding this: "interestingly the implementation has some wiggle room on the details of a convolution" - I was actually quite surprised (and very happy!) that the prediction values for a test example were pretty much exactly the same for MatConvNet, Keras and fdeep. I think I checked up to 4 or 6 decimal digits, even though Matlab probably uses quite different padding, convolutions, matrix multiplication, etc. 😃

I think this is definitely awesome, thank you very much for these improvements! 🎉 🎉 🎉

Oh, but good news for you (hopefully I didn’t make a mistake while measuring): the time for a forward pass with your model went down from 17 ms to 10 ms on my machine (-march=native). Could you please check whether you see a similarly drastic effect with your setup?

Edit: Looking at your model architecture, however, it seems plausible, since you are using quite small data tensors (in height and width) but a huge number of filters, especially in the last layers. So your forward pass probably benefits from the im2col filter matrices being precalculated when the model is loaded.

Edit 2: Just double-checked - the performance gain for your model is real on my system. When I check out this version, 100 forward passes take 1715 ms; with that version it is 988 ms (i.e., roughly 17 ms vs. 10 ms per pass).

With this commit the filter matrices are now pregenerated when loading a model. This saves some time during prediction, even if it is not very much (-O3 only, so no -march=native):

Model         before   after
InceptionV3   0.61 s   0.54 s
ResNet50      0.41 s   0.35 s
VGG16         1.44 s   1.37 s
VGG19         1.76 s   1.67 s
Xception      1.00 s   0.86 s
DenseNet201   0.57 s   0.49 s
NASNetLarge   3.64 s   3.12 s
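The idea behind pregenerating the filter matrices at load time is roughly the following (a minimal sketch of the concept only, not fdeep’s actual implementation; the struct and function names are made up):

```cpp
#include <Eigen/Dense>
#include <vector>

// Sketch: build the GEMM-friendly filter matrix once, when the layer is
// constructed (i.e. at "model load" time), instead of rebuilding it on
// every forward pass.
struct conv2d_layer_sketch
{
    // filters: one flattened (kernel_h * kernel_w * depth) kernel per
    // output channel; assumed non-empty and rectangular.
    explicit conv2d_layer_sketch(const std::vector<std::vector<float>>& filters)
    {
        const auto rows = static_cast<Eigen::Index>(filters.size());
        const auto cols = static_cast<Eigen::Index>(filters.front().size());
        filter_matrix_.resize(rows, cols);
        for (Eigen::Index r = 0; r < rows; ++r)
            for (Eigen::Index c = 0; c < cols; ++c)
                filter_matrix_(r, c) = filters[static_cast<std::size_t>(r)]
                                              [static_cast<std::size_t>(c)];
    }

    // forward() only has to build the im2col matrix of the *input*
    // (one column per output position, kernel_h * kernel_w * depth rows)
    // and multiply; the filter side is already in matrix form.
    Eigen::MatrixXf forward(const Eigen::MatrixXf& im2col_input) const
    {
        return filter_matrix_ * im2col_input;
    }

    Eigen::MatrixXf filter_matrix_; // precomputed once per layer
};
```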